On Common Ground? How Raters Perceive Scoring Criteria in Oral Proficiency Testing
3rd Annual Conference of EALTA
Krakow, Poland, 19-21 May 2006
Thomas Eckes
TestDaF Institute, Hagen, Germany
Overview
1. Rater Variability
2. Rater Type Hypothesis
2.1 Questionnaire Study
2.2 Raters and Criteria
3. Facets Analysis
4. Cluster Analysis
5. Summary and Discussion
Rater variability
In rater-mediated assessments, rater variability
contributes to variance in observed ratings that is associated with raters and not with examinees
obscures the construct being measured
threatens the validity and fairness of scores awarded to examinees
Rater variability comes in many forms, e.g.:
rater differences in severity or leniency
differences in the understanding and use of rating scale categories
differences in the kind of performance features raters attend to, or attach importance to
interactions between examinees, raters, and tasks
Approaches to Rater Variability in Oral Proficiency Testing
Many-facet Rasch measurement (e.g., Bonk & Ockey, 2003; Lumley & McNamara, 1995)
Generalizability theory (Bachman, Lynch, & Mason, 1995; Lee, 2006)
Other scaling approaches, e.g., INDSCAL (Chalhoub-Deville, 1995), grid technique/Thurstone scaling (Pollitt & Murray, 1996)
Discourse analysis (e.g., Brown, 2005; Meiron & Schick, 2000)
Verbal protocol analysis (Brown, Iwashita, & McNamara, 2005)
Many-facet Rasch measurement studies
Study                            Assessment                        N Raters   Sep. Rel.
Bachman, Lynch, & Mason (1995)   Spanish (LAAS)                    15         .92
Brown (2005)                     IELTS Speaking Module             6          .64
Eckes (2005)                     TestDaF Speaking Section          31         .98
Lumley & McNamara (1995)         OET Speaking Section              13         .89
Lynch & McNamara (1998)          Speaking Skills Module (access:)  4          1.00
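The "Sep. Rel." column reports rater separation reliability, i.e., how reliably the raters in each study can be told apart along the severity dimension. A minimal sketch of how such an index can be computed, assuming the usual Rasch definition (error-adjusted variance of the measures divided by their observed variance) and the population variance; the function name and the example logits and standard errors are hypothetical and are not taken from any of the cited studies.

```python
from statistics import pvariance


def separation_reliability(measures, standard_errors):
    """Return (observed variance - mean error variance) / observed variance.

    measures: estimated rater measures in logits.
    standard_errors: standard error of each measure, in the same order.
    Assumption: population variance is used; software may also report a
    sample-variance version.
    """
    if len(measures) != len(standard_errors) or len(measures) < 2:
        raise ValueError("need two matching sequences with at least two entries")
    observed_var = pvariance(measures)
    error_var = sum(se ** 2 for se in standard_errors) / len(standard_errors)
    if observed_var == 0:
        return 0.0
    # Variance left after removing measurement error, floored at zero.
    adjusted_var = max(observed_var - error_var, 0.0)
    return adjusted_var / observed_var


# Hypothetical severity measures and standard errors for five raters.
example_measures = [-1.2, -0.4, 0.0, 0.5, 1.1]
example_errors = [0.20, 0.18, 0.19, 0.21, 0.20]
example_reliability = separation_reliability(example_measures, example_errors)
```

Values near 1 indicate that raters differ reliably from one another; values near 0 indicate that the observed differences are mostly measurement error.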
Two recent qualitative studies
Brown (2005) – IELTS Speaking Module
Interviewers varied along a number of dimensions (e.g., ways to deploy topics, elicitation techniques, interactional style)
Raters varied in their views of how interviewers’ behavior influenced examinees’ performance (with an impact on ratings)
Brown, Iwashita, & McNamara (2005) – new TOEFL project speaking tasks
Raters attended to 4 general categories (i.e., linguistic resources, phonology, fluency, content)
Within each category, raters considered a range of specific performance features (e.g., linguistic resources: grammar, vocabulary, expression, textualization)
Some evidence for rater disagreement (e.g., reuse of input text, disfluency associated with repair)
Content as a major focus in EAP tasks
Conclusions From Prior Research
Rater variability is substantial, even among trained, experienced raters
Differences in the interpretation and use of scoring criteria are an important source of rater variability
Raters appear to have inbuilt perceptions of what is acceptable to them . . . . even the explicitness of the descriptors and the standardization that takes place in a training session cannot remove these differences.
A. Brown (1995, p. 13)
Rater types
Rater Type Hypothesis
Experienced raters fall into types (classes, clusters) that are characterized by distinctive patterns of criterion perception.
Rater types are organized into a rater taxonomy, that is, into a hierarchical classification system relating raters and criteria to one another.
Previous research in a writing proficiency context (Eckes, 2006) provided support for the hypothesis
Extensions in the present research:
Speaking proficiency
Three perceptual dimensions: (1) Importance, (2) Ease-of-application, (3) Confidence
Questionnaire Study
Background: The TestDaF
A large-scale and high-stakes test
Designed for foreign students applying for entry to an institution of higher education in Germany
Measures German language proficiency at an intermediate to high level
Examines the four language skills in separate sections
In each section, task and item content are closely related to the academic context
Background (contd): Speaking section
Performance-based test
SOPI format (indirect speaking test)
Multitrait scoring rubric
Tasks are presented orally from tape (or CD) and in print
Responses are recorded on a second tape (or CD)
Seven tasks at variable levels of difficulty (one warm-up task)
Range of common university situations (e.g., discussing with fellow students, describing a diagram during a tutorial, forming hypotheses)
Participants
Speaking section sample: 53 raters (15 men, 38 women)
Mean age: 47.7 years (SD = 9.6)
81% with 10 or more years as a DaF teacher
83% with 4 or more years as a DaF examiner
Rater monitoring covered a 3-year period, including 11 TestDaF scoring sessions
Raters generally manifested high degrees of scoring proficiency
No.  Criterion          Scale descriptor (top level)
1    Comprehensibility  can be understood phonetically without difficulty
2    Content            content is comprehensible in every respect
3    Vocabulary         vocabulary is wide
4    Correctness        errors seldom occur
5    Adequacy           choice of linguistic means is appropriate
6    Completeness       points are dealt with sufficiently
7    Description        important information is summarized logically
8    Discussion         advantages and disadvantages are discussed
9    Standpoint         own point of view is stated conclusively
Ratings of criteria along 3 scales: Importance, Ease-of-application, Confidence
4-point scales, e.g., importance scale:
1 - Less important
2 - Important
3 - Very important
4 - Extremely important
Facets analysis
Data analysis, Part I: Facets analysis
Research questions:
Degree of variability in raters’ perceptions
Degree of variability in perceived criteria
Functioning of the 4-point rating scales
Two facets: Raters (53), Criteria (9)
Program: FACETS (Version 3.59; Linacre, 2005)
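The slides do not state the measurement model. As a hedged sketch, a two-facet rating scale model of the kind commonly estimated with FACETS could be written as follows. All symbols and the sign convention are assumptions introduced here for illustration.

```latex
\ln\!\left(\frac{P_{jik}}{P_{ji(k-1)}}\right) = \alpha_j + \beta_i - \tau_k ,
\qquad k = 2, 3, 4
```

Here $P_{jik}$ is the probability that rater $j$ assigns category $k$ to criterion $i$, $\alpha_j$ is rater $j$'s overall tendency to give high ratings on the scale in question, $\beta_i$ is the location of criterion $i$ on that scale, and $\tau_k$ is the threshold between categories $k-1$ and $k$ of the 4-point scale.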
Variable Map – Importance
Overall model fit ok
Separation reliability: (a) Raters: .85; (b) Criteria: .94
Chi-square homogeneity statistic highly significant (p < .01)
[Figure: variable map with columns Logit, Rater, Criterion, and Scale. The rater column runs from "Attaching High Importance" (top) to "Attaching Low Importance" (bottom); the criterion column runs from "Highly Important" (top) to "Less Important" (bottom).]
Variable Map – Ease-of-application
Overall model fit ok
Separation reliability: (a) Raters: .79; (b) Criteria: .42
Chi-square homogeneity statistic highly significant (p < .01)
[Figure: variable map with columns Logit, Rater, Criterion, and Scale. The rater column runs from "Experiencing Much Ease" (top) to "Experiencing Less Ease" (bottom); the criterion column runs from "Highly Easy" (top) to "Not So Easy" (bottom).]
Variable Map – Confidence
Overall model fit ok
Separation reliability: (a) Raters: .81; (b) Criteria: .26
Chi-square homogeneity statistic highly significant (p < .01; only raters)
[Figure: variable map with columns Logit, Rater, Criterion, and Scale. The rater column runs from "Highly Confident" (top) to "Less Confident" (bottom); the criterion column runs from "Used With Much Confidence" (top) to "Used With Less Confidence" (bottom).]
Rater Fit – Importance
Fit category   Infit   Outfit
Fit okay       24      24
Misfit         11      10
Overfit        18      19
Note. Infit and Outfit are mean-square statistics. Misfit: Fit > 1.30. Overfit: Fit < 0.70. Number of raters = 53.
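The cut-offs in the note translate directly into a classification rule. A minimal sketch, assuming one mean-square value per rater; the function names and the example values are hypothetical and are not the study's data.

```python
from collections import Counter

FIT_CATEGORIES = ("fit okay", "misfit", "overfit")


def classify_fit(mean_square, lower=0.70, upper=1.30):
    """Classify one infit or outfit mean-square value using the note's cut-offs."""
    if mean_square > upper:
        return "misfit"   # more variation than the model expects
    if mean_square < lower:
        return "overfit"  # less variation than the model expects
    return "fit okay"


def tabulate_fit(mean_squares):
    """Count raters per fit category, in the order used in the table."""
    counts = Counter(classify_fit(value) for value in mean_squares)
    return {category: counts.get(category, 0) for category in FIT_CATEGORIES}


# Hypothetical infit mean squares for six raters.
example_infit = [0.55, 0.95, 1.10, 1.45, 0.70, 1.30]
example_counts = tabulate_fit(example_infit)
```

Values exactly at 0.70 or 1.30 count as "fit okay" here, because the note defines misfit and overfit with strict inequalities.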
Rater Fit – Ease-of-Application
Fit category   Infit   Outfit
Fit okay       22      21
Misfit         13      12
Overfit        18      20
Note. Infit and Outfit are mean-square statistics. Misfit: Fit > 1.30. Overfit: Fit < 0.70. Number of raters = 53.
Rater Fit – Confidence
Fit category   Infit   Outfit
Fit okay       20      21
Misfit         13      12
Overfit        20      20
Note. Infit and Outfit are mean-square statistics. Misfit: Fit > 1.30. Overfit: Fit < 0.70. Number of raters = 53.
Summary of Facets Analysis Findings
Raters showed significant differences in their perceptions of criteria
Raters showed substantial degrees of misfit/overfit, indicating rater heterogeneity in criterion perception (in line with the rater type hypothesis)
Criteria differed significantly along the importance and ease-of-application dimensions
The 4-point rating scales functioned effectively
Cluster analysis
Data analysis, Part II: Cluster analysis
Research questions:
Are there distinctive patterns of raters’ criterion perception along importance, ease-of-application, and confidence dimensions?
How many different patterns, or rater types, can be distinguished?
What are the rater-criterion interrelations for each of these types?
Clustering Method
Error-variance approach (Eckes & Orlik, 1993; see also Everitt, Landau, & Leese, 2001)
Main objective: Joint hierarchical classification of two different sets of elements (i.e., raters and criteria)
Two-mode clusters with minimum internal heterogeneity
Preprocessing of the input data (i.e., columnwise duplication and reflection) to cluster high-rated criteria separately from low-rated criteria
Additionally: Overlapping clustering solution
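The preprocessing step can be illustrated with a small sketch. It assumes that "reflection" means reversing each rating on the 4-point scale (x becomes 1 + 4 - x) and that the reflected columns are appended to the original ones, so that a low rating on a criterion shows up as a high value on its reflected ("-") counterpart. The exact procedure in Eckes & Orlik (1993) may differ; the function names and example ratings are hypothetical.

```python
def duplicate_and_reflect(ratings, scale_min=1, scale_max=4):
    """Append a reflected copy of every column to a raters-by-criteria matrix.

    ratings: list of rows, one row of integer ratings per rater.
    Returns a new matrix with twice as many columns: originals, then reflections.
    """
    augmented = []
    for row in ratings:
        for value in row:
            if not scale_min <= value <= scale_max:
                raise ValueError(f"rating {value} is outside the scale")
        reflected = [scale_min + scale_max - value for value in row]
        augmented.append(list(row) + reflected)
    return augmented


def augmented_labels(criteria):
    """Label reflected columns with a leading minus sign."""
    return list(criteria) + ["-" + name for name in criteria]


# Hypothetical importance ratings from two raters on three criteria.
example_criteria = ["content", "vocab.", "correct."]
example_ratings = [[4, 1, 3], [2, 2, 4]]
example_augmented = duplicate_and_reflect(example_ratings)
example_labels = augmented_labels(example_criteria)
```

In this sketch the first rater's rating of 1 on "vocab." becomes a 4 on "-vocab.", which is how a cluster of raters can end up associated with a minus-signed criterion.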
Number of clusters
Stepsize criterion (increase in the cluster heterogeneity index)
Cluster cohesion index
Point-biserial correlation between input data and cluster membership
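The point-biserial criterion can be sketched as follows. The sketch assumes that each cell of the (preprocessed) rater-by-criterion matrix is paired with a 0/1 indicator of whether that cell falls inside a cluster, and uses the standard point-biserial formula with the population standard deviation. How the study paired cells with cluster membership is not stated, so the pairing, the function name, and the example values are assumptions.

```python
from math import sqrt


def point_biserial(values, membership):
    """Point-biserial correlation between numeric values and a 0/1 indicator."""
    if len(values) != len(membership):
        raise ValueError("values and membership must have the same length")
    inside = [v for v, m in zip(values, membership) if m == 1]
    outside = [v for v, m in zip(values, membership) if m == 0]
    n1, n0 = len(inside), len(outside)
    n = n1 + n0
    if n != len(values) or n1 == 0 or n0 == 0:
        raise ValueError("membership must be 0/1 with both groups non-empty")
    mean_all = sum(values) / n
    sd = sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    if sd == 0:
        raise ValueError("values have zero variance")
    mean_inside = sum(inside) / n1
    mean_outside = sum(outside) / n0
    return (mean_inside - mean_outside) / sd * sqrt(n1 * n0 / n ** 2)


# Hypothetical cell ratings and whether each cell belongs to a cluster.
example_values = [4, 4, 3, 1, 2, 1]
example_membership = [1, 1, 1, 0, 0, 0]
example_r = point_biserial(example_values, example_membership)
```

A higher correlation indicates that cells inside clusters carry systematically higher ratings than cells outside, i.e., that the clustering solution reproduces the input data well.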
Two-Mode Clustering Solution – Importance
Overlapping solution added
Minus sign (-) indicates less important
Clusters: A, B, C, D, E, F
37, 32, 03, 41 correct.
18, 40 -compr.
47, 44, 39, 28, 27, 49, 20, 50 -compl. -descrip.
30, 22, 19, 11, 09, 05, 04, 02, 33 -discus.
24, 15, 14, 46, 31, 52, 23, 12, 07, 48 -standp. -vocab. -correct. -adequ.
02, 09, 31, 33, 49, 15, 23, 24, 27, 28, 44 -compl.
14, 15, 24, 31, 46, 07, 23 -adequ.
02, 11
06, 17, 24, 30, 33, 34, 49, 01, 04, 26, 29, 43, 45, 53 content
43, 36, 35, 25, 42, 26, 21, 13, 08, 01, 51, 17, 16, 10, 06, 38, 53, 45, 34, 29 descrip. compl. content vocab. standp. discus. adequ. compr.
02, 09, 15, 24, 31, 46, 07, 11
Rater Types – Importance
Cluster (N) by criterion: Comprehensibility, Content, Vocabulary, Correctness, Adequacy, Completeness, Description, Discussion, Standpoint
A (20): + + + + + + + +
B (18): + +
C (13):
D (16):
E (16):
F (12):
Note. + extremely important, - less important, else in between. N = number of raters per cluster (overlapping solution).
Two-Mode Clustering Solution – Ease-of-application
Overlapping solution added
Minus sign (-) indicates less easy
Clusters: A, B, C, D, E, F
27, 41, 40, 20, 10, 49 compl.
24, 14, 04, 37 -content
52, 22, 17, 34, 33 -standp. -adequ. -correct.
43, 39, 03, 18, 12, 07, 06, 05, 48, 26, 21, 01, 51, 36 -descrip. -vocab. -discus. -compl.
05, 16, 06
31, 30, 25, 19, 16, 32 -compr.
05, 06, 12, 26, 34, 36, 48
23, 53, 35, 11, 47, 50, 02, 45, 46, 42, 29, 13, 09, 08, 38, 28, 15, 44 descrip. correct. content vocab. standp. discus. adequ. compr.
05, 06, 16, 17, 26, 49
12, 14, 19, 25, 33, 35, 38, 42, 47
Rater Types – Ease-of-Application
Cluster (N) by criterion: Comprehensibility, Content, Vocabulary, Correctness, Adequacy, Completeness, Description, Discussion, Standpoint
A (18): + + + + + + + +
B (13):
C (15): +
D (10):
E (8):
F (14):
Note. + extremely easy, - not so easy, else in between. N = number of raters per cluster (overlapping solution).
Two-Mode Clustering Solution – Confidence
Overlapping solution added
Minus sign (-) indicates less confident
Clusters: A, B, C, D, E, F
39, 26, 34, 32 -correct. -vocab.
06, 37, 17, 01, 52, 48 -content -descrip. -compl.
19, 16, 05, 31 -compr.
48, 01, 06, 07, 12, 17, 18
34, 18, 20, 40 compl.
05, 16, 01, 31, 37
27, 22, 18, 10, 07, 46 -adequ.
24, 12, 42, 09, 45, 11, 03, 15, 44, 35, 41, 40, 53, 50, 02, 08, 23, 14, 49, 38, 30, 20, 13, 43, 28, 25, 47 descrip. content vocab. standp. discus. correct. compl. compr. adequ.
29, 51, 33, 21, 04, 36 -standp. -discus.
05, 16, 21, 33, 34
Rater Types – Confidence
Cluster (N) by criterion: Comprehensibility, Content, Vocabulary, Correctness, Adequacy, Completeness, Description, Discussion, Standpoint
A (27): + + + + + + + + +
B (11):
C (11):
D (6):
E (8): +
F (11):
Note. + extremely confident, - less confident, else in between. N = number of raters per cluster (overlapping solution).
Summary and discussion
(1) Rater differences in criterion perceptions
Experienced raters differed significantly in
the general importance attached to criteria
the overall ease of criterion application
the confidence in using each criterion adequately
(2) Distinctive rater types
Raters formed markedly different types (thus supporting the Rater Type Hypothesis)
Types were characterized by different subsets of criteria
This was particularly pronounced along the importance and ease-of-application dimensions
(3) Implications for rater training
Focus on empirically derived rater types, their strengths and weaknesses
Use of behavior-driven (or bottom-up) training procedures (Lievens, 2001) to balance raters’ attention more evenly on criteria
Rater monitoring needs to address the effects of type-based rater training on both operational rating behavior and self-reports
(4) Implications for research
Interrelations with other indicators of rater variability (e.g., severity/leniency, halo, rating scale use)
Influence of rater background variables (e.g., personal background, professional training, work experience) on interpretation and use of criteria
Combination with qualitative research strategies (e.g., verbal protocol analysis; Green, 1998)
Construction of type-specific rating process models
References
Bachman, L. F., Lynch, B. K., & Mason, M. (1995). Investigating variability in tasks and rater judgements in a performance test of foreign language speaking. Language Testing, 12, 238-257.
Bonk, W. J., & Ockey, G. J. (2003). A many-facet Rasch analysis of the second language group oral discussion task. Language Testing, 20, 89-110.
Brown, A. (1995). The effect of rater variables in the development of an occupation-specific language performance test. Language Testing, 12, 1-15.
Brown, A. (2005). Interviewer variability in oral proficiency interviews. Frankfurt/Main: Peter Lang.
Brown, A., Iwashita, N., & McNamara, T. (2005). An examination of rater orientations and test-taker performance on English-for-academic-purposes speaking tasks (TOEFL Monograph Series, MS-29). Princeton, NJ: Educational Testing Service.
Chalhoub-Deville, M. (1995). Deriving oral assessment scales across different tests and rater groups. Language Testing, 12, 17-33.
Eckes, T. (2005). Examining rater effects in TestDaF writing and speaking performance assessments: A many-facet Rasch analysis. Language Assessment Quarterly, 2, 197-221.
Eckes, T. (2006). Rater types in writing performance assessments: A classification approach to rater variability. Manuscript submitted.
Eckes, T., & Orlik, P. (1993). An error variance approach to two-mode hierarchical clustering. Journal of Classification, 10, 51-74.
Everitt, B. S., Landau, S., & Leese, M. (2001). Cluster analysis (4th ed.). London: Arnold.
Lee, Y.-W. (2006). Dependability of scores for a new ESL speaking assessment consisting of integrated and independent tasks. Language Testing, 23, 131-166.
Lievens, F. (2001). Assessor training strategies and their effects on accuracy, interrater reliability, and discriminant validity. Journal of Applied Psychology, 86, 255-264.
Linacre, J. M. (2005). A user’s guide to FACETS: Rasch-model computer programs [Software manual]. Chicago: Winsteps.com.
Lumley, T., & McNamara, T. F. (1995). Rater characteristics and rater bias: Implications for training. Language Testing, 12, 54-71.
Lynch, B. K., & McNamara, T. F. (1998). Using G-theory and many-facet Rasch measurement in the development of performance assessments of the ESL speaking skills of immigrants. Language Testing, 15, 158-180.
Meiron, B. E., & Schick, L. S. (2000). Ratings, raters and test performance: An exploratory study. In A. J. Kunnan (Ed.), Fairness and validation in language assessment (pp. 153-174). Cambridge: Cambridge University Press.
Pollitt, A., & Murray, N. L. (1996). What raters really pay attention to. In M. Milanovic & N. Saville (Eds.), Performance testing, cognition and assessment: Selected papers from the 15th Language Testing Research Colloquium (Studies in language testing, Vol. 3, pp. 74-91). Cambridge: University of Cambridge Local Examinations Syndicate.