Language Testing (Leeds)


  • Language Testing (Liu Jianda)

  • Syllabus

    It is expected that, by the end of this module, participants should be able to do the following:

    Understand the general considerations that must be addressed in the development of new tests or the selection of existing language tests;
    Make their own judgements and decisions about either selecting an existing language test or developing a new language test;
    Familiarise themselves with the fundamental issues, approaches, and methods used in measurement and evaluation;
    Design, develop, evaluate and use language tests in ways that are appropriate for a given purpose, context, and group of test takers;
    Understand the future development of language testing and the application of IT to computerized language testing.

  • Syllabus

    In order to achieve these objectives, the module gives participants the opportunity to develop the following skills:
    writing test items
    collecting test data and conducting item analysis
    evaluating language tests with regard to validity and reliability

    This is done by considering a wide range of issues and topics related to language testing. These include the following:
    General concepts in language testing and evaluation
    Evaluation of a language test: reliability and validity
    Communicative approach to language testing
    Design of a language test
    Item writing and item analysis
    Interpreting test results
    Item response theory and its applications
    Computerized language testing and its future development

  • Class Schedule

    1   Basic concepts in language testing
    2   Test validation: reliability and validity (1)
    3   Test validation: reliability and validity (2)
    4   Test construction (1)
    5   Test construction (2)
    6   Test construction (3)
    7   Test construction (4)
    8   Test construction (5)
    9   Test construction (6)
    10  Rasch analysis (1)
    11  Rasch analysis (2)
    12  Language testing and modern technology

  • Assessment

    One 5,000-6,000-word paper on language testing.

    Collaborative work: you will be divided into groups of four to complete the development of a test paper. Each of you will be responsible for one part of the test paper, but each part should contribute equally to the whole. Therefore, besides developing your own part, you need to come together to discuss the whole test paper in terms of reliability and validity.

  • Course books

    Bachman, L. F. & Palmer, A. (1996). Language Testing in Practice. Oxford: Oxford University Press.
    Brown, J. D. (1996). Testing in Language Programs. Upper Saddle River, NJ: Prentice Hall Regents.
    Li, X. (1997). The Science and Art of Language Testing. Changsha: Hunan Educational Press.
    McNamara, T. (1996). Measuring Second Language Performance. London; New York: Longman.

    Website: http://www.clal.org.cn/personal/testing/Leeds

  • Session 1: Basic concepts in language testing

  • 1. A short history of language testing

    Spolsky (1978) classified the development of language testing into three periods, or trends:
    the prescientific period
    the psychometric/structuralist period
    the integrative/sociolinguistic period

  • The prescientific period

    grammar-translation approaches to language teaching
    translation and free composition tests
    difficult to score objectively
    no statistical techniques applied to validate the tests
    simple, but unfair to students

  • The psychometric-structuralist period

    audio-lingual and related teaching methods
    objectivity, reliability, and validity of tests considered
    measurement of discrete structure points
    multiple-choice format (standardized tests)
    follows scientific principles; trained linguists and language testers involved

  • The integrative-sociolinguistic period

    Communicative competence

    Chomsky's (1965) distinction of competence and performance.
    Competence: an ideal speaker-listener's knowledge of the rules of the language; performance: the actual use of language in concrete situations.

    Hymes's (1972) proposal of communicative competence: the ability of native speakers to use their language in ways that are not only linguistically accurate but also socially appropriate.

    Canale & Swain's (1980) framework of communicative competence:
    Grammatical competence: mastery of the language code, such as morphology, lexis, syntax, semantics, phonology;
    Sociolinguistic competence: mastery of appropriate language use in different sociolinguistic contexts;
    Discourse competence: mastery of how to achieve coherence and cohesion in spoken and written communication;
    Strategic competence: mastery of communication strategies used to compensate for breakdowns in communication and to enhance the effectiveness of communication.

  • The integrative-sociolinguistic period

    Bachman's (1990) framework of communicative language ability:

    Language competence (cf. grammatical, sociolinguistic, and discourse competence in Canale & Swain):
    organizational competence: grammatical competence, textual competence
    pragmatic competence: illocutionary competence, sociolinguistic competence

    Strategic competence: performs assessment, planning, and execution functions in determining the most effective means of achieving a communicative goal.

    Psychophysiological mechanisms: characterize the channel (auditory, visual) and mode (receptive, productive).

  • The integrative-sociolinguistic period

    Oller's (1979) pragmatic proficiency test:
    temporally and sequentially consistent with real-world occurrences of language forms
    linked to a meaningful extralinguistic context familiar to the testees

    Clark's (1978) direct assessment: approximating the testing context to the real world to the greatest extent possible.
    Cloze test and dictation (Yang, 2002b).
    Communicative testing, or testing communicatively.

  • The integrative-sociolinguistic period

    Performance tests (Brown, Hudson, Norris, & Bonk, 2002; Norris, 1998):
    not discrete-point in nature
    integrating two or more of the language skills of listening, speaking, reading, and writing, and other aspects such as cohesion and coherence, suprasegmentals, paralinguistics, kinesics, pragmatics, and culture
    task-based: essays, interviews, extensive reading tasks

  • Performance tests

    Three characteristics. The task should:
    be based on needs analysis (What criteria should be used? What content and context? How should experts be used?)
    be as authentic as possible, with the goal of measuring real-world activities
    sometimes have collaborative elements that stimulate communicative interactions
    be contextualized and complex
    integrate skills with content
    be appropriate in terms of number, timing, and frequency of assessment
    be generally non-intrusive, that is, aligned with the daily actions in the language classroom

  • Performance tests

    Raters should be appropriate in terms of:
    number of raters
    overall expertise
    familiarity and training in use of the scale

    The rating scale should be based on appropriate:
    categories of language learning and development
    breadth of information regarding learner performance abilities
    standards that are both authentic and clear to students

    To enhance the reliability and validity of decisions as well as accountability, performance assessments should be combined with other methods for gathering information (e.g. self-assessments, portfolios, conferences, classroom behaviors, and so forth).

  • Development graph (Li, 1997: 5)

  • 2. Theoretical issues

    Language testing is concerned with both content and methodology.

  • Development since 1990

    Communicative language testing (Weir, 1990)
    Reliability and validity
    Social functions of language testing

  • Ethical language testing

    Washback (impact) (Qi, 2002; Wall, 1997)
    impact: effects of tests on individuals, policies or practices within the classroom, the school, the educational system or society as a whole
    washback: effects of tests on language teaching and learning

    Ways of investigating washback:
    analyses of test results
    teachers' and students' accounts of what takes place in the classroom (questionnaires and interviews)
    classroom observation

    Ethics of test use
    use with care (Spolsky, 1981: 20)
    codes of practice

    Professionalization of the field
    training of professionals
    development of standards of practice and mechanisms for their implementation and enforcement

    Critical language testing
    situating language testing in society

  • Factors affecting performance of examinees

  • Development since 1990

    Testing interlanguage pragmatic knowledge
    currently at the research level
    focus on method validation
    web-based test by Roever

    Computerized language testing
    item banking
    computer-assisted language testing
    computerized adaptive language testing:
    test items adapted for individuals
    test ends when the examinee's ability is determined
    test time much shorter
    web-based testing
    PhonePass testing
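    The adaptive idea can be sketched in a few lines of Python: administer the unused item whose difficulty is closest to the current ability estimate, then adjust the estimate after each response. The item bank, responses, and fixed-step update below are illustrative only; operational adaptive tests use maximum-likelihood or Bayesian ability estimation and a precision-based stopping rule.

    ```python
    def next_item(theta, remaining):
        """Pick the unused item whose difficulty is closest to the current
        ability estimate: such items are the most informative."""
        return min(remaining, key=lambda b: abs(b - theta))

    def update(theta, answered_correctly, step=0.5):
        """Crude ability update: move the estimate up after a correct
        answer, down after an incorrect one."""
        return theta + step if answered_correctly else theta - step

    bank = [-2.0, -1.0, 0.0, 1.0, 2.0]   # hypothetical item difficulties (logits)
    theta = 0.0                          # start at average ability
    for answered_correctly in [True, True, False]:  # hypothetical responses
        item = next_item(theta, bank)
        bank.remove(item)
        theta = update(theta, answered_correctly)
    print(theta)  # 0.5
    ```

    Because each item is chosen near the examinee's provisional ability, far fewer items are needed than in a fixed-form test, which is why the test time is much shorter.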

  • Development since 1990

    Language testing and second language acquisition (Bachman & Cohen, 1998)
    helps to define the construct of language ability
    uses findings of language testing to test hypotheses in SLA
    provides SLA researchers with testing instruments and standards of testing

  • Development of research methodology

    Factor analysis
    The main applications of factor analytic techniques are (1) to reduce the number of variables and (2) to detect structure in the relationships between variables, that is, to classify variables. Therefore, factor analysis is applied as a data reduction or structure detection method.
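    The data-reduction idea can be made concrete with a deliberately tiny case (the two-subtest setup and the r = 0.8 figure are hypothetical, not from the slides): for two variables, the eigenvalues of the correlation matrix show how much of the total variance a single common factor could capture.

    ```python
    def eigenvalues_2x2_corr(r):
        """Eigenvalues of the 2x2 correlation matrix [[1, r], [r, 1]],
        largest first. With two variables, a single common factor can
        explain at most (1 + |r|) / 2 of the total variance."""
        return 1 + abs(r), 1 - abs(r)

    # Two subtests correlating at r = 0.8: one factor captures 90% of the variance.
    lam1, lam2 = eigenvalues_2x2_corr(0.8)
    print(round(lam1 / (lam1 + lam2), 3))  # 0.9
    ```

    With realistic test batteries the same logic applies to larger correlation matrices: a few large eigenvalues signal that a few factors summarize many observed variables.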

  • Generalizability theory (Bachman, 1997; Bachman, Lynch, & Mason, 1995)

    Estimating the relative effects of different factors (facets) on test scores.
    The most generalizable indicator of an individual's language ability is the universe score. However, in the real world we can only obtain scores from a limited sample of measures, so we need to estimate the dependability of a given observed score as an estimate of the universe score.

  • Two stages are involved in applying G-theory to test development

    G-study
    The purpose is to estimate the effects of the various facets in the measurement procedure (usually conducted in pretesting), e.g. persons (differences in individuals' speaking ability), raters (differences in severity among raters), tasks (differences in difficulty of tasks).

    Two-way interactions:
    task x rater: different raters are rating the different tasks differently
    person x task: some tasks are differentially difficult for different groups of test takers (a source of bias)
    person x rater: some raters score the performance of different groups of test takers differently (an indication of rater bias)

  • Two stages are involved in applying G-theory to test development

    D-study
    The purpose is to design an optimal measure for the interpretations or decisions that are to be made on the basis of the test scores (estimation of dependability).
    The generalizability coefficient (G coefficient) provides an estimate of the proportion of an individual's observed score that can be attributed to his or her universe score, taking into consideration the effects of the different conditions of measurement specified in the universe of generalization. It is appropriate for norm-referenced tests.
    For criterion-referenced tests, use the phi coefficient. (GENOVA)
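    For the simplest one-facet design (persons crossed with raters), both coefficients follow directly from the G-study variance components; the variance components and the two-rater figure below are hypothetical, and richer designs add more error terms.

    ```python
    def g_coefficient(var_persons, var_interaction, n_raters):
        """G (generalizability) coefficient for relative, norm-referenced
        decisions in a one-facet persons-by-raters design: person variance
        over person variance plus relative error."""
        return var_persons / (var_persons + var_interaction / n_raters)

    def phi_coefficient(var_persons, var_raters, var_interaction, n_raters):
        """Phi (dependability) coefficient for absolute, criterion-referenced
        decisions: rater main-effect variance also counts as error."""
        return var_persons / (var_persons + (var_raters + var_interaction) / n_raters)

    # Hypothetical G-study variance components, averaging over 2 raters:
    print(round(g_coefficient(0.50, 0.25, 2), 3))          # 0.8
    print(round(phi_coefficient(0.50, 0.10, 0.25, 2), 3))  # 0.741
    ```

    Note that phi is never larger than the G coefficient, because absolute decisions treat rater severity differences as error while relative decisions do not.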

  • Item response theory (Rasch model)

    It enables us to estimate the statistical properties of items and the abilities of test takers so that these are not dependent upon a particular group of test takers or a particular form of a test. It is widely used in large-scale standardized tests.
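    In the one-parameter Rasch model, the probability of a correct response depends only on the difference between person ability (theta) and item difficulty (b), both on the same logit scale; a minimal sketch:

    ```python
    import math

    def rasch_prob(theta, b):
        """Probability of a correct response for a person of ability theta
        on an item of difficulty b, under the one-parameter Rasch model:
        P = 1 / (1 + exp(-(theta - b)))."""
        return 1.0 / (1.0 + math.exp(-(theta - b)))

    # When ability equals difficulty, the probability is exactly 0.5:
    print(rasch_prob(0.0, 0.0))              # 0.5
    # Ability one logit above the item raises the probability to about 0.73:
    print(round(rasch_prob(1.0, 0.0), 3))    # 0.731
    ```

    Because only the difference theta - b matters, item difficulties estimated from one sample can be applied to other samples, which is the sample-independence property mentioned above.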

  • Structural equation modeling (Antony John Kunnan, 1998)

    A combination of multiple regression, path analysis and factor analysis.
    It attempts to explain a correlation or covariance matrix derived from a set of observed variables; latent variables are responsible for the covariance among the measured variables.

  • Basic procedures in SEM (example from Purpura, 1998)

    Examine the relationships between strategy use and second language test performance:
    design two questionnaires for cognitive strategies and metacognitive strategies (40 items)
    ask respondents to answer the questionnaires
    have respondents take a foreign language test
    cluster the 40 items to measure several variables
    compute the reliability of the variables
    conduct factor analysis to identify factors
    conduct SEM analysis (AMOS, EQS, LISREL)
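    The "compute the reliability of the variables" step is commonly done with Cronbach's alpha (the slides do not name the exact statistic Purpura used, so treat this as an illustrative choice); a small sketch on hypothetical questionnaire data:

    ```python
    from statistics import variance

    def cronbach_alpha(scores):
        """Cronbach's alpha for rows = persons, columns = items:
        alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
        k = len(scores[0])
        item_vars = [variance([row[i] for row in scores]) for i in range(k)]
        total_var = variance([sum(row) for row in scores])
        return k / (k - 1) * (1 - sum(item_vars) / total_var)

    # Hypothetical Likert-scale responses of four test takers to three items:
    data = [[4, 5, 4], [3, 3, 2], [5, 4, 5], [2, 2, 1]]
    print(round(cronbach_alpha(data), 3))  # 0.944
    ```

    A high alpha suggests the clustered items measure the same underlying variable, which is what the subsequent factor analysis and SEM steps assume.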

  • Qualitative methods

    Verbal report (think-aloud, introspective)
    Observation
    Questionnaires and interviews
    Discourse analysis

  • 3. Classification of language tests

    According to families:
    Norm-referenced tests
    Criterion-referenced tests

  • Norm-referenced tests

    Measure global language abilities (e.g. listening, reading, speaking, writing)
    A score on the test is interpreted relative to the scores of all other students who took the test
    Normal distribution of scores

  • Normal distribution

    http://stat-www.berkeley.edu/~stark/Java/NormHiLite.htm
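    The relative interpretation of norm-referenced scores can be illustrated with Python's standard library; the norming distribution below (mean 500, standard deviation 100) is a hypothetical example, not taken from any test mentioned here.

    ```python
    from statistics import NormalDist

    # Hypothetical norming distribution for a standardized test:
    norm = NormalDist(mu=500, sigma=100)

    # A candidate scoring 650 (z = 1.5) outperforms about 93% of the group:
    print(round(norm.cdf(650), 3))       # 0.933
    # The score needed to reach the 90th percentile:
    print(round(norm.inv_cdf(0.90), 1))  # 628.2
    ```

    This is exactly the sense in which a norm-referenced score has no absolute meaning: 650 is only "good" relative to where it falls in the norming distribution.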

  • Norm-referenced tests

    Students know the format of the test but do not know what specific content or skills will be tested
    A few relatively long subtests with a variety of question contents

  • Criterion-referenced tests

    Measure well-defined and fairly specific objectives
    Interpretation of scores is absolute, without reference to other students' scores
    The distribution of scores need not be normal
    Students know in advance what types of questions, tasks, and content to expect on the test
    A series of short, well-defined subtests with similar question contents

  • According to decision purposes

    Proficiency tests
    Placement tests
    Achievement tests
    Diagnostic tests

  • Proficiency tests

    Test students' general levels of language proficiency
    The test must provide scores that form a wide distribution so that interpretations of the differences among students will be as fair as possible
    Can dramatically affect students' lives, so slipshod decision making in this area would be particularly unprofessional

  • Placement tests

    Group students of similar ability levels (homogeneous ability levels)
    Help decide what each student's appropriate level will be within a specific program
    Right tests for right purposes

  • Achievement tests

    Concern the amount of learning that students have done
    The decision may involve who will be advanced to the next level of study or which students should graduate
    Must be designed with specific reference to a particular course
    Criterion-referenced, conducted at the end of the program
    Used to make decisions about students' levels of learning; meanwhile, can be used to effect curriculum changes and to test those changes continually against program realities

  • Diagnostic tests

    Aimed at fostering achievement by promoting strengths and eliminating the weaknesses of individual students
    Require more detailed information about the very specific areas in which students have strengths and weaknesses
    Criterion-referenced, conducted at the beginning or in the middle of a language course
    A test can be diagnostic at the beginning or in the middle of a course but an achievement test at the end
    Perhaps the most effective use of a diagnostic test is to report the performance level on each objective (as a percentage) to each student so that he or she can decide how and where to invest time and energy most profitably

  • Formative assessment vs. summative assessment

    Formative: a judgment of an ongoing program used to provide information for program review, identification of the effectiveness of the instructional process, and assessment of the teaching process
    Summative: a terminal evaluation employed in the general assessment of the degree to which the larger outcomes have been attained over a substantial part of, or all of, a course. It is used to determine whether or not the learner has achieved the ultimate objectives for instruction which were set up in advance of the instruction.

  • Public examinations vs. classroom tests

    Purpose: proficiency vs. achievement (placement, diagnostic)
    Format: standardized vs. open (objective vs. subjective)
    Scale: large-scale vs. small-scale (self-assessment)
    Scores: normality, backwash