Transcript of Thom Kiddle & Eaquals members, Assessing Oral Proficiency

Speaking test formats and task types
Anthea Wilson, Head of Test Production, Trinity College London
Belinda Steinhuber, Head of Language Education Department, CEBS, Austria

Eaquals members' meeting, Florence 2016

©Eaquals 06/08/2014
Agenda
I. Speaking test formats and task types
II. Construction and validation of criteria
III. Standardisation and monitoring practices
1. Speaking test formats and task types
Beyond the examiner-led interview:
• What formats can we use to assess speaking?
• What demands do different task types place on candidates?
• What are the implications for reliable assessment?
Why use other formats?
• focus on communicative competence
• make use of more authentic tasks and situations
• include a greater variety of communicative functions
• widen the scope of task types
• action-oriented approach
Activity
Watch the video and complete the table for the three task types:
• Trinity ISE II Collaborative Task (B2)
• CEBS Plurilingual Task (English B2 / French B1)
• Group discussion task (B1)
ISE II Collaborative task

For the next part, I'll tell you something. Then, you have to ask me questions to find out more information and make comments. You need to keep the conversation going. After four minutes, I'll end the conversation. Are you ready?
My nephew’s school has just announced that all the students might have to learn three foreign languages. I’m not sure this is a good idea.
Plurilingual Task English and French

PARTICIPANTS
• 1 Examiner for English
• 1 Examiner for the second foreign language (e.g. French)
• 1 Candidate, in interaction with both examiners

TIME FRAME
• Preparation: minimum 30 min.
• Exam: 12–15 min.
  • Interaction: 8–10 min.
  • Individual Long Turn: 4–5 min.
Rubric in German; input mostly in German:
• Situation
• Task: Long Turn
• Task: Interaction
Plurilingual Task English and French
TOPIC: Health and Nutrition
Situation
Your school is particularly involved in various activities encouraging a healthy lifestyle. Your class has organized a meeting with students and teachers from other countries who are also interested in implementing projects in this field.
Plurilingual Task English and French
Interaction
Following the presentation you carry on a conversation with the visiting teachers in which you discuss the possibility of working together on interscholastic projects.
• Present examples of activities or projects at your school which promote a healthy and active lifestyle (input 2).
• Inquire about similar activities at the schools of your foreign visitors.
• Discuss the possibilities of a joint project.
Development of marking criteria
Tim Goodier, Head of Academic Development, Eurocentres
www.eaquals.org
• Introduction to Eurocentres' 'RADIO' task-oriented assessment
• Interpreting CEFR Table 3 and other relevant sources to form profile categories and maximise pragmatic validity
• Practical considerations for scaled criteria and issues informing the update for EAP
• A sample from Eurocentres standardisation materials & criteria for spoken assessment
‘RADIO’ Task orientation
How RADIO fits

A continuum, not categories with fixed boundaries:
• Teacher-centred: focus on forms; Present, Practice, Produce (PPP)
• Fluency-centred: planned focus on form; 'free practice', role plays, communicative drills, grammar games
• Meaning-centred: focus on task, incidental focus on form; case studies, decision tasks, consensus tasks, simulations
• Natural approach

RADIO = a task-oriented approach for fluency & assessment.

R.A.D.I.O. = R: Range, A: Accuracy, D: Delivery, I: Interaction, O: Organisation
R.A.D.I.O. – group task rationale

R.A.D.I.O. group tasks follow three distinct stages:
Phase 1: Collaboration. Students work in small groups (2–4) to organise the task, reach a consensus/conclusion and prepare their report. (planning)
Phase 2: Exchange. Groups are remixed in order to report their findings/conclusions. (report)
Phase 3: Discussion. Groups discuss either (a) the best solution or (b) discussion questions related to the task topic. (discussion)
Distilling a workable profiling scheme (R, A, D, I + O):
• Impression (holistic/global)
• Analysis (R, A, D & I)
• Considered judgement
Table 3 of the CEFR (plus the Phonology scale):
Range – Accuracy – Fluency – Interaction – Coherence – Pronunciation

RADIO categories:
Range – Accuracy – Delivery – Interaction – Overall

R + A + D: Overall Spoken Production
R + I: Overall Spoken Interaction
Certificate Profile (SP & SI)
Assessor descriptors at 10 levels (including CEFR plus levels)
[Illustration: assessor descriptors spanning A2+, B1 (CEFR Table 3) and B1+]
Key considerations for ongoing update and revision
1. Draw on validated sources, and colour-code a 'master' for future reference
2. Use bulleted clusters rather than boxed paragraphs

e.g. blue = CEFR, purple = IELTS public descriptors, black = original RADIO, green = EAQUALS, bold = paraphrased from the source.
(Accuracy)
Maintains a high degree of grammatical accuracy. Error-free sentences are frequent. Some inappropriate word choice and occasional minor slips but few significant errors. Uses paraphrase effectively.

(Delivery)
Speaks confidently and spontaneously in clear, smoothly-flowing speech. Descriptions and arguments are easy to follow. Can vary intonation and place sentence stress appropriately. Speech is clear and intelligible throughout.
Adaptation to include presentation task types for EAP
Range – Accuracy – Delivery – Interaction – Organisation

R + A + D: Overall Spoken Production
R + I or R + O: Overall Spoken Interaction
Certificate Profile (SP & SI)
The ‘Interaction’ and ‘Organisation’ columns both contain the SAME descriptors for argumentation (B1+ to C2).
Structuring planned speaking to achieve a communicative objective with an audience
RADIO Grades
• Based on CEFR Table 2, distinguishing between spoken interaction and spoken production
• In R.A.D.I.O.:
  • Spoken Interaction = an average of range and interaction
  • Spoken Production = an average of range, accuracy and delivery
• Half grades possible, but only full grades on the certificate profile
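The averaging rule just described can be sketched in a few lines of Python. The slides do not specify which direction half grades round when producing the full certificate grades, so rounding halves upward is an assumption here, as are the illustrative input grades.

```python
import math

def to_half(x: float) -> float:
    # Round to the nearest half grade; exact quarters round up (assumption).
    return math.floor(x * 2 + 0.5) / 2

def radio_profile(range_, accuracy, delivery, interaction):
    """Aggregate R.A.D.I.O. criterion grades as described in the slides."""
    spoken_interaction = to_half((range_ + interaction) / 2)
    spoken_production = to_half((range_ + accuracy + delivery) / 3)
    return {
        "spoken_interaction": spoken_interaction,
        "spoken_production": spoken_production,
        # Only full grades appear on the certificate profile;
        # rounding half grades up is an assumption, not stated in the slides.
        "certificate": {
            "spoken_interaction": math.floor(spoken_interaction + 0.5),
            "spoken_production": math.floor(spoken_production + 0.5),
        },
    }

profile = radio_profile(range_=7, accuracy=6, delivery=6.5, interaction=8)
```

With these invented grades the assessor record would show Spoken Interaction 7.5 and Spoken Production 6.5, while the certificate profile carries only the rounded full grades.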
R.A.D.I.O. – Grading a spoken sample

We will now listen to a speaking sample. Then, look at the mid-to-high-level descriptors (5–9) and think about what score you might give each speaker.

Rainer (left), Marco (centre) and Andreas (right) will talk about whether sport is bad for relationships and marriage.

First think about who you think is lower/higher in level.
Rainer
A relaxed communicator. Can initiate discourse and take his turn when appropriate. Can link his utterances into a coherent contribution. He has a sufficient range of language to express viewpoints without much searching for words, even though many of his utterances show a strong influence from German in both formulation and pronunciation. He cannot be said to show a relatively high degree of grammatical or lexical control. Communicates with reasonable accuracy in familiar contexts; generally good control, though with noticeable mother-tongue influence. Errors occur, but it is clear what he is trying to express.
Eaquals International Conference, 21 – 23 April 2016
Marco
Good interaction skills, and able to produce stretches of language with a fairly even tempo, although he can be hesitant. A generally coherent speaker with some impressive turns of phrase given the narrowness of his linguistic base. Weak on accuracy, with many past-tense and word-order mistakes, and he tends not to elaborate his contributions. Appeared to improve in the course of the activity.
Andreas
Clearly meets all the B2 criteria on Range, Accuracy, Fluency, Interaction and Coherence. A very controlled, conscious performance showing considerable language awareness for this level. He always gets his point across effectively, though the performance is very self-conscious and a little laboured at times.
Meets the level of accuracy described for B2+ but does not consistently maintain the high degree of accuracy seen at C1, and the hesitancy he showed launching himself into both description and discussion indicates he does not meet the C1 criterion in the area of Delivery.
Alternatives to theory-driven oral assessment criteria grids
Thom Kiddle, Director, NILE (Norwich Institute for Language Education)

Eaquals Members Meeting, Florence, November 2016
“[Theory-driven] approaches generate impoverished descriptions of communication, while performance data-driven approaches have the potential to provide richer descriptions that offer sounder inferences from score meaning to performance in specified domains.”
Fulcher et al (2011)
Potential problems with theory-driven assessment criteria
• "Reification of ordered scale descriptors" (Fulcher et al, 2011)
• Standardisation with abstract concepts
• May not relate to specific task demands
• Encourages the 'halo effect'
Halo effect
Try this experiment from Nobel prize winner Daniel Kahneman:
On the next page, you will see descriptions of two people. Read the descriptions and decide which person you view more favourably…
Halo effect
Alan is: intelligent – industrious – impulsive – critical – stubborn – envious
Ben is: envious – stubborn – critical – impulsive – industrious – intelligent
Implications
What implications might this have for traditional criteria grid models?
Fulcher et al (2011) propose Performance Decision Trees to incorporate specific reference to data obtained from successful performance on a task (and as a way to include 'indigenous' criteria).
You bought the product and had the problems shown in the video. Record a voicemail message for the manager of the shop, stating:
- What you bought
- What the problems were
- What you would like them to do about it
You should speak for at least one minute.
Lexical resource (theory-driven)
Manages to talk about familiar and unfamiliar topics but uses vocabulary with limited flexibility; attempts to use paraphrase but with mixed success.

Has enough language to get by, with sufficient vocabulary to express him/herself with some hesitation and circumlocution on topics such as family, hobbies and interests, work, travel, and current events.
Lexical resource (data-driven)
Is able to describe the sequence of events using time/sequence markers. Has sufficient resource to describe two specific problems, either with individual accurate lexis or 'placeholder names' ('thing', 'stuff', 'kind of'). Has specific lexis to refer to future action and desired outcome/response.

Can sequence events using, for example, earlier today / this morning / when I got home / after washing. Can identify concrete nouns and problems using, for example, jeans / washing machine / shrunk / ripped / a hole. Can make demands using, for example, money back, refund, replacement, return.
Challenges with data-driven approach
• Need for different descriptors for different tasks?
• Need for piloting with 'known masters' to obtain data?
• Need for detailed task familiarity among raters?
• Need to establish parallels between task demands?
• Need to relate to external frameworks?
TestDaF: The development of standardisation and monitoring practice for raters
Claudia Pop, TestDaF-Institut, g.a.s.t. e.V., Germany
Content

1. Why standardise?
2. The TestDaF – Test of German as a Foreign Language
3. Rater trainings
4. Conclusion
1. Why standardise?
2. The Test of German as a Foreign Language (TestDaF)
• Designed for international students applying for entry to an institution of higher education in Germany
• Measures German language proficiency at an intermediate to high level (B2.1 to C1.2)
• Developed, scored and evaluated at the TestDaF Institute in Germany
• Can be taken in the applicant's home country
• Administered worldwide since 2001
• High-stakes setting
2. The TestDaF

37,881 participants in 2015 – an increase of 18.8 per cent from 2014 to 2015; more than 257,000 participants since 2001.

Participants per year:
2001: 1,190 | 2002: 3,582 | 2003: 7,498 | 2004: 8,982 | 2005: 11,052 | 2006: 13,554 | 2007: 15,389 | 2008: 16,882 | 2009: 18,059 | 2010: 18,528 | 2011: 21,374 | 2012: 24,261 | 2013: 27,166 | 2014: 31,898 | 2015: 37,881
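The growth figures quoted on the slide can be checked with a few lines of Python, using the yearly participant numbers from the chart above:

```python
# Yearly TestDaF participant numbers, as shown on the slide.
participants = {
    2001: 1_190, 2002: 3_582, 2003: 7_498, 2004: 8_982, 2005: 11_052,
    2006: 13_554, 2007: 15_389, 2008: 16_882, 2009: 18_059, 2010: 18_528,
    2011: 21_374, 2012: 24_261, 2013: 27_166, 2014: 31_898, 2015: 37_881,
}

# Year-on-year growth 2014 -> 2015, quoted as 18.8 per cent.
growth = (participants[2015] - participants[2014]) / participants[2014]

# Cumulative total since 2001, quoted as "more than 257,000".
total = sum(participants.values())

print(f"growth: {growth:.1%}, total: {total:,}")  # growth: 18.8%, total: 257,296
```

Both quoted figures check out against the chart data.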
Our Test Centres worldwide (TestDaF / TestAS / onDaF-onSET test centres):
• Germany: 218 / 149 / 169
• Europe, Russian Federation, Turkey: 356 / 230 / 460
• Asia: 59 / 45 / 57
• Australia, New Zealand, Oceania: 5 / 2 / 4
• Africa: 19 / 14 / 21
• America: 41 / 24 / 66
2. The TestDaF
Quality cycle: Development → Administration → Scoring → Statistical analysis → Customer service / transparency of information
2. The TestDaF: Development
• Standardized format
• Training and guidelines for item writers
• Extensive trialling procedures for each test version
Item and task development → trialling → piloting → test okay? If no: revision (and renewed trialling); if yes: ready to go.
2. The TestDaF: Administration
• Administration in licenced test centres
• Training and monitoring for test administrators
• Detailed security instructions and procedures
• Inspections
2. The TestDaF: Scoring
• Training of raters
• Monitoring
• Calibration materials
• Regular evaluation of rater behaviour
Originally 2 test dates per year, with a calibration/training session before each test date.
Then 6 test dates per year (4+2), with calibration materials separated from rater trainings.
From now on: 9 test dates per year (6+3).
This has meant modifications in the standardisation process for raters.
3. The TestDaF: Rater Trainings

[Chart: number of trained raters, 2010–2016, as of 10/2016]
[Chart: number of initial trainings and re-trainings per year, 2010–2016]
3. The TestDaF: Rater Trainings – Initial trainings, goals:
• Explaining construct and format
• Introducing the TestDaF criteria and the rating procedure
• Operationalizing the process and criteria: rating of performances and group discussion
• Raising awareness of rater effects
• Explaining the statistical procedures of quality assurance (MFR analysis)
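As a loose illustration of what evaluating rater behaviour can involve, the sketch below estimates each rater's severity as the mean signed deviation of their scores from the per-performance consensus. This is a deliberate simplification: TestDaF uses many-facet Rasch (MFR) analysis, which this is not, and the raters, performances and scores here are invented.

```python
from statistics import mean

# Invented data: scores[rater][performance] = band awarded.
scores = {
    "rater_a": {"p1": 4, "p2": 3, "p3": 5},
    "rater_b": {"p1": 5, "p2": 4, "p3": 5},
    "rater_c": {"p1": 3, "p2": 3, "p3": 4},
}

performances = ["p1", "p2", "p3"]

# Consensus score for each performance: the mean across raters.
consensus = {p: mean(r[p] for r in scores.values()) for p in performances}

# Severity: mean signed deviation from consensus.
# Clearly negative = harsh rater, clearly positive = lenient rater.
severity = {
    rater: mean(s[p] - consensus[p] for p in performances)
    for rater, s in scores.items()
}
```

Flagging raters whose deviation drifts beyond a threshold is one simple way such monitoring can feed back into the re-training cycle described below.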
3. The TestDaF: Rater Trainings – Initial trainings, modifications:
• Since 2008: e-learning unit to be completed before the actual 2-day training session
• Since 2009: presentation slot on practical and logistical procedures
• Since 2013: successful individual rating as a condition to be contracted
3. The TestDaF: Rater Trainings – Re-trainings, goals:
• Recollecting the goal (construct)
• Individual rating of performances and group discussion
• Discussing external effects
• Giving updates about the TestDaF-Institut
• Further training on chosen topics
• Giving the opportunity to meet "the others" – "rating is a lonely job"
3. The TestDaF: Rater Trainings – Re-trainings, modifications:
• Since 2013: re-trainings are led by specially trained senior raters
• Re-trainings take place across Germany
• Preparation weekend in January of each year
3. The TestDaF: Rater Trainings – Follow-up problem: raters feel they are losing contact with the TestDaF staff.

Since 2016: online consultation hours (Vitero team room), in each assessment phase, separately for Writing and Speaking.
Summing up

• Calibration session / calibration material
• Rater trainings: initial rater trainings and re-trainings
• Online consultation hours
Conclusion

• Calibration material
• Initial rater trainings
• Re-trainings
• Consultation hours
• Online training
Standardisation – a practical example in a lowish-stakes context
Emma Heyderman, Director of Education, Lacunza - IH
Our journey
• about us
• the now
• and the future
English & French:
• 5
• 11
• 5,500 (70:30)
• 3 hrs / wk
• 110
• 30
• That they get off to a good start with English, becoming familiar with the language in an enjoyable atmosphere and acquiring the study habits they will use in the future.

• If the Lacunza pathway is followed, by the end of secondary school your child's level will be C1 mastery of the language.
Continuous assessment of:
ATTITUDE | ATTENDANCE | PUNCTUALITY
Speaking, Listening, Structure, Vocabulary, Writing

• A-B: Performance above expected level
• C: 'On track'
• D-E: Needs improvement
Speaking
• Students are placed in level in September
• Their speaking performance is assessed:
  • informally through activities in class
  • formally through at least three assessed speaking tasks per year
• Teachers use our own Speaking & Writing Assessment Handbook
And next?
• complete the training course
• but consider the implications for:
  • teaching and learning (How do these clips inform our reflections on our teaching and our students' learning and/or performance?)
  • evaluation and assessment (How do these clips inform the decisions we make about evaluation and assessment?)
Thank you!