Evaluation: State of the art and future actions


Bente Maegaard

CST, University of [email protected]

Bente Maegaard, LREC 2006

Evaluation at LREC

More than 150 papers were submitted to the Evaluation track, both Written and Spoken

This is a significant rise compared to previous years

Evaluation as a field is attracting increasing interest. Many papers discuss evaluation methodology; the field is still under development, and the answers to some of the methodological questions are still not known.

• An example: MT
• Automatic evaluation
• Evaluation in Context (task-based, function-based)


Evaluation Written

Parsing evaluation: 6
Semantics, sense: 6
Evaluation methodologies: 7
Time annotation: 9
MT: 13
Annotation, alignment, morph.: 15
Lexica, tools: 21
QA, IR, IE, summarisation, authoring: 25

Total 102

Note: These figures may contain papers that were originally in other tracks.


Discussion MT evaluation

MT evaluation since 1965

Van Slype: adequacy, fluency, fidelity

Human evaluation: expensive, time-consuming, problems with counting of errors, objective?

Formalising human evaluation, adding e.g. grammaticality

Another measure: cost of post-editing, objective

Automatic evaluation: Papineni et al. 2001: BLEU, with various modifications. Expensive to establish the reference translations, after that cheap and fast.

However, research shows that this automatic method does not correlate well with human evaluation, nor with the cost of post-editing, etc.

Automatic statistical evaluation can probably be used for evaluation of MT for gisting, but it cannot be used for MT for publishing
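To make the automatic approach concrete, here is a minimal sketch of a BLEU-style score: clipped n-gram precision against one or more reference translations, combined by a geometric mean and a brevity penalty. It is an illustration only, not the implementation used in any of the papers mentioned. The function names, the whitespace tokenisation and the choice of `max_n=2` are assumptions made for brevity.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: a candidate n-gram is credited at most as
    many times as it occurs in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu_sketch(candidate, references, max_n=2):
    """Geometric mean of clipped precisions, scaled by a brevity penalty.
    max_n=2 is an assumption to keep short examples non-zero."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Use the reference length closest to the candidate length.
    ref_len = min((len(r) for r in references),
                  key=lambda n: (abs(n - len(candidate)), n))
    if len(candidate) > ref_len:
        penalty = 1.0
    else:
        penalty = math.exp(1 - ref_len / len(candidate))
    return penalty * math.exp(log_mean)


reference = "the cat is on the mat".split()
paraphrase = "there is a cat on the mat".split()
print(bleu_sketch(reference, [reference]))   # identical to the reference
print(bleu_sketch(paraphrase, [reference]))  # acceptable paraphrase scores lower
```

The sketch shows where the cost lies: the score is only as good as the set of reference translations, and an acceptable translation that words things differently from the references is penalised.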


Digression: Metrics that do not work

Why is it so difficult to evaluate MT?

Because there is more than one correct answer.

And because answers may be more or less correct.

Measures like WER are not relevant for MT.

Methods that rely on a specific number of words in the translation do not work when the translation does not have the same number of words as the reference.
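A small sketch of word error rate (WER) may help show why it fits MT poorly. WER is the word-level edit distance between hypothesis and reference, divided by the reference length. The function name and the example sentences are assumptions chosen for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (substitutions, insertions, deletions)
    divided by the number of words in the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


reference = "he left the house quickly"
print(word_error_rate(reference, reference))
print(word_error_rate(reference, "quickly he left the house"))
```

The second hypothesis is an acceptable translation with a different word order, yet it is charged two edits out of five reference words. This reflects the point above: with more than one correct answer, a measure tied to a single word sequence penalises valid variation.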


Generic Contextual Quality Model (GCQM)

Popescu-Belis et al. LREC2006

Building on the same thinking as the FEMTI taxonomy

One can only evaluate a system in the context in which it will be used.

Quality workshop 27/5: task-based, function-based evaluation. (Huang, Take)

Karen Sparck-Jones: ‘the set-up’

So the understanding that a system can only be reasonably evaluated with respect to a specific task is accepted

Domain-specific vs. general purpose MT


What do we need? When?

What?

In the field of MT evaluation we need more experiments in order to establish a methodology.

The French CESTA (Hamon et al., LREC2006) is a good example.

So, we need international cooperation for the infrastructure, but in the first instance this cooperation should lead to reliable metrics for MT evaluation. Later on it may be used for actually measuring MT systems’ performance.

(Of course not only MT!)

When?

As soon as possible.

Start with methodology, for each application

Move on to doing evaluation

Goal: in 2011 we can reliably evaluate MT - and other applications!
