Methodologies for Evaluating Dialog Structure Annotation
![Page 1: Methodologies for Evaluating Dialog Structure Annotation](https://reader036.fdocuments.net/reader036/viewer/2022081604/5681638d550346895dd4820e/html5/thumbnails/1.jpg)
Methodologies for Evaluating Dialog Structure Annotation
Ananlada Chotimongkol
Presented at Dialogs on Dialogs Reading Group, 27 January 2006
![Page 2: Methodologies for Evaluating Dialog Structure Annotation](https://reader036.fdocuments.net/reader036/viewer/2022081604/5681638d550346895dd4820e/html5/thumbnails/2.jpg)
Dialog structure annotation evaluation
How good is the annotated dialog structure?
Evaluation methodologies:
1. Qualitative evaluation (humans rate how good it is)
2. Compare against a gold standard (usually created by a human)
3. Evaluate the end product (task-based evaluation)
4. Evaluate the principles used
5. Inter-annotator agreement (comparing subjective judgments when there is no single correct answer)
![Page 3: Methodologies for Evaluating Dialog Structure Annotation](https://reader036.fdocuments.net/reader036/viewer/2022081604/5681638d550346895dd4820e/html5/thumbnails/3.jpg)
Choosing evaluation methodologies
Depends on the kind of information being annotated
1. Categorical annotation e.g. dialog act
2. Boundary annotation e.g. discourse segment
3. Structural annotation e.g. rhetorical structure
![Page 4: Methodologies for Evaluating Dialog Structure Annotation](https://reader036.fdocuments.net/reader036/viewer/2022081604/5681638d550346895dd4820e/html5/thumbnails/4.jpg)
Categorical annotation evaluation
- Cochran's Q test
  - Tests whether the number of coders assigning the same label at each position is randomly distributed
  - Doesn’t directly tell the degree of agreement
- Percentage of agreement
  - Measures how often the coders agree
  - Doesn’t account for agreement by chance
- Kappa coefficient [Carletta, 1996]
  - Measures pairwise agreement among coders, correcting for expected chance agreement
![Page 5: Methodologies for Evaluating Dialog Structure Annotation](https://reader036.fdocuments.net/reader036/viewer/2022081604/5681638d550346895dd4820e/html5/thumbnails/5.jpg)
Kappa statistic
- Kappa coefficient (K) measures pairwise agreement among coders on categorical judgments

$$K = \frac{P(A) - P(E)}{1 - P(E)}$$

- P(A) is the proportion of times the coders agree
- P(E) is the proportion of times they are expected to agree by chance
- K > 0.8 indicates substantial agreement
- 0.67 < K < 0.8 indicates moderate agreement
- Difficult to calculate chance expected agreement in some cases
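The kappa formula can be made concrete with a small two-coder sketch. It assumes chance agreement P(E) is estimated from each coder's own label distribution (Cohen's variant); other formulations pool the coders' distributions, so this is one possible reading rather than the exact computation used in the talk. The dialog act labels are hypothetical.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """P(A): fraction of items on which the two coders chose the same label."""
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def kappa(labels_a, labels_b):
    """K = (P(A) - P(E)) / (1 - P(E)) for two coders.

    Assumption: P(E) is the sum over labels of the product of the two
    coders' individual label proportions.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    p_a = percent_agreement(labels_a, labels_b)
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (p_a - p_e) / (1.0 - p_e)


# Hypothetical dialog act labels from two coders.
coder_1 = ["question", "question", "statement", "statement"]
coder_2 = ["question", "statement", "statement", "statement"]
print(percent_agreement(coder_1, coder_2), kappa(coder_1, coder_2))
```

Here the coders agree on 3 of 4 items (P(A) = 0.75) while P(E) = 0.5, so K = 0.5, which shows how raw percentage agreement overstates reliability when chance agreement is high.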
![Page 6: Methodologies for Evaluating Dialog Structure Annotation](https://reader036.fdocuments.net/reader036/viewer/2022081604/5681638d550346895dd4820e/html5/thumbnails/6.jpg)
Boundary annotation evaluation
- Use Kappa coefficient
  - Don’t compare the segments directly; compare the decision on placing each boundary
  - At each eligible point, make a binary decision whether to annotate it as “boundary” or “non-boundary”
- However, Kappa coefficient doesn’t accommodate near-miss boundaries
  - Redefine the matching criterion, e.g. also count a near miss as a match
  - Use other metrics, e.g. probabilistic error metrics
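A minimal sketch of the two ideas above: turning a segmentation into per-point binary decisions (which can then be scored with kappa), and a relaxed matching criterion that accepts near misses. The function names, the `tolerance` parameter, and the greedy one-to-one matching are assumptions made for illustration.

```python
def boundary_decisions(boundaries, num_points):
    """Binary decision at each eligible point: 1 = boundary, 0 = non-boundary."""
    return [1 if point in boundaries else 0 for point in range(num_points)]


def near_miss_matches(reference, hypothesis, tolerance=1):
    """Count hypothesis boundaries lying within `tolerance` points of a
    reference boundary. Each reference boundary is used at most once
    (greedy left-to-right matching, an illustrative simplification)."""
    unused = sorted(reference)
    matched = 0
    for boundary in sorted(hypothesis):
        for candidate in unused:
            if abs(candidate - boundary) <= tolerance:
                unused.remove(candidate)
                matched += 1
                break
    return matched


# Hypothetical boundary positions from two annotators.
annotator_1 = {3, 7, 12}
annotator_2 = {4, 7, 10}
print(near_miss_matches(annotator_1, annotator_2, tolerance=0))  # exact matches only
print(near_miss_matches(annotator_1, annotator_2, tolerance=1))  # near misses accepted
```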
![Page 7: Methodologies for Evaluating Dialog Structure Annotation](https://reader036.fdocuments.net/reader036/viewer/2022081604/5681638d550346895dd4820e/html5/thumbnails/7.jpg)
Probabilistic error metrics
- Pk [Beeferman et al, 1999]
  - Measures how likely it is that two time points are classified inconsistently (in the same segment in one annotation but in different segments in the other)
  - Small Pk means high degree of agreement
- WindowDiff (WD) [Pevzner and Hearst, 2002]
  - Measures the number of intervening topic breaks between time points
  - Penalizes the difference in the number of segment boundaries between two time points
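Both metrics can be sketched by sliding a window of width k over the sequence. The sketch assumes each annotation is given as a list of segment lengths, and that k defaults to half the average reference segment length; the helper names are illustrative and not taken from the cited papers.

```python
def segment_ids(segment_lengths):
    """[3, 2] -> [0, 0, 0, 1, 1]: the segment index of every unit."""
    ids = []
    for seg_index, length in enumerate(segment_lengths):
        ids.extend([seg_index] * length)
    return ids


def _window(ref_ids, hyp_ids, k):
    if len(ref_ids) != len(hyp_ids):
        raise ValueError("annotations must cover the same number of units")
    if k is None:
        # Assumption: half the mean reference segment length.
        k = max(1, round(len(ref_ids) / (2 * (ref_ids[-1] + 1))))
    if len(ref_ids) <= k:
        raise ValueError("sequence is too short for the window size")
    return k


def p_k(ref_ids, hyp_ids, k=None):
    """Fraction of windows whose two ends are in the same segment in one
    annotation but in different segments in the other."""
    k = _window(ref_ids, hyp_ids, k)
    windows = len(ref_ids) - k
    errors = sum(
        (ref_ids[i] == ref_ids[i + k]) != (hyp_ids[i] == hyp_ids[i + k])
        for i in range(windows)
    )
    return errors / windows


def window_diff(ref_ids, hyp_ids, k=None):
    """Fraction of windows in which the two annotations place a different
    number of boundaries."""
    k = _window(ref_ids, hyp_ids, k)
    windows = len(ref_ids) - k
    errors = sum(
        (ref_ids[i + k] - ref_ids[i]) != (hyp_ids[i + k] - hyp_ids[i])
        for i in range(windows)
    )
    return errors / windows


reference = segment_ids([4, 4])
hypothesis = segment_ids([5, 3])  # boundary placed one unit late
print(p_k(reference, hypothesis, k=2), window_diff(reference, hypothesis, k=2))
```

In this example the boundary is off by one unit and both metrics give 2/6, a partial penalty rather than the full miss plus false alarm that exact boundary matching would count.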
![Page 8: Methodologies for Evaluating Dialog Structure Annotation](https://reader036.fdocuments.net/reader036/viewer/2022081604/5681638d550346895dd4820e/html5/thumbnails/8.jpg)
Structural annotation evaluation
- Cascaded approach
  - Evaluate one level at a time
  - Evaluate the annotation of the higher level only if the annotation of the lower level is agreed upon
  - Example: nested game annotation in Map Task [Carletta et al, 1997]
- Redefine matching criteria for structural annotation [Flammia and Zue, 1995]
  - Segment A matches segment B if A contains B
  - Segment A in annotation-i matches the segments in annotation-j if the segments in annotation-j exclude segment A
  - Agreement criterion isn’t symmetric
- Flatten the hierarchical structure
  - Flatten the hierarchy into overlapping spans
  - Compute agreement on the spans or the spans’ labels
  - Example: RST annotation [Marcu et al, 1999]
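The flattening step can be sketched as follows. The tree encoding (a `(label, children)` tuple whose children are either words or nested nodes) and the labels are assumptions made for illustration, not the representation used in the cited work.

```python
def flatten_spans(node, start=0):
    """Return (end, spans) for a (label, children) tree.

    Each span is (start, end, label) over word positions, end exclusive.
    A parent span is listed before the spans nested inside it.
    """
    label, children = node
    position = start
    spans = []
    for child in children:
        if isinstance(child, str):
            position += 1  # a plain word advances the position by one
        else:
            position, child_spans = flatten_spans(child, position)
            spans.extend(child_spans)
    spans.insert(0, (start, position, label))
    return position, spans


# Hypothetical annotation of "i want boston at two thirty".
tree = ("task", [
    ("sub-task", ["i", "want", ("concept", ["boston"])]),
    ("sub-task", ["at", ("concept", ["two", "thirty"])]),
])
_, spans = flatten_spans(tree)
for span in spans:
    print(span)
```

Once both annotations are flattened this way, agreement can be computed over the resulting spans or their labels.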
![Page 9: Methodologies for Evaluating Dialog Structure Annotation](https://reader036.fdocuments.net/reader036/viewer/2022081604/5681638d550346895dd4820e/html5/thumbnails/9.jpg)
Form-based dialog structure
- Describes a dialog structure using a task structure: a hierarchical organization of domain information
  - Task: a subset of dialogs that has a specific goal
  - Sub-task:
    - A decomposition of a task
    - Corresponds to one action (the process that uses related pieces of information together to create a new piece of information or a new dialog state)
  - Concept: a word or a group of words that captures information necessary for performing an action
- Task structure is domain-dependent
![Page 10: Methodologies for Evaluating Dialog Structure Annotation](https://reader036.fdocuments.net/reader036/viewer/2022081604/5681638d550346895dd4820e/html5/thumbnails/10.jpg)
An example of form-based structure annotation

    <task name=" ">
      <sub-task name=" ">
        word1 word2 <concept name=" ">word3</concept> word4 … wordn
        word1 <concept name=" ">word2</concept> word3 word4 … wordn
        …
      </sub-task>
      <sub-task name=" ">
        … … …
      </sub-task>
    </task>
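An annotation in this format can be read with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree`; the utterance and the label names (`reserve_flight`, `query_flight`, `city`, `time`) are hypothetical and only illustrate the nesting of task, sub-task, and concept.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated fragment; the label names are made up.
ANNOTATION = (
    '<task name="reserve_flight">'
    '<sub-task name="query_flight">'
    'i want to fly to <concept name="city">boston</concept> '
    'at <concept name="time">two thirty</concept>'
    '</sub-task>'
    '</task>'
)


def list_components(xml_text):
    """Return (tag, name, covered words) for every annotated component,
    in document order."""
    root = ET.fromstring(xml_text)
    components = []
    for element in root.iter():
        words = " ".join("".join(element.itertext()).split())
        components.append((element.tag, element.get("name"), words))
    return components


for component in list_components(ANNOTATION):
    print(component)
```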
![Page 11: Methodologies for Evaluating Dialog Structure Annotation](https://reader036.fdocuments.net/reader036/viewer/2022081604/5681638d550346895dd4820e/html5/thumbnails/11.jpg)
Annotation experiment
- Goal: to verify that the form-based dialog structure can be understood and applied by other annotators
- The subjects were asked to identify the task structure of the dialogs in two domains
  - Air travel planning domain
  - Map reading domain
- Need a different set of labels for each domain
  - Equivalent to designing domain-specific labels from the definition of dialog structure components
![Page 12: Methodologies for Evaluating Dialog Structure Annotation](https://reader036.fdocuments.net/reader036/viewer/2022081604/5681638d550346895dd4820e/html5/thumbnails/12.jpg)
Annotation procedure
- The subjects study an annotation guideline
  - Definition of the task structure
  - Examples from other domains (bus schedule and UAV flight simulation)
- For each domain, the subjects study the transcription of 2-3 dialogs
  1. Create a set of labels for annotating the task structure
  2. Annotate the given dialogs with the set of labels designed in 1)
![Page 13: Methodologies for Evaluating Dialog Structure Annotation](https://reader036.fdocuments.net/reader036/viewer/2022081604/5681638d550346895dd4820e/html5/thumbnails/13.jpg)
Issues on task structure annotation evaluation
- There is more than one acceptable annotation
  - Similar to MT evaluation
  - But difficult to obtain multiple references
- The tag sets used by two annotators may not be the same
  1. <time>two thirty</time>
  2. <time><hour>two</hour> <min>thirty</min></time>
  - Difficult to define matching criteria
  - Mapping equivalent labels between two tag sets is subjective (and may not be possible)
![Page 14: Methodologies for Evaluating Dialog Structure Annotation](https://reader036.fdocuments.net/reader036/viewer/2022081604/5681638d550346895dd4820e/html5/thumbnails/14.jpg)
Cross-annotator correction
- Ask a different annotator (2nd annotator) to judge the annotation and correct the parts that don’t conform to the guideline
- If the 2nd annotator agrees with the 1st one, he will make no correction
- The 2nd annotator’s own annotation may be different, because more than one annotation can conform to the rules
![Page 15: Methodologies for Evaluating Dialog Structure Annotation](https://reader036.fdocuments.net/reader036/viewer/2022081604/5681638d550346895dd4820e/html5/thumbnails/15.jpg)
Cross-annotator correction (2)
- Pro:
  - Easier to evaluate the agreement, since the annotations are based on the same tag set
  - Allows more than one acceptable annotation
- Con:
  - Needs another annotator, takes time
  - Another subjective judgment
  - Need to measure the amount of change made by the 2nd annotator
![Page 16: Methodologies for Evaluating Dialog Structure Annotation](https://reader036.fdocuments.net/reader036/viewer/2022081604/5681638d550346895dd4820e/html5/thumbnails/16.jpg)
Cross-annotators
- Who should be the 2nd annotator?
  - Another subject who also did the annotation
    - Biased toward his own annotation?
  - Another subject who studied the guideline but didn’t do his/her own annotation
    - May not think about the structure thoroughly
  - Experts
    - Can also measure annotation accuracy using an expert annotation as a reference
![Page 17: Methodologies for Evaluating Dialog Structure Annotation](https://reader036.fdocuments.net/reader036/viewer/2022081604/5681638d550346895dd4820e/html5/thumbnails/17.jpg)
How to quantify amount of correction
- Edit distance from the original annotation
  - For structural annotation, have to redefine edit operations
  - Lower number means higher agreement, but which range of values is acceptable?
- Inter-annotator agreement
  - Can apply structural annotation evaluation
  - Agreement number is meaningful, can compare across different domains
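As a baseline for the edit-distance option, the sketch below computes a plain Levenshtein distance over a token sequence in which tags and words are both tokens. This is a deliberate simplification: it treats the annotation as flat and uses unit-cost insert, delete, and substitute operations, whereas a structural annotation would need edit operations redefined over the hierarchy.

```python
def edit_distance(source, target):
    """Levenshtein distance between two token sequences (unit costs)."""
    previous = list(range(len(target) + 1))
    for i, source_token in enumerate(source, start=1):
        current = [i]
        for j, target_token in enumerate(target, start=1):
            substitution = previous[j - 1] + (source_token != target_token)
            current.append(min(previous[j] + 1,      # delete from source
                               current[j - 1] + 1,   # insert into source
                               substitution))
        previous = current
    return previous[-1]


# Hypothetical original annotation and the 2nd annotator's correction.
original = ["<time>", "two", "thirty", "</time>"]
corrected = ["<time>", "<hour>", "two", "</hour>", "thirty", "</time>"]
print(edit_distance(original, corrected))
```

The raw count depends on the length of the annotation, which is one reason an acceptable range is hard to state without some normalization.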
![Page 18: Methodologies for Evaluating Dialog Structure Annotation](https://reader036.fdocuments.net/reader036/viewer/2022081604/5681638d550346895dd4820e/html5/thumbnails/18.jpg)
Cross-annotation agreement
- Use a similar approach to [Marcu et al, 1999]
  - Flatten the hierarchy into overlapping spans
  - Compute agreement on the labels of the spans (task, sub-task, concept labels)
- Issues
  - A lot of possible spans with no label (esp. for concept annotation)
  - How to calculate P(E) when new concepts are added
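One way to picture the span-label comparison is the sketch below. It assumes each annotation has already been flattened into a mapping from `(start, end)` to a label, and it scores only spans that at least one annotator labeled, assigning a placeholder label to the other side. Restricting to that union is an assumption chosen to sidestep the many unlabeled spans; it yields observed agreement only and does not address how P(E) should be estimated.

```python
def span_label_agreement(spans_a, spans_b, unlabeled="none"):
    """Observed agreement over spans labeled by at least one annotator.

    spans_a, spans_b: dicts mapping (start, end) -> label.
    A span missing from one annotation is treated as `unlabeled` there.
    """
    all_spans = set(spans_a) | set(spans_b)
    if not all_spans:
        raise ValueError("no labeled spans to compare")
    agreed = sum(
        spans_a.get(span, unlabeled) == spans_b.get(span, unlabeled)
        for span in all_spans
    )
    return agreed / len(all_spans)


# Hypothetical flattened annotations with illustrative labels.
original = {(0, 3): "sub-task", (2, 3): "city"}
corrected = {(0, 3): "sub-task", (2, 3): "destination", (4, 6): "time"}
print(span_label_agreement(original, corrected))
```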
![Page 19: Methodologies for Evaluating Dialog Structure Annotation](https://reader036.fdocuments.net/reader036/viewer/2022081604/5681638d550346895dd4820e/html5/thumbnails/19.jpg)
Objective annotation evaluation
- Makes it more comparable to other work
- Easier to evaluate, doesn’t need the 2nd annotator
- Label-insensitive
  - 3 labels: <task>, <sub-task>, <concept>
  - May also consider the level of sub-tasks, e.g. <sub-task1>, <sub-task2>
  - Kappa artificially high
- Add qualitative analysis on what they don’t agree on
![Page 20: Methodologies for Evaluating Dialog Structure Annotation](https://reader036.fdocuments.net/reader036/viewer/2022081604/5681638d550346895dd4820e/html5/thumbnails/20.jpg)
References
- J. Carletta, "Assessing agreement on classification tasks: the kappa statistic," Computational Linguistics, vol. 22, pp. 249-254, 1996.
- D. Beeferman, A. Berger, and J. Lafferty, "Statistical Models for Text Segmentation," Machine Learning, vol. 34, pp. 177-210, 1999.
- L. Pevzner and M. A. Hearst, "A critique and improvement of an evaluation metric for text segmentation," Computational Linguistics, vol. 28, pp. 19-36, 2002.
- J. Carletta, S. Isard, G. Doherty-Sneddon, A. Isard, J. C. Kowtko, and A. H. Anderson, "The reliability of a dialogue structure coding scheme," Computational Linguistics, vol. 23, pp. 13-31, 1997.
- G. Flammia and V. Zue, "Empirical evaluation of human performance and agreement in parsing discourse constituents in spoken dialogue," in Proceedings of Eurospeech 1995, Madrid, Spain, 1995.
- D. Marcu, E. Amorrortu, and M. Romera, "Experiments in constructing a corpus of discourse trees," in Proceedings of the ACL Workshop on Standards and Tools for Discourse Tagging, College Park, MD, 1999.
![Page 21: Methodologies for Evaluating Dialog Structure Annotation](https://reader036.fdocuments.net/reader036/viewer/2022081604/5681638d550346895dd4820e/html5/thumbnails/21.jpg)
Matching criteria
- Exact match (pairwise)
- Partial match (pairwise)
- Agree with majority (pool of coders)
- Agree with consensus (pool of coders)
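The "agree with majority" criterion can be sketched as below. The sketch takes the most frequent label per item as the majority and skips items with a tie for first place; both choices are assumptions, since the slide does not define how the majority is formed. The labels are hypothetical.

```python
from collections import Counter


def majority_labels(codings):
    """codings: one label list per coder, all of the same length.
    Returns the most frequent label per item, or None when there is a tie."""
    majority = []
    for item_labels in zip(*codings):
        ranked = Counter(item_labels).most_common()
        top_label, top_count = ranked[0]
        tied = len(ranked) > 1 and ranked[1][1] == top_count
        majority.append(None if tied else top_label)
    return majority


def agreement_with_majority(coder_labels, majority):
    """Fraction of non-tied items on which a coder matches the majority."""
    scored = [(c, m) for c, m in zip(coder_labels, majority) if m is not None]
    if not scored:
        raise ValueError("every item is tied; no majority to compare against")
    return sum(c == m for c, m in scored) / len(scored)


# Hypothetical labels from a pool of three coders over three items.
pool = [
    ["q", "s", "s"],
    ["q", "q", "s"],
    ["s", "q", "s"],
]
majority = majority_labels(pool)
print(majority, agreement_with_majority(pool[0], majority))
```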