Lecture 11: Evaluation & Wrap-up
Pierre Lison, Language Technology Group (LTG)
Department of Informatics
Fall 2012, October 15, 2012
Saturday, October 13, 2012
© 2012, Pierre Lison - INF5820 course
Outline
• Evaluation of dialogue systems
• Wrap-up
• Questions, comments?
Outline
• Evaluation of dialogue systems
• Wrap-up
• Questions, comments?
Evaluation
• Standard evaluation metrics exist for some tasks:
  • ASR: Word Error Rate (cf. lecture 7)
  • NLU: precision, recall, and F-score for parsing, reference resolution, and dialogue act recognition
  • TTS: evaluation by human listeners on sound intelligibility and quality
• But evaluating the global conversational behaviour of the system is much trickier!
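To make the ASR metric concrete, here is a minimal sketch of Word Error Rate computed as word-level edit distance (the example utterances are invented for illustration):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    via standard Levenshtein dynamic programming over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("book a flight to oslo", "book flight to oslo"))
# 0.2 (one deletion out of five reference words)
```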
Evaluation
• A good (but subjective) way to evaluate a dialogue system is to get user satisfaction ratings
• This can be done via surveys that users are asked to fill in after interacting with the system, for instance:
TTS Performance: Was the system easy to understand?
ASR Performance: Did the system understand what you said?
Task Ease: Was it easy to find the message/flight/train you wanted?
Interaction Pace: Was the pace of interaction with the system appropriate?
User Expertise: Did you know what you could say at each point?
System Response: How often was the system sluggish and slow to reply to you?
Expected Behavior: Did the system work the way you expected it to?
Future Use: Do you think you’d use the system in the future?
[M. Walker et al. (2001), «Quantitative and Qualitative Evaluation of Darpa Communicator Spoken Dialogue Systems», Proceedings of ACL]
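In the Walker et al. setup, answers to such questionnaires are mapped to a numeric scale and combined into a single user satisfaction score. A small sketch, with invented 5-point ratings and simple summation as the aggregation rule:

```python
# Hypothetical 5-point Likert responses to the eight survey questions above;
# the overall satisfaction score is taken as the sum of the ratings.
survey = {
    "TTS Performance": 4,
    "ASR Performance": 3,
    "Task Ease": 4,
    "Interaction Pace": 5,
    "User Expertise": 3,
    "System Response": 4,
    "Expected Behavior": 4,
    "Future Use": 2,
}
user_satisfaction = sum(survey.values())
print(user_satisfaction)  # 29, on a scale from 8 (worst) to 40 (best)
```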
Evaluation
• User evaluation surveys are useful, but they are expensive and time-consuming
• Not feasible to do user evaluations after each system change!
• Need a way to automate the evaluation process
• Possible solution: rely on evaluation metrics that can be directly extracted from the dialogue, and are known to correlate with user satisfaction
• Improving these observable metrics should therefore increase user satisfaction
Evaluation
Task completion success: How often was the system able to complete its task successfully?
  Possible metrics: κ agreement for slot-filling applications; completion ratio

Efficiency costs: How efficient was the system in executing its task?
  Possible metrics: number of turns (from user, system, or both); total elapsed time

Quality costs: How good was the system interaction?
  Possible metrics: number of ASR rejection prompts; number of user barge-ins; number of error messages
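Most of these efficiency and quality metrics can be extracted automatically from system logs. A minimal sketch, assuming a hypothetical log format (the field names and the example dialogue are invented):

```python
# Hypothetical dialogue log: one dict per turn, with timestamps in seconds.
log = [
    {"speaker": "system", "text": "Where do you want to travel?", "time": 0.0},
    {"speaker": "user", "text": "To Bergen", "time": 2.1, "barge_in": False},
    {"speaker": "system", "text": "Sorry, I did not catch that.", "time": 4.0,
     "rejection": True},
    {"speaker": "user", "text": "Bergen", "time": 6.5, "barge_in": True},
    {"speaker": "system", "text": "Booking a ticket to Bergen.", "time": 8.2},
]

# Efficiency and quality metrics from the table above
metrics = {
    "user_turns": sum(1 for t in log if t["speaker"] == "user"),
    "system_turns": sum(1 for t in log if t["speaker"] == "system"),
    "elapsed_time": log[-1]["time"] - log[0]["time"],
    "rejections": sum(1 for t in log if t.get("rejection")),
    "barge_ins": sum(1 for t in log if t.get("barge_in")),
}
print(metrics)
```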
Evaluation
• Assume we have found a set of metrics for our domain... how do we weight their relative importance?
• PARADISE framework:
  1. We start by doing a user satisfaction survey for a set of dialogues (via questionnaires such as the one we have seen)
  2. We also measure the performance of these dialogues according to our metrics (task success, number of turns, number of errors, etc.)
  3. Finally, we apply multiple regression to learn the weight of each metric (this way, we ensure our metrics correlate with user satisfaction)
[M. Walker et al. (1997), "PARADISE: A general framework for evaluating spoken dialogue agents", Proceedings of ACL]
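The regression step (step 3) can be sketched as an ordinary least-squares fit from dialogue-level metrics to survey scores. All data below is invented for illustration:

```python
import numpy as np

# Per-dialogue metric values; columns: task success (0/1), number of
# turns, number of errors. Five hypothetical dialogues.
metrics = np.array([
    [1.0,  8, 0],
    [1.0, 12, 1],
    [0.0, 20, 4],
    [1.0, 10, 1],
    [0.0, 15, 3],
], dtype=float)
# Corresponding user satisfaction scores from the surveys
satisfaction = np.array([34.0, 30.0, 14.0, 32.0, 18.0])

# Add an intercept column and solve the least-squares problem
X = np.hstack([metrics, np.ones((len(metrics), 1))])
weights, *_ = np.linalg.lstsq(X, satisfaction, rcond=None)
# weights[:3] are the learned weights of the three metrics; cost
# metrics such as turns and errors typically get negative weights
print(weights)
```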
Evaluation
PARADISE model of performance:
Performance = Σ_{i=1}^{n} wᵢ mᵢ

where wᵢ is the weight of metric i (learned via regression from the user satisfaction ratings), and mᵢ is the measured value of metric i (for instance, task success = 0.91).

NB: the weights can be negative!
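Once the weights are learned, scoring a new dialogue is just a weighted sum. A tiny sketch with illustrative (not learned) weights, showing negative weights on the cost metrics:

```python
# Illustrative weights: task success is rewarded, the cost metrics
# (turns, errors) are penalised via negative weights.
weights = {"task_success": 20.0, "num_turns": -0.5, "num_errors": -2.0}
# Measured metric values for one hypothetical dialogue
dialogue = {"task_success": 0.91, "num_turns": 12, "num_errors": 1}

performance = sum(weights[m] * dialogue[m] for m in weights)
print(performance)  # 20*0.91 - 0.5*12 - 2*1 ≈ 10.2
```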
Evaluation
• The PARADISE model works quite well for slot-filling applications
• Appropriate metrics might be harder to define or apply in other domains
• For instance, a «companion»-type agent has no clear task, and shouldn’t necessarily minimise the number of turns!
• For these domains, the notion of «appropriateness», i.e. the system’s ability to maintain a natural, fluid conversation over time, might be more important
[D. Traum et al. (2004), «Evaluation of multi-party virtual reality dialogue interaction», in Proceedings of LREC]
Outline
• Evaluation of dialogue systems
• Wrap-up
• Questions, comments?
Wrap-up
Spoken dialogue is...
1. a joint activity
2. uncertain
3. structured
4. contextual
5. goal-driven
Dialogue is a joint activity
The dialogue participants take turns and perform dialogue acts
They interpret each other’s utterances cooperatively
They cooperate to achieve mutual understanding (common ground)
They naturally align their way of speaking
Dialogue is uncertain
• Uncertainty is everywhere in dialogue:
• Speech recognition is always error-prone!
• Ambiguities are part and parcel of human language
• User behaviours are often unpredictable
• The dialogue context is rarely crystal clear
That’s why probabilistic modelling is instrumental in the design of robust dialogue systems
Dialogue is structured
• Dialogue is full of intricate structures:
  • Syntactic and prosodic phrases
  • Semantic relations within an utterance
  • Pragmatic relations between utterances / dialogue acts
  • References to entities, persons, places, events
  • Attentional and intentional structure
A small experiment
A small test to check how well you can concentrate on a task!
Dialogue is contextual
• Context is crucial for dialogue processing:
  • Pronunciation varies depending on the context
  • Non-sentential utterances only make sense in a situation
  • Deictics are omnipresent in dialogue
• Dialogue management & generation must adapt to contextual factors
• More than just «context-sensitivity»: selective attention actually drives the processing
Dialogue is goal-driven
• We communicate to do things in the world
• Dialogue acts are guided by intentions and provoke effects
• Verbal and non-verbal actions are intertwined
• Dialogue participants have multiple competing goals to fulfill, leading to a decision-theoretic problem of utility maximisation
• To this end, they must plan their actions over time
Outline
• Evaluation of dialogue systems
• Wrap-up
• Questions, comments?
Practical details
• I will provide a detailed list of sections to read in the Martin & Jurafsky book
• You can find more explanations there if the slides are unclear
• You only have to study what we have seen in the lectures (i.e. what is on the slides)
• Some material in the slides is not in the M&J book (it is of course still part of the curriculum!)
• I will also provide a mock exam, so you know «what to expect» for the final exam
Questions, comments
• Questions on the course?
• Problems you are struggling with?
• Comments or advice on my teaching style?