
  • Lecture 11:Evaluation & Wrap-up

    Pierre Lison, Language Technology Group (LTG)

    Department of Informatics

    Fall 2012, October 15 2012

Saturday 13 October 2012

© 2012, Pierre Lison - INF5820 course

    Outline

    • Evaluation of dialogue systems• Wrap-up• Questions, comments?


    Evaluation

• Standard evaluation metrics exist for some tasks:

      • ASR: Word Error Rate (cf. lecture 7)

      • NLU: precision, recall and F-score for parsing, reference resolution, and dialogue act recognition

      • TTS: evaluation by human listeners on sound intelligibility and quality

    • But evaluating the global conversational behaviour of the system is much trickier!
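The Word Error Rate mentioned above can be sketched in a few lines: it is the word-level edit distance between the reference transcript and the ASR hypothesis, divided by the reference length. A minimal illustration (the example sentences are invented):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with a standard Levenshtein dynamic program over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1,           # insertion
                          d[i - 1][j - 1] + cost)    # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("book a flight to oslo", "book flight to oslo"))  # 0.2
```

One deleted word out of a five-word reference gives a WER of 0.2; note that WER can exceed 1.0 when the hypothesis contains many insertions.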


    Evaluation

    • A good (but subjective) way to evaluate a dialogue system is to get user satisfaction ratings

This can be done via surveys that users are asked to fill in after interacting with the system, for instance:


• TTS Performance: Was the system easy to understand?

    • ASR Performance: Did the system understand what you said?

    • Task Ease: Was it easy to find the message/flight/train you wanted?

    • Interaction Pace: Was the pace of interaction with the system appropriate?

    • User Expertise: Did you know what you could say at each point?

    • System Response: How often was the system sluggish and slow to reply to you?

    • Expected Behavior: Did the system work the way you expected it to?

    • Future Use: Do you think you’d use the system in the future?

    [M. Walker et al. (2001), «Quantitative and Qualitative Evaluation of Darpa Communicator Spoken Dialogue Systems», Proceedings of ACL]


    Evaluation

• User evaluation surveys are useful, but they are expensive and time-consuming

    • Not feasible to do user evaluations after each system change!

    • Need a way to automate the evaluation process

    • Possible solution: rely on evaluation metrics that can be directly extracted from the dialogue, and are known to correlate with user satisfaction

    • Improving these observable metrics should therefore increase user satisfaction


    Evaluation


• Task completion success: How often was the system able to complete its task successfully?
      Possible metrics: κ agreement for slot-filling applications; completion ratio

    • Efficiency costs: How efficient was the system in executing its task?
      Possible metrics: number of turns (from user, system, or both); total elapsed time

    • Quality costs: How good was the system interaction?
      Possible metrics: number of ASR rejection prompts; number of user barge-ins; number of error messages
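Most of these metrics can be extracted automatically from a dialogue log. A minimal sketch, assuming a hypothetical log format where each entry records a speaker and an event type (the format and event names are invented for illustration):

```python
# Hypothetical dialogue log: each entry is (speaker, event_type)
log = [
    ("system", "prompt"), ("user", "utterance"),
    ("system", "asr_rejection"), ("user", "utterance"),
    ("user", "barge_in"), ("system", "prompt"),
    ("user", "utterance"), ("system", "task_completed"),
]

# Extract the observable metrics directly from the log
metrics = {
    "user_turns": sum(1 for s, e in log if s == "user"),
    "asr_rejections": sum(1 for s, e in log if e == "asr_rejection"),
    "barge_ins": sum(1 for s, e in log if e == "barge_in"),
    "task_success": any(e == "task_completed" for s, e in log),
}
print(metrics)  # {'user_turns': 4, 'asr_rejections': 1, 'barge_ins': 1, 'task_success': True}
```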


    Evaluation

• Assume we have found a set of metrics for our domain... how do we weight their relative importance?

    • The PARADISE framework:

      1. We start by running a user satisfaction survey for a set of dialogues (via questionnaires such as the one we have seen)

      2. We also measure the performance of these dialogues according to our metrics (task success, number of turns, number of errors, etc.)

      3. Finally, we apply multiple regression to learn the weight of each metric (this way, we can ensure our metrics correlate with the user satisfaction)


    [M. Walker et al. (1997), "PARADISE: A general framework for evaluating spoken dialogue agents", Proceedings of ACL]
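The regression step can be sketched as an ordinary least-squares fit: each dialogue contributes one row of metric values and one satisfaction rating, and we solve for the weights. All numbers below are invented toy data, not results from the PARADISE experiments:

```python
import numpy as np

# Toy data: one row per dialogue, columns = [task_success, num_turns, num_errors]
X = np.array([[1.0, 12, 0],
              [1.0, 25, 2],
              [0.0, 30, 5],
              [1.0,  8, 1],
              [0.0, 20, 4]])
# User satisfaction ratings for the same dialogues (e.g. averaged survey scores)
y = np.array([4.5, 3.0, 1.5, 4.8, 2.0])

# Normalise each metric to zero mean / unit variance so the weights are comparable
Xn = (X - X.mean(axis=0)) / X.std(axis=0)
A = np.column_stack([Xn, np.ones(len(y))])  # add an intercept column

# Ordinary least squares: find the weights minimising ||A w - y||^2
weights, *_ = np.linalg.lstsq(A, y, rcond=None)
print(dict(zip(["task_success", "num_turns", "num_errors", "intercept"], weights)))
```

In practice one would use many more dialogues than metrics; with only a handful of rows, as here, the fit is purely illustrative.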


    Evaluation


PARADISE model of performance:

    Performance = Σ_{i=1}^{n} w_i · m_i

    where:

    • w_i is the weight of metric i (learned via regression from the user satisfaction ratings)

    • m_i is the measure for metric i (for instance, task success = 0.91)

    NB: the weights can be negative!
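The weighted sum is straightforward to compute once the weights are learned. A minimal sketch, with invented weights and metric values (note the negative weights on the cost metrics):

```python
# Hypothetical learned weights: success helps, turns and errors hurt
weights = {"task_success": 0.55, "num_turns": -0.20, "num_errors": -0.35}

def performance(measures, weights):
    """PARADISE performance: the weighted sum w_i * m_i over the metrics."""
    return sum(weights[name] * value for name, value in measures.items())

# Normalised metric values for one dialogue (invented)
m = {"task_success": 0.91, "num_turns": 0.4, "num_errors": 0.1}
print(performance(m, weights))  # 0.55*0.91 - 0.20*0.4 - 0.35*0.1 ≈ 0.3855
```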


    Evaluation

    • The PARADISE model works quite well for slot-filling applications

    • Appropriate metrics might be harder to define or apply to other domains

• For instance, a «companion»-type agent has no clear task, and shouldn’t necessarily minimise the number of turns!

    • For these domains, the notion of «appropriateness», i.e. the system’s ability to maintain a natural, fluid conversation over time, might be more important


    [D. Traum et al. (2004), «Evaluation of multi-party virtual reality dialogue interaction», in Proceedings of LREC]


    Wrap-up

Spoken dialogue is...

    1. a joint activity

    2. uncertain

    3. structured

    4. contextual

    5. goal-driven


    Dialogue is a joint activity


• The dialogue participants take turns and perform dialogue acts

    • They interpret each other’s utterances cooperatively

    • They cooperate to achieve mutual understanding (common ground)

    • They naturally align their way of speaking


    Dialogue is uncertain

    • Uncertainty is everywhere in dialogue:

    • Speech recognition is always error-prone!

    • Ambiguities are part and parcel of human language

    • User behaviours are often unpredictable

    • The dialogue context is rarely crystal clear


    That’s why probabilistic modelling is instrumental in the design of robust dialogue systems
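As a toy illustration of such probabilistic modelling, one can maintain a belief over the user's intent and update it with Bayes' rule when a noisy ASR hypothesis comes in. The intents, priors and likelihoods below are all invented for illustration:

```python
# Prior belief over the user's intent (invented numbers)
prior = {"book_ticket": 0.5, "cancel_ticket": 0.3, "get_info": 0.2}

# P(recognised utterance | intent): how well the ASR hypothesis fits each intent
likelihood = {"book_ticket": 0.7, "cancel_ticket": 0.1, "get_info": 0.2}

# Bayes' rule: posterior ∝ likelihood × prior, then normalise
unnorm = {intent: likelihood[intent] * prior[intent] for intent in prior}
z = sum(unnorm.values())
posterior = {intent: p / z for intent, p in unnorm.items()}
print(posterior)
```

Even though the ASR evidence is uncertain, the posterior concentrates on the most plausible intent while keeping the alternatives alive for later correction.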


    Dialogue is structured

• Dialogue is full of intricate structures:

      • Syntactic and prosodic phrases

      • Semantic relations within an utterance

      • Pragmatic relations between utterances / dialogue acts

      • References to entities, persons, places, events

      • Attentional and intentional structure


    A small experiment


    A small test to check how well you can concentrate on a task!


    Dialogue is contextual

• Context is crucial for dialogue processing:

      • Pronunciation varies depending on the context

      • Non-sentential utterances only make sense in a situation

      • Omnipresence of deictics in dialogue


    • Dialogue management & generation must adapt to contextual factors

    • More than just «context-sensitivity»: selective attention actually drives the processing


    Dialogue is goal-driven

    • We communicate to do things in the world

    • Dialogue acts guided by intentions and provoking effects

    • Verbal and non-verbal actions are intertwined

• Dialogue participants have multiple competing goals to fulfill, leading to a decision-theoretic problem of utility maximisation

    • To this end, they must plan their actions over time
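Such decision-theoretic action selection can be sketched as choosing the dialogue act with maximum expected utility under the current belief over user intents. All intents, actions and utility values below are invented for illustration:

```python
# Current belief over the user's intent (invented numbers)
belief = {"book_ticket": 0.8, "get_info": 0.2}

# utility[action][intent]: reward of performing the action given the true intent
utility = {
    "execute_booking":   {"book_ticket": 10, "get_info": -5},
    "ask_clarification": {"book_ticket": 2,  "get_info": 2},
}

def expected_utility(action):
    """Expected utility of an action: sum over intents of P(intent) * U(action, intent)."""
    return sum(belief[intent] * utility[action][intent] for intent in belief)

best = max(utility, key=expected_utility)
print(best, expected_utility(best))  # execute_booking 7.0
```

When the belief is less confident, the clarification act wins instead: asking a question has low but safe utility, which is exactly the trade-off a dialogue manager must plan over time.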


    Practical details

    • I will provide a detailed list of sections to read in the Martin & Jurafsky book

    • You can find more explanations there if the slides are unclear• You only have to study what we have seen in the lectures

    (e.g. what is on the slides)

    • Some material in the slides are not in the M&J book (these are of course part of the pensum!)

    • I will also provide a mock-up exam, so you know «what to expect» for the final exam


    Questions, comments

• Questions on the course?

    • Problems you are struggling with?

    • Comments or advice on my teaching style?
