Emotional Grounding in Spoken Dialog Systems
Jackson Liscombe, Giuseppe Riccardi, Dilek Hakkani-Tür
CU / AT&T, 10.14.04
The Problem: Emotion
In Spoken Dialog Systems, users can …
… start angry.
… get angry.
… end angry.
Outline
Previous Work
Corpus Description
Feature Extraction
Classification Experiments
Past Work: Isolated Speech
Acted data
Features: F0/pitch, energy, speaking rate
Researchers (late 1990s - present): Aubergé, Campbell, Cowie, Douglas-Cowie, Hirschberg, Liscombe, Mozziconacci, Oudeyer, Pereira, Roach, Scherer, Schröder, Tato, Yuan, Zetterholm, …
Past Work: Spoken Dialog Systems (1)
Batliner, Huber, Fischer, Spilker, Nöth (2003)
System: Verbmobil (Wizard-of-Oz scenarios)
Task: binary classification
Features: prosodic; lexical (POS tags, swear words); dialog acts (repeat/repair/insult)
Result: 0.1% relative improvement using dialog acts
Past Work: Spoken Dialog Systems (2)
Ang, Dhillon, Krupski, Shriberg, Stolcke (2002)
System: DARPA Communicator
Task: binary classification
Features: prosodic; lexical (language model); dialog acts (repeats/repairs)
Result: 4% relative improvement using dialog acts
Past Work: Spoken Dialog Systems (3)
Lee, Narayanan (2004)
System: SpeechWorks call-center system
Task: binary classification
Features: prosodic; lexical (weighted mutual information); dialog acts (repeat/rejection)
Result: 3% improvement using dialog acts
Past Work: Summary
Past research has focused on acoustic data.
But the field is moving toward grounding emotion in context (dialog acts).
Summer work: extend contextual features for better emotion prediction.
Corpus Description
AT&T’s “How May I Help You?” (SM) corpus (0300 Benchmark)
Labeled with “Voice Signature” information: user state (emotion), gender, age, accent type
Corpus Description

Statistic                         Training   Testing
number of user turns              15,013     5,000
number of dialogs                 4,259      1,431
turns per dialog (avg.)           3.5        3.5
words per turn (avg.)             9.0        9.9
User Emotion Distribution
[Bar chart "Emotion Label Distribution": percent of user turns by user state; x-axis: user state (Positive/Neutral, Somewhat Frustrated, Very Frustrated, Somewhat Angry, Very Angry, Somewhat Negative, Very Negative, Other); y-axis: percent (0 to 90).]
Emotion Labels
Original set: Positive/Neutral, Somewhat Frustrated, Very Frustrated, Somewhat Angry, Very Angry, Somewhat Negative, Very Negative, Other
Reduced set: Positive, Negative
Corpus Description: Binary User States

Statistic                                       Training   Testing
% of turns that are positive                    88.1%      73.1%
% of dialogs with at least one negative turn    24.8%      44.7%
% of negative dialogs that start negative       43.5%      59.9%
% of negative dialogs that end negative         42.4%      48.7%
Feature Set Space
[Grid of feature types (Prosodic, Lexical, Discourse) crossed with context window: turn_i alone; turn_i-1 and turn_i; turn_i-2, turn_i-1, and turn_i; and so on.]
Feature Set Space: Context Overview
[Same grid, with the context axis annotated: features from turn_i alone are Isolated; features drawing on turn_i-1, turn_i-2, … are Differentials and Prior Statistics.]
Lexical Features
Language model (n-grams)
Examples of words significantly correlated with negative user state (p < 0.001):
1st person pronouns: ‘I’, ‘me’
requests for a human operator: ‘person’, ‘talk’, ‘speak’, ‘human’, ‘machine’
billing-related words: ‘dollars’, ‘cents’
curse words: …
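The slides list example words only; as an illustrative sketch (not the authors' method), word/state association could be tested with a chi-square test over a 2x2 contingency table. The corpus layout (`turns` as (text, label) pairs) and the per-word p < 0.001 threshold are assumptions here.

```python
from collections import Counter
from scipy.stats import chi2_contingency

def negatively_correlated_words(turns, alpha=0.001):
    """Words whose presence is significantly associated with negative turns.

    `turns` is assumed to be a list of (transcription, label) pairs with
    label in {"positive", "negative"}; illustrative only.
    """
    neg_counts, pos_counts = Counter(), Counter()
    n_neg = n_pos = 0
    for text, label in turns:
        words = set(text.lower().split())
        if label == "negative":
            n_neg += 1
            neg_counts.update(words)
        else:
            n_pos += 1
            pos_counts.update(words)

    hits = []
    for word in set(neg_counts) | set(pos_counts):
        in_neg, in_pos = neg_counts[word], pos_counts[word]
        if in_neg + in_pos == n_neg + n_pos:
            continue  # word occurs in every turn: degenerate chi-square table
        # 2x2 table: rows = negative/positive turns, cols = word present/absent
        table = [[in_neg, n_neg - in_neg],
                 [in_pos, n_pos - in_pos]]
        _, p, _, _ = chi2_contingency(table)
        # keep words over-represented in negative turns at p < alpha
        if p < alpha and in_neg / max(n_neg, 1) > in_pos / max(n_pos, 1):
            hits.append((word, p))
    return sorted(hits, key=lambda wp: wp[1])
```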
Prosodic Features
Extracted with Praat, an open-source tool for speech analysis, synthesis, statistics, manipulation, …
Paul Boersma and David Weenink, University of Amsterdam, www.praat.org
Prosodic Features
Pitch (F0)1. overall minimum
2. overall maximum
3. overall median
4. overall standard deviation
5. mean absolute slope
6. slope of final vowel
7. longest vowel mean
Other8. local jitter over longest
vowel
Energy9. overall minimum
10. overall maximum
11. overall mean
12. overall standard deviation
13. longest vowel mean
Speaking Rate14. vowels per second
15. mean vowel length
16. ratio voiced frames to total frames
17. percent internal silence
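A minimal sketch of how a subset of these statistics could be computed, assuming the F0 and energy contours have already been extracted (e.g., with Praat) and a vowel count is available from an alignment; the function and argument names are illustrative, not from the original system.

```python
import numpy as np

def prosodic_stats(f0, energy, frame_step, n_vowels, duration):
    """Illustrative computation of a subset of the 17 prosodic features.

    f0:         per-frame F0 in Hz, 0.0 where unvoiced (as Praat reports it)
    energy:     per-frame energy/intensity values
    frame_step: seconds between analysis frames
    n_vowels:   number of vowels in the turn (e.g., from a forced alignment)
    duration:   turn duration in seconds
    """
    f0 = np.asarray(f0, dtype=float)
    energy = np.asarray(energy, dtype=float)
    voiced = f0[f0 > 0]  # voiced frames only
    return {
        "f0_min": float(voiced.min()),
        "f0_max": float(voiced.max()),
        "f0_median": float(np.median(voiced)),
        "f0_std": float(voiced.std()),
        # mean absolute slope: average F0 change between voiced frames, per second
        "f0_mean_abs_slope": float(np.mean(np.abs(np.diff(voiced))) / frame_step),
        "energy_min": float(energy.min()),
        "energy_max": float(energy.max()),
        "energy_mean": float(energy.mean()),
        "energy_std": float(energy.std()),
        "vowels_per_second": n_vowels / duration,
        "voiced_ratio": float((f0 > 0).mean()),  # ratio of voiced to total frames
    }
```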
A View Of Pitch
[Pitch tracks of the utterance “That amount is incorrect.” (call 20020508/0508095935fhf2910), one rendition labeled Somewhat Frustrated and one labeled Very Frustrated.]
Pitch/Energy/Speaking Rate: Change (call 20020221/0221080552atf1536)
…
Agent: I'm sorry, I didn't understand that. How may I help you?
User: please an operator please [ .hst ] may I speak  [Positive/Neutral]
Agent: I'm sorry we're having trouble. Can we try one more time? Just briefly tell me how I may help you.
User: may I speak to an assistant please  [Somewhat Frustrated]
Agent: Are you calling from your home phone?
User: yes  [Somewhat Angry]
Pitch/Energy/Speaking Rate: Change
[Chart for call 20020221/0221080552atf1536: z-scores (from -2 to 2) of median pitch, mean energy, and speaking rate for the three user utterances (Positive, Frustrated, Angry).]
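The chart reports z-score-normalized values; a minimal sketch of that normalization, assuming the raw per-utterance measurements for one call are available:

```python
import numpy as np

def zscores(values):
    """(x - mean) / std over one caller's utterance-level measurements."""
    x = np.asarray(values, dtype=float)
    std = x.std()
    return (x - x.mean()) / std if std > 0 else np.zeros_like(x)

# e.g., median pitch (Hz) for the three user turns of one call (made-up values)
print(zscores([182.0, 201.0, 236.0]))
```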
Pitch/Energy/Speaking Rate: No Change (call 20020317/0317220739atf9051)
Agent: Hello, I am an AT&T automated assistant. You can speak naturally to me. How may I help you?
User: I need to find out about a number that I don't recognize  [Positive/Neutral]
Agent: Would you like to look up a number you don't recognize on your bill?
User: yes I would  [Positive/Neutral]
Agent: Are you calling from your home phone?
User: yes I am  [Positive/Neutral]
Pitch/Energy/Speaking Rate: No Change
[Chart for call 20020317/0317220739atf9051: z-scores (from -2 to 2) of median pitch, mean energy, and speaking rate for the three user utterances, all labeled Positive.]
Feature Set Space: Baseline
[Feature set space grid with the Baseline cells highlighted: prosodic and lexical features of turn_i only (cf. Conclusions).]
Discourse Features
Dialog acts: greeting, re-prompt, confirmation, specification, acknowledgment, disambiguation, …
Feature Set Space: State-of-the-Art
[Feature set space grid with the State-of-the-Art cells highlighted: prosodic, lexical, and discourse (dialog act) features of turn_i (cf. Conclusions).]
Contextual Features
Prosodic (34): 1st and 2nd order differentials for each prosodic feature
Lexical (2): edit distance with the previous 2 turns
Discourse (10): turn number, call type, repetition with the previous 2 turns, dialog act repetition with the previous 2 turns
Other (2): user state of the previous 2 turns
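A sketch of two of these contextual features: word-level edit distance to earlier turns, and 1st/2nd order prosodic differentials. The per-turn data layout and the reading of "2nd order differential" as a second difference are assumptions for illustration.

```python
import numpy as np

def edit_distance(a, b):
    """Word-level Levenshtein distance between two transcriptions."""
    a, b = a.split(), b.split()
    d = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    d[:, 0] = np.arange(len(a) + 1)
    d[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,         # deletion
                          d[i, j - 1] + 1,         # insertion
                          d[i - 1, j - 1] + cost)  # substitution
    return int(d[len(a), len(b)])

def contextual_features(turns, i):
    """Contextual features for turn i within one dialog.

    Each turn is assumed to be a dict with a "text" string and a "prosody"
    dict holding the per-turn prosodic features (illustrative layout).
    """
    feats = {}
    # lexical: edit distance with the previous 2 turns
    for k in (1, 2):
        if i - k >= 0:
            feats[f"edit_dist_{k}"] = edit_distance(turns[i]["text"], turns[i - k]["text"])
    # prosodic: 1st order differential (delta) and 2nd order differential,
    # read here as the second difference over the last three turns
    for name, value in turns[i]["prosody"].items():
        if i - 1 >= 0:
            feats[f"d1_{name}"] = value - turns[i - 1]["prosody"][name]
        if i - 2 >= 0:
            feats[f"d2_{name}"] = (value - 2 * turns[i - 1]["prosody"][name]
                                   + turns[i - 2]["prosody"][name])
    return feats
```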
Feature Set Space: Contextual Features
[Feature set space grid with the Contextual cells highlighted: all feature types extended over the previous turns (turn_i-1, turn_i-2, …) in addition to turn_i.]
Experimental Design
Training size: 15,013 turns
Testing size: 5,000 turns
The most frequent user state (positive) accounts for 73.1% of the testing data
Learning algorithm: BoosTexter (boosting with weak learners); continuous and discrete valued features; 2000 iterations
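BoosTexter itself is an AT&T tool that is not reproduced here; as a rough, assumed stand-in, the sketch below boosts decision-stump weak learners for 2000 rounds with scikit-learn's AdaBoostClassifier on synthetic placeholder data sized like the corpus.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic placeholder data sized like the corpus: 15,013 training turns and
# 5,000 test turns; the real features would be the prosodic/lexical/discourse
# set, with discrete features one-hot or ordinally encoded.
X, y = make_classification(n_samples=20013, n_features=50,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=5000,
                                                    random_state=0)

# Boosted decision stumps as weak learners; 2000 rounds matches the slide's
# "2000 iterations" (this is an AdaBoost stand-in, not BoosTexter itself).
clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=2000)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```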
Performance Accuracy Summary

Feature Set         Accuracy   Rel. Improv. over Baseline
Most Freq. State    73.1%      -----
Baseline            76.1%      -----
State-of-the-Art    77.0%      1.2%
Contextual          79.0%      3.8%
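The relative improvements are measured against the Baseline accuracy, e.g. (79.0 - 76.1) / 76.1 ≈ 3.8% for the Contextual feature set and (77.0 - 76.1) / 76.1 ≈ 1.2% for State-of-the-Art.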
Conclusions
The Baseline (prosodic and lexical features) leads to improved emotion prediction over chance (the most frequent user state).
The State-of-the-Art feature set (baseline plus dialog acts) gives a further improvement.
The innovative contextual features improve emotion prediction even further.
Towards a computational model of emotional grounding.