Emotional Grounding in Spoken Dialog Systems
Jackson Liscombe, Giuseppe Riccardi, Dilek Hakkani-Tür
CU / AT&T, 10.14.04
The Problem: Emotion
In Spoken Dialog Systems, users can …
… start angry.
… get angry.
… end angry.
Outline
Previous Work
Corpus Description
Feature Extraction
Classification Experiments
Past Work: Isolated Speech
Acted data
Features: F0/pitch, energy, speaking rate
Researchers (late 1990s - present): Aubergé, Campbell, Cowie, Douglas-Cowie, Hirschberg, Liscombe, Mozziconacci, Oudeyer, Pereira, Roach, Scherer, Schröder, Tato, Yuan, Zetterholm, …
Past Work: Spoken Dialog Systems (1)
Batliner, Huber, Fischer, Spilker, Nöth (2003)
System: Verbmobil (Wizard-of-Oz scenarios)
Task: binary classification
Features: prosodic; lexical (POS tags, swear words); dialog acts (repeat/repair/insult)
Result: 0.1% relative improvement using dialog acts
Past Work: Spoken Dialog Systems (2)
Ang, Dhillon, Krupski, Shriberg, Stolcke (2002)
System: DARPA Communicator
Task: binary classification
Features: prosodic; lexical (language model); dialog acts (repeats/repairs)
Result: 4% relative improvement using dialog acts
Past Work: Spoken Dialog Systems (3)
Lee, Narayanan (2004)
System: SpeechWorks call-center system
Task: binary classification
Features: prosodic; lexical (weighted mutual information); dialog acts (repeat/rejection)
Result: 3% improvement using dialog acts
Past Work: Summary
Past research has focused on acoustic data.
But the field is moving toward grounding emotion in context (dialog acts).
Summer work: extend contextual features for better emotion prediction.
Corpus Description
AT&T’s “How May I Help You?” (SM) corpus (0300 Benchmark)
Labeled with “Voice Signature” information: user state (emotion), gender, age, accent type
Corpus Description

Statistic                         Training   Testing
number of user turns              15,013     5,000
number of dialogs                 4,259      1,431
turns per dialog (avg.)           3.5        3.5
words per turn (avg.)             9.0        9.9
User Emotion Distribution
[Bar chart "Emotion Label Distribution": percent of user turns by user state; x-axis: user state (Positive/Neutral, Somewhat Frustrated, Very Frustrated, Somewhat Angry, Very Angry, Somewhat Negative, Very Negative, Other); y-axis: percent (0 to 90).]
Emotion Labels
Original set: Positive/Neutral, Somewhat Frustrated, Very Frustrated, Somewhat Angry, Very Angry, Somewhat Negative, Very Negative, Other
Reduced set: Positive, Negative
Corpus Description: Binary User States

Statistic                                       Training   Testing
% of turns that are positive                    88.1%      73.1%
% of dialogs with at least one negative turn    24.8%      44.7%
% of negative dialogs that start negative       43.5%      59.9%
% of negative dialogs that end negative         42.4%      48.7%
Feature Set Space
[Grid of feature types (Prosodic, Lexical, Discourse) crossed with context window: turn_i alone; turn_i-1 and turn_i; turn_i-2, turn_i-1, and turn_i; and so on.]
Feature Set Space: Context Overview
[Same grid, with the context axis annotated: features from turn_i alone are Isolated; features drawing on turn_i-1, turn_i-2, … are Differentials and Prior Statistics.]
Lexical Features
Language model (n-grams)
Examples of words significantly correlated with negative user state (p < 0.001):
1st person pronouns: ‘I’, ‘me’
requests for a human operator: ‘person’, ‘talk’, ‘speak’, ‘human’, ‘machine’
billing-related words: ‘dollars’, ‘cents’
curse words: …
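The slides list example words only; as an illustrative sketch (not the authors' method), word/state association could be tested with a chi-square test over a 2x2 contingency table. The corpus layout (`turns` as (text, label) pairs) and the per-word p < 0.001 threshold are assumptions here.

```python
from collections import Counter
from scipy.stats import chi2_contingency

def negatively_correlated_words(turns, alpha=0.001):
    """Words whose presence is significantly associated with negative turns.

    `turns` is assumed to be a list of (transcription, label) pairs with
    label in {"positive", "negative"}; illustrative only.
    """
    neg_counts, pos_counts = Counter(), Counter()
    n_neg = n_pos = 0
    for text, label in turns:
        words = set(text.lower().split())
        if label == "negative":
            n_neg += 1
            neg_counts.update(words)
        else:
            n_pos += 1
            pos_counts.update(words)

    hits = []
    for word in set(neg_counts) | set(pos_counts):
        in_neg, in_pos = neg_counts[word], pos_counts[word]
        if in_neg + in_pos == n_neg + n_pos:
            continue  # word occurs in every turn: degenerate chi-square table
        # 2x2 table: rows = negative/positive turns, cols = word present/absent
        table = [[in_neg, n_neg - in_neg],
                 [in_pos, n_pos - in_pos]]
        _, p, _, _ = chi2_contingency(table)
        # keep words over-represented in negative turns at p < alpha
        if p < alpha and in_neg / max(n_neg, 1) > in_pos / max(n_pos, 1):
            hits.append((word, p))
    return sorted(hits, key=lambda wp: wp[1])
```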
Prosodic Features
Extracted with Praat, an open-source tool for speech analysis, synthesis, statistics, manipulation, …
Paul Boersma and David Weenink, University of Amsterdam, www.praat.org
Prosodic Features
Pitch (F0)1. overall minimum
2. overall maximum
3. overall median
4. overall standard deviation
5. mean absolute slope
6. slope of final vowel
7. longest vowel mean
Other8. local jitter over longest
vowel
Energy9. overall minimum
10. overall maximum
11. overall mean
12. overall standard deviation
13. longest vowel mean
Speaking Rate14. vowels per second
15. mean vowel length
16. ratio voiced frames to total frames
17. percent internal silence
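A minimal sketch of how a subset of these statistics could be computed, assuming the F0 and energy contours have already been extracted (e.g., with Praat) and a vowel count is available from an alignment; the function and argument names are illustrative, not from the original system.

```python
import numpy as np

def prosodic_stats(f0, energy, frame_step, n_vowels, duration):
    """Illustrative computation of a subset of the 17 prosodic features.

    f0:         per-frame F0 in Hz, 0.0 where unvoiced (as Praat reports it)
    energy:     per-frame energy/intensity values
    frame_step: seconds between analysis frames
    n_vowels:   number of vowels in the turn (e.g., from a forced alignment)
    duration:   turn duration in seconds
    """
    f0 = np.asarray(f0, dtype=float)
    energy = np.asarray(energy, dtype=float)
    voiced = f0[f0 > 0]  # voiced frames only
    return {
        "f0_min": float(voiced.min()),
        "f0_max": float(voiced.max()),
        "f0_median": float(np.median(voiced)),
        "f0_std": float(voiced.std()),
        # mean absolute slope: average F0 change between voiced frames, per second
        "f0_mean_abs_slope": float(np.mean(np.abs(np.diff(voiced))) / frame_step),
        "energy_min": float(energy.min()),
        "energy_max": float(energy.max()),
        "energy_mean": float(energy.mean()),
        "energy_std": float(energy.std()),
        "vowels_per_second": n_vowels / duration,
        "voiced_ratio": float((f0 > 0).mean()),  # ratio of voiced to total frames
    }
```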
A View Of Pitch
[Pitch tracks of the utterance “That amount is incorrect.” (call 20020508/0508095935fhf2910), one rendition labeled Somewhat Frustrated and one labeled Very Frustrated.]
Pitch/Energy/Speaking Rate: Change (call 20020221/0221080552atf1536)
…
Agent: I'm sorry, I didn't understand that. How may I help you?
User: please an operator please [ .hst ] may I speak  [Positive/Neutral]
Agent: I'm sorry we're having trouble. Can we try one more time? Just briefly tell me how I may help you.
User: may I speak to an assistant please  [Somewhat Frustrated]
Agent: Are you calling from your home phone?
User: yes  [Somewhat Angry]
Pitch/Energy/Speaking Rate: Change
[Chart for call 20020221/0221080552atf1536: z-scores (from -2 to 2) of median pitch, mean energy, and speaking rate for the three user utterances (Positive, Frustrated, Angry).]
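The chart reports z-score-normalized values; a minimal sketch of that normalization, assuming the raw per-utterance measurements for one call are available:

```python
import numpy as np

def zscores(values):
    """(x - mean) / std over one caller's utterance-level measurements."""
    x = np.asarray(values, dtype=float)
    std = x.std()
    return (x - x.mean()) / std if std > 0 else np.zeros_like(x)

# e.g., median pitch (Hz) for the three user turns of one call (made-up values)
print(zscores([182.0, 201.0, 236.0]))
```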
Pitch/Energy/Speaking Rate: No Change (call 20020317/0317220739atf9051)
Agent: Hello, I am an AT&T automated assistant. You can speak naturally to me. How may I help you?
User: I need to find out about a number that I don't recognize  [Positive/Neutral]
Agent: Would you like to look up a number you don't recognize on your bill?
User: yes I would  [Positive/Neutral]
Agent: Are you calling from your home phone?
User: yes I am  [Positive/Neutral]
Pitch/Energy/Speaking Rate: No Change
[Chart for call 20020317/0317220739atf9051: z-scores (from -2 to 2) of median pitch, mean energy, and speaking rate for the three user utterances, all labeled Positive.]
Feature Set Space: Baseline
[Feature set space grid with the Baseline cells highlighted: prosodic and lexical features of turn_i only (cf. Conclusions).]
Discourse Features
Dialog acts: greeting, re-prompt, confirmation, specification, acknowledgment, disambiguation, …
Feature Set Space: State-of-the-Art
[Feature set space grid with the State-of-the-Art cells highlighted: prosodic, lexical, and discourse (dialog act) features of turn_i (cf. Conclusions).]
Contextual Features
Prosodic (34): 1st and 2nd order differentials for each prosodic feature
Lexical (2): edit distance with the previous 2 turns
Discourse (10): turn number, call type, repetition with the previous 2 turns, dialog act repetition with the previous 2 turns
Other (2): user state of the previous 2 turns
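A sketch of two of these contextual features: word-level edit distance to earlier turns, and 1st/2nd order prosodic differentials. The per-turn data layout and the reading of "2nd order differential" as a second difference are assumptions for illustration.

```python
import numpy as np

def edit_distance(a, b):
    """Word-level Levenshtein distance between two transcriptions."""
    a, b = a.split(), b.split()
    d = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    d[:, 0] = np.arange(len(a) + 1)
    d[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,         # deletion
                          d[i, j - 1] + 1,         # insertion
                          d[i - 1, j - 1] + cost)  # substitution
    return int(d[len(a), len(b)])

def contextual_features(turns, i):
    """Contextual features for turn i within one dialog.

    Each turn is assumed to be a dict with a "text" string and a "prosody"
    dict holding the per-turn prosodic features (illustrative layout).
    """
    feats = {}
    # lexical: edit distance with the previous 2 turns
    for k in (1, 2):
        if i - k >= 0:
            feats[f"edit_dist_{k}"] = edit_distance(turns[i]["text"], turns[i - k]["text"])
    # prosodic: 1st order differential (delta) and 2nd order differential,
    # read here as the second difference over the last three turns
    for name, value in turns[i]["prosody"].items():
        if i - 1 >= 0:
            feats[f"d1_{name}"] = value - turns[i - 1]["prosody"][name]
        if i - 2 >= 0:
            feats[f"d2_{name}"] = (value - 2 * turns[i - 1]["prosody"][name]
                                   + turns[i - 2]["prosody"][name])
    return feats
```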
Feature Set Space: Contextual Features
[Feature set space grid with the Contextual cells highlighted: all feature types extended over the previous turns (turn_i-1, turn_i-2, …) in addition to turn_i.]
Experimental Design
Training size: 15,013 turns
Testing size: 5,000 turns
The most frequent user state (positive) accounts for 73.1% of the testing data
Learning algorithm: BoosTexter (boosting with weak learners); continuous and discrete valued features; 2000 iterations
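BoosTexter itself is an AT&T tool that is not reproduced here; as a rough, assumed stand-in, the sketch below boosts decision-stump weak learners for 2000 rounds with scikit-learn's AdaBoostClassifier on synthetic placeholder data sized like the corpus.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic placeholder data sized like the corpus: 15,013 training turns and
# 5,000 test turns; the real features would be the prosodic/lexical/discourse
# set, with discrete features one-hot or ordinally encoded.
X, y = make_classification(n_samples=20013, n_features=50,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=5000,
                                                    random_state=0)

# Boosted decision stumps as weak learners; 2000 rounds matches the slide's
# "2000 iterations" (this is an AdaBoost stand-in, not BoosTexter itself).
clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=2000)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```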
Performance Accuracy Summary

Feature Set         Accuracy   Rel. Improv. over Baseline
Most Freq. State    73.1%      -----
Baseline            76.1%      -----
State-of-the-Art    77.0%      1.2%
Contextual          79.0%      3.8%
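The relative improvements are measured against the Baseline accuracy, e.g. (79.0 - 76.1) / 76.1 ≈ 3.8% for the Contextual feature set and (77.0 - 76.1) / 76.1 ≈ 1.2% for State-of-the-Art.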
Conclusions
The Baseline (prosodic and lexical features) leads to improved emotion prediction over chance (the most frequent user state).
The State-of-the-Art feature set (baseline plus dialog acts) gives a further improvement.
The innovative contextual features improve emotion prediction even further.
Towards a computational model of emotional grounding.