Vocal Emotion Recognition: State-of-the-Art in Classification of Real-Life Emotions
October 26, 2010
Stefan Steidl
International Computer Science Institute (ICSI)
at Berkeley, CA
Overview
1 Different Perspectives on Emotion Recognition
2 FAU Aibo Emotion Corpus
3 Own Results on Emotion Classification
4 INTERSPEECH 2009 Emotion Challenge
S. Steidl: Vocal Emotion Recognition
Overview
1 Different Perspectives on Emotion Recognition
    Psychology of Emotion
    Computer Science
2 FAU Aibo Emotion Corpus
3 Own Results on Emotion Classification
4 INTERSPEECH 2009 Emotion Challenge
Facial Expressions of Emotion
Universal Basic Emotions
Paul Ekman
postulates the existence of 6 basic emotions:
anger, fear, disgust, surprise, joy, sadness
other emotions are mixed or blended emotions
universal facial expressions
Terminology
Different affective states [1]:
type of affective state  intensity  duration  synchronization  event focus  appraisal elicitation  rapidity of change  behavioral impact
emotion                  H-VH       M         VH               VH           VH                     VH                  VH
mood                     M-H        H         M                M            M                      H                   M
interpersonal stances    M-H        M-H       M                H            M                      VH                  H
attitudes                L-H        H-VH      L                L            M                      L-M                 M
personality traits       L-M        VH        L                L            L                      L                   M
L: low, M: medium, H: high, VH: very high; a dash indicates a range
[1] K. R. Scherer: Vocal communication of emotion: A review of research paradigms, Speech Communication, Vol. 40, pp. 227-256, 2003
Terminology (cont.)
Definition of Emotion
Emotion (Scherer)
episodes of coordinated changes in several components including at least:
neurophysiological activation,
motor expression, and
subjective feeling but possibly also
action tendencies and cognitive processes
in response to external or internal events of major significance to the organism
Vocal Expression of Emotion
Results from studies in Psychology of Emotion
                              anger/  fear/    sadness  joy/     boredom  stress
                              rage    panic             elation
Intensity                     ↑       ↑        ↓        ↑                 ↑
F0 floor/mean                 ↑       ↑        ↓        ↑                 ↑
F0 variability                ↑                ↓        ↑        ↓
F0 range                      ↑       ↑(↓)(1)  ↓        ↑        ↓
Sentence contour              ↓                ↓
High frequency energy         ↑       ↑        ↓        (↑)(2)
Speech and articulation rate  ↑       ↑        ↓        (↑)(2)   ↓
(1) Banse and Scherer found a decrease in F0 range
(2) inconclusive evidence
Goal
Classification of the subject’s actual emotional state (some sort of lie detector for emotions)
Human-Computer Interaction (HCI)
Emotion-Related User States
naturally occurring states of users in human-machine communication
emotions in a broader sense
coordinated changes in several components NOT required
classification of the perceived emotional state, not necessarily the actual emotion of the speaker
Pattern Recognition
Pattern Recognition Point of View
classification task: choose 1 of n given classes
discrimination of classes rather than classification
definition of “good” features
machine classification
Actually not needed
definition of term emotion
information on how specific features change
Emotional Speech Corpora
Acted data
based on Basic Emotions theory
suited for studying prototypical emotions
corpora easy to create (inexpensive, no labeling process)
high audio quality
balanced classes
neutral linguistic content (focus on acoustics only)
high recognition results
Emotional Speech Corpora (cont.)
Popular corpora
Emotional Prosody Speech and Transcript corpus (LDC): 15 classes
Berlin Emotional Speech Database (EmoDB): 7 classes
89.9 % accuracy (speaker-independent LOSO evaluation, speaker adaptation, feature selection) [2]
Danish Emotional Speech Corpus: 5 classes
74.5 % accuracy (10-fold SCV, feature selection) [3]
[2] B. Vlasenko et al.: Combining Frame and Turn-Level Information for Robust Recognition of Emotions within Speech, INTERSPEECH 2007
[3] Schuller et al.: Emotion Recognition in the Noise Applying Large Acoustic Feature Sets, Speech Prosody 2006
Emotional Speech Corpora (cont.)
Naturally occurring emotions
states that actually appear in HCI (real applications)
difficult to create (appropriate scenario needed, ethical concerns, need to label data)
low emotional intensity
in general ≥ 80% neutral
low audio quality (reverberation, noise, far-distance microphones)
needed for machine classification (because conditions between training and test must not differ too much)
research on both acoustic and linguistic features possible
new research questions: optimal emotion unit
almost no corpora large enough for machine classification available (do not exist or are not available for research)
Overview
1 Different Perspectives on Emotion Recognition
2 FAU Aibo Emotion Corpus
    Scenario
    Labeling of User States
    Data-driven Dimensions of Emotion
    Units of Analysis
    Sparse Data Problem
3 Own Results on Emotion Classification
4 INTERSPEECH 2009 Emotion Challenge
The FAU Aibo Emotion Corpus
51 children (30 f, 21 m) at the age of 10 to 13
8.9 hours of spontaneous speech (mainly short commands)
48,401 words in 13,642 audio files
FAU Aibo Emotion Corpus (cont.)
database for CEICES and INTERSPEECH 2009 Emotion Challenge
available for scientific, non-commercial use: http://www5.cs.fau.de/FAUAiboEmotionCorpus
[4] S. Steidl: Automatic Classification of Emotion-Related User States in Spontaneous Children’s Speech, Logos Verlag, Berlin
available online: http://www5.cs.fau.de/en/our-team/steidl-stefan/dissertation/
Emotion-Related User States
11 categories, defined by inspecting the data prior to labeling:
joyful, surprised, motherese, neutral, bored, emphatic, helpless, touchy/irritated, reprimanding, angry, other
motherese: the way mothers/parents address their babies, either because Aibo is well-behaving or because the child wants Aibo to obey; positive equivalent to reprimanding
emphatic: pronounced, accentuated, sometimes hyper-articulated way of speaking, but without showing any emotion
reprimanding: the child is reproachful, reprimanding, ‘wags the finger’
Labeling of User States
Labeling:
5 students of linguistics
holistic labeling on the word level
majority vote

emotion category    words        %
angry (A)             134    0.3 %
touchy (T)            419    0.9 %
reprimanding (R)      463    1.0 %
emphatic (E)        2,807    5.8 %
neutral (N)        39,975   82.6 %
motherese (M)       1,311    2.7 %
joyful (J)            109    0.2 %
...
all                48,401  100.0 %
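The majority vote over the five labelers can be sketched as follows. This is only an illustration: the function name `majority_vote` and the threshold of three agreeing labelers are assumptions made here, since the slide states only that a majority vote is used.

```python
from collections import Counter

def majority_vote(labels, min_votes=3):
    """Return the label given by at least `min_votes` labelers, else None.

    min_votes=3 (absolute majority of 5 labelers) is an assumption of this sketch.
    """
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes >= min_votes else None

print(majority_vote(["N", "N", "E", "N", "A"]))  # N
print(majority_vote(["A", "A", "T", "T", "N"]))  # None
```

Words without a sufficiently large majority get no label in this sketch; how such words are handled in the corpus itself is not stated on the slide.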
Labeling of User States (cont.)
Confusion matrix
majority vote         A     T     R     E     N     M     J
angry (A)          43.3  13.0  12.9  12.1  18.1   0.1   0.0
touchy (T)          4.5  42.9  11.7  13.7  23.5   1.0   0.1
reprimanding (R)    3.8  15.7  45.8  14.0  18.2   1.3   0.1
emphatic (E)        1.3   5.8   6.7  53.6  29.9   1.2   0.5
neutral (N)         0.4   2.2   1.5  13.9  77.8   2.7   0.5
motherese (M)       0.0   0.8   1.4   4.9  30.4  61.1   0.9
joyful (J)          0.1   0.6   1.1   7.3  32.4   2.0  54.2
Data-driven Dimensions of Emotions
Non-metric dimensional scaling:
arranging the emotion categories in the 2-dimensional space
states that are often confused are close to each other
[Figure: 2-dimensional arrangement of the categories angry, touchy, reprimanding, emphatic, neutral, motherese, joyful; axes: valence (negative to positive) and interaction (−interaction to +interaction)]
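As a rough illustration of the input to such a scaling, confusions between categories can be turned into a symmetric similarity, so that categories which are confused often get a high value. Averaging both directions of confusion is an assumption made for this sketch and not necessarily the preprocessing behind the slides; only three rows of the labeling confusion matrix are reproduced here.

```python
from itertools import combinations

# Confusions taken from the labeling confusion matrix (rows: majority vote).
confusion = {
    "E": {"E": 53.6, "N": 29.9, "M": 1.2},
    "N": {"E": 13.9, "N": 77.8, "M": 2.7},
    "M": {"E": 4.9, "N": 30.4, "M": 61.1},
}

def symmetric_similarity(conf):
    """Average the two directions of confusion for every pair of categories."""
    return {
        (a, b): (conf[a][b] + conf[b][a]) / 2
        for a, b in combinations(conf, 2)
    }

similarity = symmetric_similarity(confusion)
closest = max(similarity, key=similarity.get)
print(closest)  # ('E', 'N')
```

A non-metric scaling would then only use the rank order of these values, placing the most similar pair closest together.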
Units of Analysis
Units of analysis
[Figure: example turns Ohm_18_342 and Ohm_18_343 segmented on the word level, chunk level, and turn level]
Advantages/disadvantages of larger units
+ more information
− less emotional homogeneity
Sparse Data Problem
Super classes:
Anger: angry, touchy/irritated, reprimanding
Emphatic
Neutral
Motherese
[Figure: two scaling plots. One shows the categories angry, reprimanding, neutral, motherese, touchy, emphatic, joyful (S = 0.32, RSQ = 0.73); the other shows the super classes Neutral, Anger, Motherese, Emphatic (S = 0.19, RSQ = 0.90)]
Sparse Data Problem (cont.)
Data subsets
[Figure: data subsets Aibo word set, Aibo chunk set, Aibo turn set, Aibo corpus]

data set         number of words   taken from # chunks   taken from # turns
Aibo corpus               48,401                18,216               13,642
Aibo word set              6,070                 4,543                3,996
Aibo chunk set            13,217                 4,543                3,996
Aibo turn set             17,618                 6,413                3,996
Overview
1 Different Perspectives on Emotion Recognition
2 FAU Aibo Emotion Corpus
3 Own Results on Emotion Classification
    Results for different Units of Analysis
    Machine vs. Human
    Feature Types and their Relevance
4 INTERSPEECH 2009 Emotion Challenge
Most Appropriate Unit of Analysis
Classification
complete set of features
classification with Linear Discriminant Analysis (LDA)
51-fold speaker-independent cross-validation

unit of analysis   number of features   number of samples   average recall
word level                        265   6,070 words                 67.2 %
chunk level                       700   4,543 chunks                68.9 %
turn level                        700   3,996 turns                 63.2 %

Chunks: best compromise between
length of the segment
homogeneity of the emotional state within the segment
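A speaker-independent cross-validation with one fold per speaker (51 folds for 51 children) and the class-wise average recall can be sketched as below. The helper names `leave_one_speaker_out` and `average_recall` are hypothetical, and the classifier itself (LDA on the slides) is left out; this is a sketch of the evaluation protocol only.

```python
from collections import defaultdict

def leave_one_speaker_out(speakers):
    """Yield (held_out_speaker, train_indices, test_indices), one fold per speaker."""
    by_speaker = defaultdict(list)
    for idx, spk in enumerate(speakers):
        by_speaker[spk].append(idx)
    for spk, test_idx in by_speaker.items():
        train_idx = [i for i, s in enumerate(speakers) if s != spk]
        yield spk, train_idx, test_idx

def average_recall(y_true, y_pred):
    """Unweighted mean of the per-class recalls."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

folds = list(leave_one_speaker_out(["a", "a", "b", "c", "b"]))
print(len(folds))  # 3
print(average_recall(["A", "A", "N", "N"], ["A", "N", "N", "N"]))  # 0.75
```

Because every sample of the held-out speaker is excluded from training, no fold can benefit from having seen the test speaker's voice.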
Machine Classifier vs. Human Labeler
Entropy based measure:
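[Figure: worked example for one word. Labelers 1 to 4 decide A, E, A, A, which gives the distribution A 0.75, E 0.25, N 0.0, M 0.0. The decoder decides N (A 0.0, E 0.0, N 1.0, M 0.0). Both distributions are added with weight 1/2 each: A 0.375, E 0.125, N 0.5, M 0.0, with entropy Hdec = 1.41]
implicit weighting of classification ‘errors’ depending on the word that is classified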
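The numbers of the worked example can be reproduced with a few lines. This is a sketch under the reading that the labeler distribution and the decoder decision are mixed with equal weights of 1/2 and that the entropy uses the base-2 logarithm; the function names are hypothetical.

```python
import math

def entropy(dist):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def decoder_entropy(labeler_decisions, decoder_decision, classes):
    """Entropy of the equal-weight mixture of labeler votes and decoder decision."""
    n = len(labeler_decisions)
    mixed = {}
    for c in classes:
        p_labelers = labeler_decisions.count(c) / n
        p_decoder = 1.0 if c == decoder_decision else 0.0
        mixed[c] = 0.5 * p_labelers + 0.5 * p_decoder
    return entropy(mixed)

h_dec = decoder_entropy(["A", "E", "A", "A"], "N", ["A", "E", "N", "M"])
print(round(h_dec, 2))  # 1.41
```

If the decoder had chosen A instead, the mixture would be more peaked and the entropy lower, which is the implicit weighting of errors: disagreeing with unanimous labelers costs more than disagreeing on a word the labelers themselves were split on.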
Machine Classifier vs. Human Labeler (cont.)
Classification: Aibo word set
[Figure: distribution of the entropy values for the average human labeler and the machine classifier; x-axis: entropy, y-axis: rel. frequency [%]]
[5] S. Steidl, M. Levit, A. Batliner, E. Nöth, H. Niemann: “Of All Things the Measure is Man” – Classification of Emotions and Inter-Labeler Consistency, ICASSP 2005
Evaluation of Different Types of Features
Types of features
acoustic features
    prosodic features
    spectral features
    voice quality features
linguistic features
Evaluation
Artificial Neural Networks (ANN)
51-fold speaker-independent cross-validation
combination by early or late fusion
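The difference between early and late fusion can be sketched in a few lines. Early fusion joins the feature vectors before a single classifier is trained; late fusion combines the outputs of separately trained classifiers. Averaging the class scores is just one possible combination rule and is an assumption of this sketch, as are the function names.

```python
def early_fusion(acoustic, linguistic):
    """Early fusion: concatenate both feature vectors before classification."""
    return list(acoustic) + list(linguistic)

def late_fusion(class_scores):
    """Late fusion: average the class scores of separately trained classifiers.

    class_scores is a list of dicts mapping class name to score, one per classifier.
    """
    classes = class_scores[0].keys()
    fused = {c: sum(s[c] for s in class_scores) / len(class_scores) for c in classes}
    return max(fused, key=fused.get), fused

print(early_fusion([0.1, 0.2], [1.0]))  # [0.1, 0.2, 1.0]
label, fused = late_fusion([{"A": 0.6, "N": 0.4}, {"A": 0.1, "N": 0.9}])
print(label)  # N
```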
Acoustic Features: Prosody
Prosody
suprasegmental characteristics such as
pitch contour
energy contour
temporal shortening/lengthening of words
duration of pauses between words
Acoustic Features: Prosody (cont.)
Classification results: Aibo chunk set
[Figure: bar chart of average recall [%] for the prosodic feature groups F0 (29), duration (37), energy (25), pauses (16), and all; values appearing in the chart: 50.6, 54.4, 58.5, 59.0, 42.0]
Acoustic Features: Spectral Characteristics (cont.)
Classification results: Aibo chunk set
[Figure: bar chart of average recall [%] with the groups prosody (107), MFCC (24), formants (16), jitter/shimmer (4), HNR (2), TEO (64), and best combination; values appearing in the chart: 59.0, 58.9, 48.2]
Acoustic Features: Voice Quality
Classification results: Aibo chunk set
[Figure: bar chart of average recall [%], reproduced as a table; the bar for the best combination carries no value]

feature group (number of features)   average recall [%]
prosody (107)                                      59.0
MFCC (24)                                          58.9
formants (16)                                      48.2
jitter/shimmer (4)                                 47.0
HNR (2)                                            32.5
TEO (64)                                           52.3
Acoustic Features: Combination
Classification results: Aibo chunk set
[Figure: bar chart of average recall [%], reproduced as a table]

feature group (number of features)   average recall [%]
prosody (107)                                      59.0
MFCC (24)                                          58.9
formants (16)                                      48.2
jitter/shimmer (4)                                 47.0
HNR (2)                                            32.5
TEO (64)                                           52.3
best combination                                   65.4
Linguistic Features
Types of linguistic features
word characteristics
    average word length (number of letters, phonemes, syllables)
    proportion of word fragments
    average number of repetitions
part-of-speech features
unigram models
bag-of-words
Linguistic Features (cont.)
Part-of-Speech (POS) Features
only 6 coarse POS categories
can be annotated without considering context
[Figure: share of each POS category (% of total) for the classes Anger, Emphatic, Neutral, Motherese, Joyful, and Other. POS categories: nouns, proper names; inflected adjectives; present/past participles, not inflected adjectives; (other) verbs, infinitives; auxiliaries; articles, pronouns, particles, interjections]
Linguistic Features (cont.)
Unigram Models
$u(w, e) = \log_{10} \frac{P(e \mid w)}{P(e)}$

Anger                  P(A|w)    Emphatic         P(E|w)
böser (bad)            29.2 %    stopp (stop)     30.5 %
stehenbleiben (stop)   18.9 %    halt (halt)      29.3 %
nein (no)              17.0 %    links (left)     20.5 %
aufstehen (get up)     12.3 %    rechts (right)   18.9 %
Aibo (Aibo)            10.1 %    nein (no)        17.6 %

Neutral                P(N|w)    Motherese        P(M|w)
okay (okay)            98.6 %    fein (fine)      57.5 %
und (and)              98.5 %    ganz (very)      41.9 %
Stück (bit)            98.5 %    braver (good)    36.0 %
in (in)                98.2 %    sehr (very)      23.5 %
noch (still)           96.2 %    brav (good)      21.7 %
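The unigram score u(w, e) compares how likely emotion e is given word w with how likely e is overall. A minimal sketch that estimates both probabilities by relative frequencies from labeled words follows; the toy data and the function name `unigram_scores` are made up for illustration, and no smoothing is applied, so unseen word/emotion pairs simply get no score.

```python
import math
from collections import Counter

def unigram_scores(labeled_words):
    """Estimate u(w, e) = log10(P(e|w) / P(e)) from (word, emotion) pairs."""
    pair_counts = Counter(labeled_words)
    word_counts = Counter(w for w, _ in labeled_words)
    emotion_counts = Counter(e for _, e in labeled_words)
    total = len(labeled_words)
    scores = {}
    for (w, e), n in pair_counts.items():
        p_e_given_w = n / word_counts[w]
        p_e = emotion_counts[e] / total
        scores[(w, e)] = math.log10(p_e_given_w / p_e)
    return scores

# Toy data (hypothetical): (word, emotion label) pairs.
toy = [("fein", "M"), ("fein", "M"), ("fein", "N"),
       ("stopp", "E"), ("stopp", "N"), ("und", "N")]
scores = unigram_scores(toy)
print(round(scores[("fein", "M")], 3))  # 0.301
```

A positive score means the word makes the emotion more likely than its prior, a negative score less likely, and 0 means the word carries no information about that emotion.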
Linguistic Features (cont.)
Bag-of-Words
utterance: Aibo, geh nach links! (Aibo, move to the left!)
[Figure: the words Aibo, geh, nach, links of the utterance are mapped to the entries of a sparse vector indexed by the vocabulary (Aibolein, allen, ...)]
representation of the linguistic content
word order getting lost
various dimensionality reduction techniques
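A bag-of-words vector for the example utterance can be sketched as follows. The small vocabulary, the lowercasing, the regular-expression tokenization, and the use of raw counts are assumptions made for illustration; the slide does not specify how the vector entries are weighted.

```python
import re

def bag_of_words(utterance, vocabulary):
    """Count how often each vocabulary word occurs; word order is discarded."""
    tokens = re.findall(r"\w+", utterance.lower())
    return [tokens.count(word) for word in vocabulary]

# Hypothetical miniature vocabulary; the real one is far larger.
vocabulary = ["aibo", "aibolein", "allen", "geh", "links", "nach"]
print(bag_of_words("Aibo, geh nach links!", vocabulary))  # [1, 0, 0, 1, 1, 1]
```

Reordering the words of the utterance yields the same vector, which is exactly the loss of word order noted above.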
Linguistic Features (cont.)
Classification results: Aibo chunk set
[Figure: bar chart of average recall [%] for the linguistic feature groups POS (6), unigram models (16), word statistics (6), BOW (254 → 50), and best combination; values appearing in the chart: 54.3, 56.1, 61.9, 61.9, 62.2]
Combination of Acoustic and Linguistic Features
Classification results: Aibo chunk set
[Figure: bar chart of average recall [%], reproduced as a table]

features                                                   average recall [%]
acoustic features, best combination (late fusion, ANN)                   65.4
linguistic features, best combination (late fusion, ANN)                 62.2
combination (late fusion, ANN)                                           67.1
combination (early fusion, LDA)                                          68.9
Similar Results within CEICES
CEICES: Combining Efforts for Improving Automatic Classification of Emotional User States
collaboration of various research groups within the European Network of Excellence HUMAINE (2004-2007)
state-of-the-art feature set with ≥ 4,000 features
SVM (linear kernel), 3-fold speaker-independent cross-validation
selection of 150 features (SFFS): surviving feature types?
only chunk based features, no information outside Aibo chunk set
[6] A. Batliner, S. Steidl, B. Schuller, D. Seppi, T. Vogt, J. Wagner, L. Devillers, L. Vidrascu, V. Aharonson, L. Kessous, N. Amir: Whodunnit – Searching for the Most Important Feature Types Signalling Emotion-Related User States in Speech, Computer Speech and Language, Vol. 25, Issue 1 (January 2011), pp. 4-28
Similar Results within CEICES (cont.)

                             SFFS, 150 features over all types     SFFS, 150 acoustic and 150 linguistic features
feature type      # total    #    F MEASURE   SHARE   PORTION      #    F MEASURE   SHARE   PORTION
duration              391   10         49.6     6.7       2.6     28         54.9    18.7       7.2
energy                265   32         56.3    21.3      12.1     33         56.9    22.0      12.5
F0                    333   16         46.8    10.7       4.8     23         46.7    15.3       6.9
spectrum              656   15         46.2    10.0       2.3     17         49.9    11.3       2.6
cepstrum             1699   16         46.4    10.7       1.0     23         50.4    15.3       1.4
voice quality         153    7         38.7     4.7       4.6     11         41.5     7.3       7.2
wavelets              216    5         35.3     3.4       2.3     15         44.9    10.0       6.9
all acoustic         3713  101            –    67.3       2.7    150         63.4   100.0       4.0
BOW                   476   25         37.4    16.7       5.3     94         53.2    62.7      19.7
POS                    31    7         48.1     4.7      22.6     27         54.9    18.0      87.1
higher semantics       12   17         56.0    11.3     141.7     27         57.9    18.0     225.0
varia                  12    0            –     0.0       0.0      2            –     0.1      16.7
all linguistic        531   49            –    32.7       9.6    150         62.6   100.0      28.2
all                  4244  150         65.5   100.0       3.5
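The SHARE and PORTION rows appear to follow directly from the counts: SHARE relates the selected features of one type to all selected features, PORTION relates them to the number of features available in that type. These definitions are inferred from the numbers on the slide, not quoted from it, and the function name is hypothetical.

```python
def share_and_portion(selected, total_in_type, total_selected):
    """Return (SHARE, PORTION) in percent for one feature type.

    SHARE   = selected features of this type / all selected features
    PORTION = selected features of this type / all features of this type
    """
    share = 100.0 * selected / total_selected
    portion = 100.0 * selected / total_in_type
    return share, portion

# duration: 10 of 391 features survive when 150 features are selected overall
share, portion = share_and_portion(10, 391, 150)
print(round(share, 1), round(portion, 1))  # 6.7 2.6
```

A PORTION above 100, as for higher semantics, suggests that a feature can be counted more than once across the cross-validation folds; that reading is also an inference.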
Overview
1 Different Perspectives on Emotion Recognition
2 FAU Aibo Emotion Corpus
3 Own Results on Emotion Classification
4 INTERSPEECH 2009 Emotion Challenge
INTERSPEECH 2009 Emotion Challenge
New goals:
challenge with standardized test conditions
open microphone: using the complete corpus
highly unbalanced classes
including all observed emotional categories
including chunks with low inter-labeler agreement
INTERSPEECH 2009 Emotion Challenge (cont.)
Speaker independent training and test sets
2-class problem: NEGative vs. IDLe
#        NEG      IDL        Σ
train  3,358    6,601    9,959
test   2,465    5,792    8,257
Σ      5,823   12,393   18,216

5-class problem: Anger, Emphatic, Neutral, Positive, Rest

#          A       E        N     P       R        Σ
train    881   2,093    5,590   674     721    9,959
test     611   1,508    5,377   215     546    8,257
Σ      1,492   3,601   10,967   889   1,267   18,216
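With classes this unbalanced, the choice of measure matters. The sketch below computes unweighted and weighted average recall from a confusion matrix and applies it to a trivial classifier that always answers IDL on the 2-class test set (2,465 NEG and 5,792 IDL chunks, from the table above). The function name and the dictionary layout are assumptions of this sketch.

```python
def average_recalls(confusion):
    """Return (unweighted, weighted) average recall.

    confusion[true_class][predicted_class] holds chunk counts.
    """
    recalls = {}
    sizes = {}
    for true_class, row in confusion.items():
        sizes[true_class] = sum(row.values())
        recalls[true_class] = row.get(true_class, 0) / sizes[true_class]
    total = sum(sizes.values())
    unweighted = sum(recalls.values()) / len(recalls)
    weighted = sum(recalls[c] * sizes[c] for c in recalls) / total
    return unweighted, weighted

always_idle = {
    "NEG": {"NEG": 0, "IDL": 2465},
    "IDL": {"NEG": 0, "IDL": 5792},
}
uar, war = average_recalls(always_idle)
print(round(100 * uar, 1), round(100 * war, 1))  # 50.0 70.1
```

The trivial classifier reaches about 70 % weighted recall but only chance level in unweighted recall, which is why the unweighted measure is the more informative one on this corpus.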
INTERSPEECH 2009 Emotion Challenge (cont.)
Sub-Challenges
1 Feature Sub-Challenge
    optimisation of feature extraction/selection; classifier settings fixed
2 Classifier Sub-Challenge
    optimisation of classification techniques; feature set given
3 Open Performance Sub-Challenge
    optimisation of feature extraction/selection and classification techniques
INTERSPEECH 2009 Emotion Challenge (cont.)
Participants
Open Performance        Classifier              Feature
Sub-Challenge           Sub-Challenge           Sub-Challenge           number of
2 classes   5 classes   2 classes   5 classes   2 classes   5 classes   participants
✓           ✓           –           –           –           –           7
✓           –           –           –           –           –           2
–           –           ✓           ✓           –           –           2
–           –           –           ✓           –           –           1
–           –           –           ✓           ✓           ✓           1
–           –           –           –           ✓           ✓           1

[7] B. Schuller, A. Batliner, S. Steidl, D. Seppi: Recognising Realistic Emotions and Affect in Speech: State of the Art and Lessons Learnt from the First Challenge, Speech Communication, Special Issue Sensing Emotion and Affect - Facing Realism in Speech Processing, to appear
INTERSPEECH 2009 Emotion Challenge (cont.)
2-class problem: NEGative vs. IDLe
[Figure: bar chart of unweighted and weighted average recall [%] per entry. Entries: Barra-Chicote et al., Polzehl et al., Vogt et al., Bozkurt et al., Luengo et al., Kockmann et al., Vlasenko et al., Dumouchel et al., Majority voting, Baseline. Values appearing in the chart: 71.2, 70.3, 69.2, 68.3, 67.9, 67.6, 67.2, 67.1, 67.7, 66.4]
INTERSPEECH 2009 Emotion Challenge (cont.)
5-class problem: Anger, Emphatic, Neutral, Positive, Rest
[Figure: bar chart of unweighted and weighted average recall [%] per entry. Entries: Dumouchel et al., Planet et al., Luengo et al., Vlasenko et al., Lee et al., Kockmann et al., Majority voting, Barra-Chicote et al., Vogt et al., Baseline, Bozkurt et al. Values appearing in the chart: 38.2, 39.4, 39.4, 41.2, 41.4, 41.4, 41.6, 41.6, 41.7, 44.0, 38.2]
State-of-the-Art: Summary
Berlin Emotion Speech Database
7-class problem: hot anger, disgust, fear/panic, happiness, sadness/sorrow, boredom, neutral
balanced classes
+ 90 % accuracy

FAU Aibo Emotion Corpus
4-class problem: Anger, Emphatic, Neutral, Motherese
subset with roughly balanced classes (Aibo chunk set)
+ 69 % unweighted average recall

5-class problem: Anger, Emphatic, Neutral, Positive, Rest
highly unbalanced classes, complete corpus
+ 44 % unweighted average recall

2-class problem: NEGative vs. IDLe
highly unbalanced classes, complete corpus
+ 71 % unweighted average recall