Vocal Emotion Recognition Overview Emotion Recognition State-of-the-Art in Classiﬁcation of...

Vocal Emotion Recognition: State-of-the-Art in Classification of Real-Life Emotions

October 26, 2010

Stefan Steidl

International Computer Science Institute (ICSI)

at Berkeley, CA

2 / 49

Overview

1 Different Perspectives on Emotion Recognition

2 FAU Aibo Emotion Corpus

3 Own Results on Emotion Classification

4 INTERSPEECH 2009 Emotion Challenge

S. Steidl: Vocal Emotion Recognition

3 / 49

Overview

1 Different Perspectives on Emotion Recognition
- Psychology of Emotion
- Computer Science





4 / 49

Facial Expressions of Emotion


5 / 49

Universal Basic Emotions

Paul Ekman

postulates the existence of 6 basic emotions:
anger, fear, disgust, surprise, joy, sadness

- other emotions are mixed or blended emotions
- universal facial expressions


6 / 49

Terminology

Different affective states [1]:

type of affective state  intensity  duration  synchronization  event focus  appraisal elicitation  rapidity of change  behavioral impact
emotion                  •• - •••   •         •••              •••          •••                    •••                 •••
mood                     • - ••     ••        •                •            •                      ••                  •
interpersonal stances    • - ••     • - ••    •                ••           •                      •••                 ••
attitudes                ◦ - ••     •• - •••  ◦                ◦            •                      ◦ - •               •
personality traits       ◦ - •      •••       ◦                ◦            ◦                      ◦                   •

◦: low, •: medium, ••: high, •••: very high, -: indicates a range

[1] K. R. Scherer: Vocal communication of emotion: A review of research paradigms, Speech Communication, Vol. 40, pp. 227-256, 2003


7 / 49

Terminology (cont.)

Definition of Emotion

Emotion (Scherer)

episodes of coordinated changes in several components, including at least

- neurophysiological activation,
- motor expression, and
- subjective feeling,

but possibly also

- action tendencies and
- cognitive processes,

in response to external or internal events of major significance to the organism


8 / 49

Vocal Expression of Emotion

Results from studies in Psychology of Emotion

                              anger/rage  fear/panic  sadness  joy/elation  boredom  stress
Intensity                     ↑           ↑           ↓        ↑                     ↑
F0 floor/mean                 ↑           ↑           ↓        ↑                     ↑
F0 variability                ↑                       ↓        ↑            ↓
F0 range                      ↑           ↑ (↓)¹      ↓        ↑            ↓
Sentence contour              ↓                       ↓
High frequency energy         ↑           ↑           ↓        (↑)²
Speech and articulation rate  ↑           ↑           ↓        (↑)²         ↓

¹ Banse and Scherer found a decrease in F0 range
² inconclusive evidence

Goal

Classification of the subject’s actual emotional state (some sort of lie detector for emotions)


9 / 49

Human-Computer Interaction (HCI)

Emotion-Related User States

- naturally occurring states of users in human-machine communication
- emotions in a broader sense
- coordinated changes in several components NOT required
- classification of the perceived emotional state, not necessarily the actual emotion of the speaker


10 / 49

Pattern Recognition

Pattern Recognition Point of View

- classification task: choose 1 of n given classes
- discrimination of classes rather than classification
- definition of “good” features
- machine classification

Actually not needed:
- definition of the term emotion
- information on how specific features change


11 / 49

Emotional Speech Corpora

Acted data

- based on Basic Emotions theory
- suited for studying prototypical emotions
- corpora easy to create (inexpensive, no labeling process)
- high audio quality
- balanced classes
- neutral linguistic content (focus on acoustics only)
- high recognition results


12 / 49

Emotional Speech Corpora (cont.)

Popular corpora

Emotional Prosody Speech and Transcript corpus (LDC): 15 classes

Berlin Emotional Speech Database (EmoDB): 7 classes

89.9 % accuracy (speaker-independent LOSO evaluation, speaker adaptation, feature selection) [2]

Danish Emotional Speech Corpus: 5 classes

74.5 % accuracy (10-fold SCV, feature selection) [3]

[2] B. Vlasenko et al.: Combining Frame and Turn-Level Information for Robust Recognition of Emotions within Speech, INTERSPEECH 2007

[3] B. Schuller et al.: Emotion Recognition in the Noise Applying Large Acoustic Feature Sets, Speech Prosody 2006


13 / 49

Emotional Speech Corpora (cont.)

Naturally occurring emotions

- states that actually appear in HCI (real applications)
- difficult to create (appropriate scenario needed, ethical concerns, need to label data)
- low emotional intensity
- in general ≥ 80 % neutral
- low audio quality (reverberation, noise, far-distance microphones)
- needed for machine classification (because conditions between training and test must not differ too much)
- research on both acoustic and linguistic features possible
- new research questions: optimal emotion unit
- almost no corpora large enough for machine classification available (do not exist or are not available for research)


14 / 49

Overview


2 FAU Aibo Emotion Corpus
- Scenario
- Labeling of User States
- Data-driven Dimensions of Emotion
- Units of Analysis
- Sparse Data Problem




15 / 49

The FAU Aibo Emotion Corpus

51 children (30 f, 21 m) at the age of 10 to 13

8.9 hours of spontaneous speech (mainly short commands)

48,401 words in 13,642 audio files


16 / 49

FAU Aibo Emotion Corpus (cont.)

database for CEICES and the INTERSPEECH 2009 Emotion Challenge

available for scientific, non-commercial use: http://www5.cs.fau.de/FAUAiboEmotionCorpus

[4] S. Steidl: Automatic Classification of Emotion-Related User States in Spontaneous Children’s Speech, Logos Verlag, Berlin

available online: http://www5.cs.fau.de/en/our-team/steidl-stefan/dissertation/

17 / 49

Emotion-Related User States

11 categories (chosen by inspecting the data prior to labeling):

- joyful
- surprised
- motherese
- neutral
- bored
- emphatic
- helpless
- touchy/irritated
- reprimanding
- angry
- other

motherese

the way mothers/parents address their babies, either because Aibo is well-behaving or because the child wants Aibo to obey; positive equivalent to reprimanding

emphatic

pronounced, accentuated, sometimes hyper-articulated way of speaking, but without showing any emotion

reprimanding

the child is reproachful, reprimanding, ‘wags the finger’


18 / 49

Labeling of User States

Labeling:
- 5 students of linguistics
- holistic labeling on the word level
- majority vote

emotion category     words        %
angry (A)              134    0.3 %
touchy (T)             419    0.9 %
reprimanding (R)       463    1.0 %
emphatic (E)         2,807    5.8 %
neutral (N)         39,975   82.6 %
motherese (M)        1,311    2.7 %
joyful (J)             109    0.2 %
...
all                 48,401  100.0 %
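The majority vote over the five labelers can be illustrated with a few lines of Python. This is only a sketch: the slide does not say how ties or weak majorities are handled, so the threshold of three agreeing labelers (`min_votes`) and the function name are assumptions made for illustration.

```python
from collections import Counter

def majority_vote(labels, min_votes=3):
    """Return the label chosen by at least `min_votes` labelers, else None.

    `min_votes=3` (an absolute majority of five labelers) is an assumption;
    the slide only states that a majority vote is used.
    """
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_votes else None

# One word, labeled independently by five labelers.
print(majority_vote(["N", "N", "E", "N", "M"]))  # N
print(majority_vote(["A", "A", "T", "T", "N"]))  # None
```

Words without a sufficiently clear majority would have to be handled separately under this assumption.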


19 / 49

Labeling of User States (cont.)

Confusion matrix

majority vote         A     T     R     E     N     M     J
angry (A)          43.3  13.0  12.9  12.1  18.1   0.1   0.0
touchy (T)          4.5  42.9  11.7  13.7  23.5   1.0   0.1
reprimanding (R)    3.8  15.7  45.8  14.0  18.2   1.3   0.1
emphatic (E)        1.3   5.8   6.7  53.6  29.9   1.2   0.5
neutral (N)         0.4   2.2   1.5  13.9  77.8   2.7   0.5
motherese (M)       0.0   0.8   1.4   4.9  30.4  61.1   0.9
joyful (J)          0.1   0.6   1.1   7.3  32.4   2.0  54.2


20 / 49

Data-driven Dimensions of Emotions

Non-metric dimensional scaling:
- arranging the emotion categories in the 2-dimensional space
- states that are often confused are close to each other

[Figure: 2-dimensional arrangement of the categories angry, touchy, reprimanding, emphatic, neutral, motherese, and joyful; axes: valence (negative to positive) and interaction (−interaction to +interaction)]


21 / 49

Units of Analysis

Units of analysis

[Figure: two example turns (Ohm_18_342 and Ohm_18_343) segmented at the word level, the chunk level, and the turn level]

Advantages/disadvantages of larger units
+ more information
− less emotional homogeneity


22 / 49

Sparse Data Problem

Super classes:

- Anger: angry, touchy/irritated, reprimanding
- Emphatic
- Neutral
- Motherese

[Figure: two 2-dimensional scaling plots (both axes from −1.5 to 1.5), one of the seven categories angry, touchy, reprimanding, emphatic, neutral, motherese, and joyful (S = 0.32, RSQ = 0.73) and one of the four super classes Anger, Emphatic, Neutral, and Motherese (S = 0.19, RSQ = 0.90)]
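Merging categories into super classes amounts to a simple lookup. The sketch below applies it to the word counts from the labeling table; the dictionary and function names are hypothetical, and categories outside the four super classes (for example joyful) are simply skipped here, which is an assumption about how the remaining words are treated.

```python
from collections import Counter

# Mapping of labeled categories to the four super classes.
SUPER_CLASS = {
    "angry": "Anger",
    "touchy": "Anger",
    "reprimanding": "Anger",
    "emphatic": "Emphatic",
    "neutral": "Neutral",
    "motherese": "Motherese",
}

# Word counts per category as listed in the labeling table.
WORD_COUNTS = {
    "angry": 134,
    "touchy": 419,
    "reprimanding": 463,
    "emphatic": 2807,
    "neutral": 39975,
    "motherese": 1311,
    "joyful": 109,
}

def super_class_counts(word_counts):
    """Sum the word counts of all categories that map to a super class."""
    totals = Counter()
    for category, count in word_counts.items():
        super_class = SUPER_CLASS.get(category)
        if super_class is not None:  # categories without a super class are skipped
            totals[super_class] += count
    return dict(totals)

print(super_class_counts(WORD_COUNTS))
# {'Anger': 1016, 'Emphatic': 2807, 'Neutral': 39975, 'Motherese': 1311}
```

Even after merging, Anger covers only about a thousand words, which illustrates why the sparse data problem remains.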


23 / 49

Sparse Data Problem (cont.)

Data subsets

- Aibo word set
- Aibo chunk set
- Aibo turn set
- Aibo corpus

                 number of   taken from
data set         words       # chunks   # turns
Aibo corpus      48,401      18,216     13,642
Aibo word set     6,070       4,543      3,996
Aibo chunk set   13,217       4,543      3,996
Aibo turn set    17,618       6,413      3,996


24 / 49

Overview



3 Own Results on Emotion Classification
- Results for different Units of Analysis
- Machine vs. Human
- Feature Types and their Relevance



25 / 49

Most Appropriate Unit of Analysis

Classification
- complete set of features
- classification with Linear Discriminant Analysis (LDA)
- 51-fold speaker-independent cross-validation

unit of analysis   number of features   number of samples   average recall
word level         265                  6,070 words         67.2 %
chunk level        700                  4,543 chunks        68.9 %
turn level         700                  3,996 turns         63.2 %

Chunks: best compromise between
- length of the segment
- homogeneity of the emotional state within the segment
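With 51 speakers and 51 folds, the speaker-independent cross-validation presumably holds out one child per fold. The following sketch shows how such folds could be built; the function name and the toy speaker IDs are hypothetical, and the classifier itself is left out.

```python
from collections import defaultdict

def leave_one_speaker_out(speaker_ids):
    """Yield (held_out_speaker, train_indices, test_indices), one fold per speaker."""
    by_speaker = defaultdict(list)
    for index, speaker in enumerate(speaker_ids):
        by_speaker[speaker].append(index)
    for speaker in sorted(by_speaker):
        test = by_speaker[speaker]
        held_out = set(test)
        # No sample of the held-out speaker may appear in the training part.
        train = [i for i in range(len(speaker_ids)) if i not in held_out]
        yield speaker, train, test

# Toy example: five samples from three (hypothetical) speakers.
speakers = ["child_01", "child_02", "child_01", "child_03", "child_02"]
folds = list(leave_one_speaker_out(speakers))
print(len(folds))  # 3
print(folds[0])    # ('child_01', [1, 3, 4], [0, 2])
```

Keeping every speaker entirely in either the training or the test part prevents the classifier from profiting from speaker-specific voice characteristics.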


26 / 49

Machine Classifier vs. Human Labeler

Entropy based measure:

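[Figure: worked example for one word over the classes A, E, N, M: the label distribution of the human labelers (0.75, 0.25, 0.0, 0.0) is combined with the hard decision of the decoder (1.0 for a single class), which gives the distribution (0.5, 0.375, 0.125, 0.0) and the entropy Hdec = 1.41]

implicit weighting of classification ‘errors’ depending on the word that is classified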
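The numbers on this slide are consistent with the following reading, given here only as a sketch: the label distribution of the reference labelers and the hard decision of the decoder are averaged with equal weight, and the entropy of the result is computed in bits. The equal weighting, the class order, and the function names are assumptions inferred from the values 0.5, 0.375, 0.125, and 1.41.

```python
import math

def entropy(distribution):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return -sum(p * math.log2(p) for p in distribution if p > 0)

def decoder_entropy(reference_distribution, decided_class):
    """Entropy after averaging the reference labels with a one-hot decision."""
    mixed = [
        0.5 * p + (0.5 if index == decided_class else 0.0)
        for index, p in enumerate(reference_distribution)
    ]
    return entropy(mixed)

# Reference labelers: 75 % for one class, 25 % for a second one (hypothetical order).
reference = [0.75, 0.25, 0.0, 0.0]

h_disagree = decoder_entropy(reference, 2)  # decoder picks a class no labeler chose
h_agree = decoder_entropy(reference, 0)     # decoder picks the labelers' majority class

print(round(h_disagree, 2))  # 1.41
print(round(h_agree, 2))     # 0.54
```

A decision that matches the majority of the labelers lowers the entropy, while a decision on an unambiguous word that contradicts all labelers raises it the most; this is the implicit weighting of ‘errors’.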


27 / 49

Machine Classifier vs. Human Labeler (cont.)

Classification: Aibo word set

[Figure: distribution of the entropy values (x-axis: entropy, 0.2 to 1.6; y-axis: relative frequency, 0 to 0.25) for the average human labeler and for the machine classifier]

[5] S. Steidl, M. Levit, A. Batliner, E. Nöth, H. Niemann: “Of All Things the Measure is Man” – Classification of Emotions and Inter-Labeler Consistency, ICASSP 2005


28 / 49

Evaluation of Different Types of Features

Types of features
- acoustic features
  - prosodic features
  - spectral features
  - voice quality features
- linguistic features

Evaluation
- Artificial Neural Networks (ANN)
- 51-fold speaker-independent cross-validation
- combination by early or late fusion
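The difference between early and late fusion can be sketched as follows. Early fusion concatenates the feature vectors before a single classifier sees them; late fusion combines the outputs of separately trained classifiers. Averaging class posteriors is only one common late-fusion rule and is assumed here; the slide does not specify the rule that was used, and all names and numbers in the example are hypothetical.

```python
def early_fusion(acoustic_features, linguistic_features):
    """Concatenate both feature vectors into one input for a single classifier."""
    return list(acoustic_features) + list(linguistic_features)

def late_fusion(posteriors_per_classifier):
    """Average the class posteriors of several classifiers and pick the best class."""
    classes = posteriors_per_classifier[0].keys()
    n = len(posteriors_per_classifier)
    averaged = {
        c: sum(posteriors[c] for posteriors in posteriors_per_classifier) / n
        for c in classes
    }
    return max(averaged, key=averaged.get), averaged

# Hypothetical posteriors of an acoustic and a linguistic classifier for one chunk.
acoustic = {"A": 0.5, "E": 0.3, "N": 0.1, "M": 0.1}
linguistic = {"A": 0.2, "E": 0.6, "N": 0.1, "M": 0.1}

decision, averaged = late_fusion([acoustic, linguistic])
print(decision)                             # E
print(early_fusion([1.0, 2.0], [3.0]))      # [1.0, 2.0, 3.0]
```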


29 / 49

Acoustic Features: Prosody

Prosody

suprasegmental characteristics such as
- pitch contour
- energy contour
- temporal shortening/lengthening of words
- duration of pauses between words


30 / 49

Acoustic Features: Prosody (cont.)

Classification results: Aibo chunk set

[Figure: bar chart of the average recall [%] (axis from 0 to 80) for the prosodic feature groups F0 (29), duration (37), energy (25), pauses (16), and all; printed values: 50.6, 54.4, 58.5, 59.0, 42.0]


31 / 49

Acoustic Features: Spectral Characteristics (cont.)


[Figure: bar chart of the average recall [%] (axis from 0 to 80) with bars for prosody (107), MFCC (24), formants (16), jitter/shimmer (4), HNR (2), TEO (64), and best combination; printed values on this slide: 59.0, 58.9, 48.2]


32 / 49

Acoustic Features: Voice Quality


[Figure: bar chart of the average recall [%] (axis from 0 to 80) with bars for prosody (107), MFCC (24), formants (16), jitter/shimmer (4), HNR (2), TEO (64), and best combination; printed values on this slide: 59.0, 58.9, 48.2, 47.0, 32.5, 52.3]


33 / 49

Acoustic Features: Combination


[Figure: bar chart of the average recall [%] (axis from 0 to 80) with bars for prosody (107), MFCC (24), formants (16), jitter/shimmer (4), HNR (2), TEO (64), and best combination; printed values on this slide: 59.0, 58.9, 48.2, 47.0, 32.5, 52.3, 65.4]


34 / 49

Linguistic Features

Types of linguistic features

- word characteristics
  - average word length (number of letters, phonemes, syllables)
  - proportion of word fragments
  - average number of repetitions
- part-of-speech features
- unigram models
- bag-of-words


35 / 49

Linguistic Features (cont.)

Part-of-Speech (POS) Features
- only 6 coarse POS categories
- can be annotated without considering context

[Figure: share (% of total) of the six POS categories within the classes Anger, Emphatic, Neutral, Motherese, Joyful, and Other; POS categories: nouns, proper names; inflected adjectives; not inflected adjectives, present/past participles; auxiliaries; (other) verbs, infinitives; particles, interjections, articles, pronouns]


36 / 49


Unigram Models

$u(w, e) = \log_{10} \frac{P(e \mid w)}{P(e)}$

Anger                  P(A|w)    Emphatic         P(E|w)
böser (bad)            29.2 %    stopp (stop)     30.5 %
stehenbleiben (stop)   18.9 %    halt (halt)      29.3 %
nein (no)              17.0 %    links (left)     20.5 %
aufstehen (get up)     12.3 %    rechts (right)   18.9 %
Aibo (Aibo)            10.1 %    nein (no)        17.6 %

Neutral                P(N|w)    Motherese        P(M|w)
okay (okay)            98.6 %    fein (fine)      57.5 %
und (and)              98.5 %    ganz (very)      41.9 %
Stück (bit)            98.5 %    braver (good)    36.0 %
in (in)                98.2 %    sehr (very)      23.5 %
noch (still)           96.2 %    brav (good)      21.7 %
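The unigram score compares how likely a class is after seeing a word with how likely it is a priori. A minimal sketch, using two posteriors from the table and, purely for illustration, the class shares of the complete corpus from the labeling table as priors P(e); which priors were actually used is not stated on the slide, so they are an assumption.

```python
import math

def unigram_score(p_class_given_word, p_class):
    """u(w, e) = log10(P(e|w) / P(e)); positive if the word favors the class."""
    return math.log10(p_class_given_word / p_class)

# P(M|fein) = 57.5 %, assumed prior P(M) = 2.7 % (motherese share of all words)
u_fein_motherese = unigram_score(0.575, 0.027)

# P(N|okay) = 98.6 %, assumed prior P(N) = 82.6 % (neutral share of all words)
u_okay_neutral = unigram_score(0.986, 0.826)

print(round(u_fein_motherese, 2))  # 1.33
print(round(u_okay_neutral, 2))    # 0.08
```

Under these assumed priors, "fein" is far more indicative of Motherese than "okay" is of Neutral, although "okay" has the much higher posterior: the frequent Neutral class gains little from any single word.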


37 / 49


Bag-of-Words

utterance: Aibo, geh nach links! (Aibo, move to the left!)

[Figure: bag-of-words vector of the utterance with one component per vocabulary word (..., Aibolein, allen, ...); only the components of the words Aibo, geh, nach, and links are non-zero]

- representation of the linguistic content
- word order getting lost
- various dimensionality reduction techniques
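A bag-of-words vector for the example utterance can be built as sketched below. The tiny vocabulary is hypothetical (the slide indicates a vocabulary of 254 words before reduction), raw counts are used rather than any particular weighting, and the dimensionality reduction step is omitted.

```python
def bag_of_words(utterance, vocabulary):
    """Count how often each vocabulary word occurs in the utterance."""
    tokens = [token.strip(",.!?").lower() for token in utterance.split()]
    tokens = [token for token in tokens if token]
    return [tokens.count(word) for word in vocabulary]

# Hypothetical mini vocabulary in alphabetical order.
vocabulary = ["aibo", "aibolein", "allen", "geh", "links", "nach", "rechts"]

vector = bag_of_words("Aibo, geh nach links!", vocabulary)
shuffled = bag_of_words("links nach geh Aibo", vocabulary)

print(vector)              # [1, 0, 0, 1, 1, 1, 0]
print(vector == shuffled)  # True
```

The second call shows the loss of word order: any permutation of the same words yields the same vector.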


38 / 49


Classification results: Aibo chunk set

[Figure: bar chart of the average recall [%] (axis from 0 to 80) for the linguistic feature groups POS (6), unigram models (16), word statistics (6), BOW (254 → 50), and best combination; printed values: 54.3, 56.1, 61.9, 61.9, 62.2]


39 / 49

Combination of Acoustic and Linguistic Features


[Figure: bar chart of the average recall [%] (axis from 0 to 80)]

feature set                                                average recall
acoustic features, best combination (late fusion, ANN)     65.4 %
linguistic features, best combination (late fusion, ANN)   62.2 %
combination (late fusion, ANN)                             67.1 %
combination (early fusion, LDA)                            68.9 %


40 / 49

Similar Results within CEICES

CEICES: Combining Efforts for Improving Automatic Classification of Emotional User States

- collaboration of various research groups within the European Network of Excellence HUMAINE (2004-2007)
- state-of-the-art feature set with ≥ 4,000 features
- SVM (linear kernel), 3-fold speaker-independent cross-validation
- selection of 150 features (SFFS): surviving feature types?
- only chunk based features, no information outside Aibo chunk set

[6] A. Batliner, S. Steidl, B. Schuller, D. Seppi, T. Vogt, J. Wagner, L. Devillers, L. Vidrascu, V. Aharonson, L. Kessous, N. Amir: Whodunnit – Searching for the Most Important Feature Types Signalling Emotion-Related User States in Speech, Computer Speech and Language, Vol. 25, Issue 1 (January 2011), pp. 4-28


41 / 49

Similar Results within CEICES (cont.)

SFFS (a): 150 features selected from the complete feature set
SFFS (b): 150 acoustic and 150 linguistic features selected separately
F measure, share, and portion in %

                             SFFS (a)                            SFFS (b)
feature type      # total    #     F measure  share  portion    #     F measure  share  portion
duration             391     10    49.6         6.7     2.6     28    54.9        18.7     7.2
energy               265     32    56.3        21.3    12.1     33    56.9        22.0    12.5
F0                   333     16    46.8        10.7     4.8     23    46.7        15.3     6.9
spectrum             656     15    46.2        10.0     2.3     17    49.9        11.3     2.6
cepstrum           1,699     16    46.4        10.7     1.0     23    50.4        15.3     1.4
voice quality        153      7    38.7         4.7     4.6     11    41.5         7.3     7.2
wavelets             216      5    35.3         3.4     2.3     15    44.9        10.0     6.9
all acoustic       3,713    101    –           67.3     2.7    150    63.4       100.0     4.0
BOW                  476     25    37.4        16.7     5.3     94    53.2        62.7    19.7
POS                   31      7    48.1         4.7    22.6     27    54.9        18.0    87.1
higher semantics      12     17    56.0        11.3   141.7     27    57.9        18.0   225.0
varia                 12      0    –            0.0     0.0      2    –            0.1    16.7
all linguistic       531     49    –           32.7     9.6    150    62.6       100.0    28.2
all                4,244    150    65.5       100.0     3.5
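The SHARE and PORTION rows are not defined on the slide, but the printed numbers suggest the following reading, shown here as a sketch: SHARE is the percentage of the 150 selected features that belong to a feature type, and PORTION is the percentage of all features of that type that survived the selection. The function name is hypothetical, and the definitions are inferred from the table, not taken from the paper.

```python
def share_and_portion(selected_of_type, total_of_type, selected_overall=150):
    """Return (SHARE, PORTION) in percent, rounded to one decimal place.

    SHARE   = selected features of this type / all selected features
    PORTION = selected features of this type / all features of this type
    """
    share = 100.0 * selected_of_type / selected_overall
    portion = 100.0 * selected_of_type / total_of_type
    return round(share, 1), round(portion, 1)

# Values from the first SFFS block of the table.
print(share_and_portion(10, 391))  # duration: (6.7, 2.6)
print(share_and_portion(32, 265))  # energy:   (21.3, 12.1)
print(share_and_portion(25, 476))  # BOW:      (16.7, 5.3)
```

Read this way, energy features are both the largest group among the survivors and, of the acoustic types, the one with the highest survival rate.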


42 / 49

Overview






43 / 49

INTERSPEECH 2009 Emotion Challenge

New goals:

- challenge with standardized test conditions
- open microphone: using the complete corpus
  - highly unbalanced classes
  - including all observed emotional categories
  - including chunks with low inter-labeler agreement


44 / 49

INTERSPEECH 2009 Emotion Challenge (cont.)

Speaker-independent training and test sets

2-class problem: NEGative vs. IDLe

#         NEG      IDL        ∑
train   3,358    6,601    9,959
test    2,465    5,792    8,257
∑       5,823   12,393   18,216

5-class problem: Anger, Emphatic, Neutral, Positive, Rest

#           A       E        N     P       R        ∑
train     881   2,093    5,590   674     721    9,959
test      611   1,508    5,377   215     546    8,257
∑       1,492   3,601   10,967   889   1,267   18,216
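With class distributions this skewed, the weighted and the unweighted average recall can differ considerably. The sketch below computes both for a trivial classifier that labels every test chunk of the 2-class problem as IDL; the function name and the dictionary layout are hypothetical, and the counts are the test set sizes from the table.

```python
def average_recalls(confusion):
    """confusion[true][predicted] holds counts; returns (unweighted, weighted) recall."""
    per_class_recall = []
    correct = 0
    total = 0
    for true_class, row in confusion.items():
        class_size = sum(row.values())
        hits = row.get(true_class, 0)
        per_class_recall.append(hits / class_size)
        correct += hits
        total += class_size
    unweighted = sum(per_class_recall) / len(per_class_recall)  # every class counts equally
    weighted = correct / total                                  # every chunk counts equally
    return unweighted, weighted

# Trivial classifier: every chunk of the 2-class test set is labeled IDL.
always_idle = {
    "NEG": {"NEG": 0, "IDL": 2465},
    "IDL": {"NEG": 0, "IDL": 5792},
}

unweighted, weighted = average_recalls(always_idle)
print(round(100 * unweighted, 1))  # 50.0
print(round(100 * weighted, 1))    # 70.1
```

The weighted recall of about 70 % rewards the trivial classifier for the class imbalance, whereas the unweighted recall stays at chance level, which makes it the more informative measure for unbalanced classes.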


45 / 49


Sub-Challenges

1 Feature Sub-Challenge: optimisation of feature extraction/selection; classifier settings fixed

2 Classifier Sub-Challenge: optimisation of classification techniques; feature set given

3 Open Performance Sub-Challenge: optimisation of feature extraction/selection and classification techniques


46 / 49


Participants

Open Performance        Classifier              Feature                 number of
Sub-Challenge           Sub-Challenge           Sub-Challenge           participants
2 classes   5 classes   2 classes   5 classes   2 classes   5 classes
✓           ✓           –           –           –           –           7
✓           –           –           –           –           –           2
–           –           ✓           ✓           –           –           2
–           –           –           ✓           –           –           1
–           –           –           ✓           ✓           ✓           1
–           –           –           –           ✓           ✓           1

[7] B. Schuller, A. Batliner, S. Steidl, D. Seppi: Recognising Realistic Emotions and Affect in Speech: State of the Art and Lessons Learnt from the First Challenge, Speech Communication, Special Issue Sensing Emotion and Affect - Facing Realism in Speech Processing, to appear


47 / 49


2-class problem: NEGative vs. IDLe

[Figure: bar chart of the unweighted and the weighted average recall [%] (axis from 60 to 74) for Barra-Chicote et al., Polzehl et al., Vogt et al., Bozkurt et al., Luengo et al., Kockmann et al., Vlasenko et al., Dumouchel et al., majority voting, and the baseline; printed values: 71.2, 70.3, 69.2, 68.3, 67.9, 67.6, 67.2, 67.1, 67.7, 66.4]


48 / 49


5-class problem: Anger, Emphatic, Neutral, Positive, Rest

[Figure: bar chart of the unweighted and the weighted average recall [%] (axis from 35 to 55) for Dumouchel et al., Planet et al., Luengo et al., Vlasenko et al., Lee et al., Kockmann et al., majority voting, Barra-Chicote et al., Vogt et al., the baseline, and Bozkurt et al.; printed values: 38.2, 39.4, 39.4, 41.2, 41.4, 41.4, 41.6, 41.6, 41.7, 44.0]


49 / 49

State-of-the-Art: Summary

Berlin Emotion Speech Database
- 7-class problem: hot anger, disgust, fear/panic, happiness, sadness/sorrow, boredom, neutral
- balanced classes
→ 90 % accuracy

FAU Aibo Emotion Corpus
- 4-class problem: Anger, Emphatic, Neutral, Motherese
- subset with roughly balanced classes (Aibo chunk set)
→ 69 % unweighted average recall

- 5-class problem: Anger, Emphatic, Neutral, Positive, Rest
- highly unbalanced classes, complete corpus
→ 44 % unweighted average recall

- 2-class problem: NEGative vs. IDLe
- highly unbalanced classes, complete corpus
→ 71 % unweighted average recall

