Speaker-Dependent Audio-Visual Emotion Recognition


  • Speaker-Dependent Audio-Visual Emotion Recognition

    Sanaul Haq and Philip J.B. Jackson
    Centre for Vision, Speech and Signal Processing, University of Surrey, UK


    Outline

    Introduction

    Method

    Audio-visual experiments

    Summary


  • Introduction


    In human-to-human interaction, emotions play an important role, allowing people to express themselves beyond the verbal domain.

    A message carries two kinds of information [Busso et al., 2007]:
    Linguistic information: the textual content of the message.
    Paralinguistic information: body language, facial expressions, pitch and tone of voice, health and identity.

    Audio-based emotion recognition: Borchert et al., 2005; Lin and Wei, 2005; Luengo et al., 2005.

    Visual-based emotion recognition: Ashraf et al., 2007; Bartlett et al., 2005; Valstar et al., 2007.

    Audio-visual emotion recognition: Busso et al., 2004; Schuller et al., 2007; Song et al., 2004.

  • Method

    Recording of audio-visual database

    Feature extraction

    Feature selection

    Feature reduction

    Classification


  • Audio-visual database

    Four English male actors recorded in 7 emotions: anger, disgust, fear, happiness, neutral, sadness and surprise.

    15 phonetically-balanced TIMIT sentences per emotion: 3 common, 2 emotion-specific and 10 generic sentences.

    Size of database: 480 sentences, 120 utterances per subject.

    For visual features, the face was painted with 60 markers.

    3dMD dynamic face capture system [3dMD capture system, 2009].


    Figure: Subjects with expressions (from left): Displeased (anger, disgust), Gloomy (fear, sadness), Excited (happiness, surprise) and Neutral (neutral).

  • Feature extraction

    Audio features:
    Pitch features: min., max., mean and standard deviation of Mel frequency and its first-order difference.
    Energy features: min., max., mean, standard deviation and range of energies in the speech signal and in the frequency bands 0-0.5 kHz, 0.5-1 kHz, 1-2 kHz, 2-4 kHz and 4-8 kHz.
    Duration features: voiced speech, unvoiced speech, sentence duration, voiced-to-unvoiced speech duration ratio, speech rate, voiced-speech-to-sentence duration ratio.
    Spectral features: mean and standard deviation of 12 MFCCs.

    Visual features: mean and standard deviation of 2D facial marker coordinates (60 markers).
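    The band-energy statistics above can be sketched with plain numpy. This is an illustrative reconstruction, not the authors' pipeline; the frame length and hop size are assumed values.

```python
import numpy as np

def band_energy_stats(signal, fs, band, frame_len=400, hop=160):
    """Min, max, mean, std and range of per-frame energies in one band.

    frame_len=400 and hop=160 (25 ms / 10 ms at 16 kHz) are assumed
    values, not taken from the presentation.
    """
    lo, hi = band
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    mask = (freqs >= lo) & (freqs < hi)
    energies = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        spec = np.abs(np.fft.rfft(signal[start:start + frame_len])) ** 2
        energies.append(spec[mask].sum())
    e = np.asarray(energies)
    return e.min(), e.max(), e.mean(), e.std(), e.max() - e.min()

# Example: energy statistics of a 200 Hz tone in the 0-0.5 kHz band.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 200 * t)
feats = band_energy_stats(x, fs, (0, 500))
```

    The same function, applied per band, yields the five energy statistics listed for each of the frequency bands above.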


  • Figure: Audio feature extraction with Speech Filing System software (left), and video data (right) with overlaid tracked marker locations.


  • Feature selection

    Plus l-Take Away r algorithm [Chen, 1978]:
    Combines the Sequential Forward Selection (SFS) and Sequential Backward Selection (SBS) algorithms.
    Selection criterion: Bhattacharyya distance.
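    The search can be sketched as follows: each cycle adds the l best features (SFS) and then drops the r least useful ones (SBS), scoring subsets by the Bhattacharyya distance between two Gaussian-modelled classes. A minimal two-class illustration; the actual experiments used multi-class data, and the l, r values here are assumed.

```python
import numpy as np

def bhattacharyya(X1, X2):
    """Bhattacharyya distance between two classes modelled as Gaussians."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    k = X1.shape[1]
    S1 = np.cov(X1, rowvar=False).reshape(k, k)
    S2 = np.cov(X2, rowvar=False).reshape(k, k)
    S = (S1 + S2) / 2
    d = m1 - m2
    return (d @ np.linalg.solve(S, d) / 8
            + 0.5 * np.log(np.linalg.det(S)
                           / np.sqrt(np.linalg.det(S1) * np.linalg.det(S2))))

def plus_l_take_r(X1, X2, n_select, l=2, r=1):
    """Plus l-Take Away r: l SFS additions, then r SBS removals, per cycle."""
    assert l > r, "each cycle must make net progress"
    selected, remaining = [], list(range(X1.shape[1]))
    score = lambda idx: bhattacharyya(X1[:, idx], X2[:, idx])
    while len(selected) < n_select:
        for _ in range(l):                     # SFS: add best remaining feature
            if remaining:
                best = max(remaining, key=lambda f: score(selected + [f]))
                selected.append(best)
                remaining.remove(best)
        for _ in range(r):                     # SBS: drop least useful feature
            if len(selected) > 1:
                worst = max(selected,
                            key=lambda f: score([g for g in selected if g != f]))
                selected.remove(worst)
                remaining.append(worst)
    return selected[:n_select]

# Example: only feature 0 separates the classes, so it should be kept.
rng = np.random.default_rng(0)
X1 = rng.normal(0.0, 1.0, (200, 4))
X2 = rng.normal(0.0, 1.0, (200, 4))
X2[:, 0] += 5.0
selected = plus_l_take_r(X1, X2, n_select=2)
```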


  • Feature reduction

    Principal Component Analysis (PCA)

    Linear Discriminant Analysis (LDA)
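    Both reduction methods have compact closed forms. A minimal numpy sketch, not the authors' code: PCA projects onto the top eigenvectors of the data covariance, while LDA finds directions that maximise between-class over within-class scatter.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components (unsupervised)."""
    Xc = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(vals)[::-1][:n_components]   # largest variance first
    return Xc @ vecs[:, order]

def lda(X, y, n_components):
    """Fisher LDA: maximise between-class over within-class scatter."""
    d = X.shape[1]
    mean = X.mean(axis=0)
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)               # within-class scatter
        diff = (mc - mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)             # between-class scatter
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(vals.real)[::-1][:n_components]
    return X @ vecs[:, order].real

# Example: reduce 5-D data to 2 PCA dimensions and 1 LDA dimension.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = np.repeat([0, 1], 50)
Z_pca = pca(X, 2)
Z_lda = lda(X, y, 1)
```

    Note that LDA yields at most C-1 discriminant directions for C classes, which matches the LDA dimensionalities reported in the results (6 for 7 classes, 3 for 4 classes).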


  • Classification

    Gaussian classifier: uses Bayes decision theory, where the class-conditional probability density p(x|w_i) is assumed to be Gaussian for each class w_i. The Bayes decision rule is: decide w_i if P(w_i|x) > P(w_j|x) for all j != i,


    where P(w_i|x) = p(x|w_i) P(w_i) / p(x) is the posterior probability and P(w_i) is the prior class probability.
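    A minimal sketch of such a Gaussian classifier, assuming full covariance matrices and priors estimated from class frequencies; classification maximises the log posterior (up to the constant log p(x)).

```python
import numpy as np

class GaussianClassifier:
    """Bayes classifier with one Gaussian density per class."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.params = {}
        for c in self.classes:
            Xc = X[y == c]
            self.params[c] = (Xc.mean(axis=0),
                              np.cov(Xc, rowvar=False),
                              len(Xc) / len(y))     # prior P(w_i)
        return self

    def predict(self, X):
        scores = []
        for c in self.classes:
            mu, cov, prior = self.params[c]
            d = X - mu
            inv = np.linalg.inv(cov)
            # log p(x|w_i) + log P(w_i): the log posterior up to a constant
            scores.append(-0.5 * np.einsum('ij,jk,ik->i', d, inv, d)
                          - 0.5 * np.log(np.linalg.det(cov))
                          + np.log(prior))
        return self.classes[np.argmax(scores, axis=0)]

# Example: two well-separated 2-D Gaussian classes.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
y = np.repeat([0, 1], 100)
acc = (GaussianClassifier().fit(X, y).predict(X) == y).mean()
```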

  • Human evaluation experiments

    Human recognition results provide a useful benchmark for the audio, visual and audio-visual scenarios.

    10 subjects: 5 native speakers and 5 non-native; half of the evaluators were female.

    10 different sets of the data were created using the Balanced Latin Square method [Edwards, 1962].

    Training: three facial-expression pictures, two audio files, and a short movie clip for each emotion.

    Emotion groups: Displeased (anger, disgust), Gloomy (fear, sadness), Excited (happiness, surprise) and Neutral (neutral).


  • Experiments

    Three conditions: audio, visual, and audio-visual.

    Decision-level fusion [Haq, Jackson & Edge, 2008]:
    Overcomes the problem of different time scales and metric levels of audio and visual data.
    Avoids the high dimensionality of the concatenated vector that results from feature-level fusion.

    Related work: Busso et al., 2004; Wang et al., 2005; Zeng et al., 2007; Pal et al., 2006.
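    Decision-level fusion of this kind is often implemented as a weighted combination of per-modality class posteriors. A hypothetical sketch; the presentation does not state the exact fusion rule, and the weighting parameter alpha is an assumption.

```python
import numpy as np

def fuse_decisions(p_audio, p_visual, alpha=0.5):
    """Weighted product of per-modality class posteriors (hypothetical rule).

    p_audio, p_visual: (n_samples, n_classes) posteriors from separately
    trained audio and visual classifiers; alpha weights the audio stream.
    """
    fused = (p_audio ** alpha) * (p_visual ** (1 - alpha))
    fused /= fused.sum(axis=1, keepdims=True)       # renormalise
    return fused.argmax(axis=1)

# Example: visual evidence for class 1 outweighs audio evidence for class 0.
p_audio = np.array([[0.6, 0.4]])
p_visual = np.array([[0.3, 0.7]])
pred = fuse_decisions(p_audio, p_visual)
```

    Because the two streams are classified separately and only their decisions (posteriors) are combined, the differing time scales and feature dimensionalities of audio and video never have to be reconciled in one vector.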


  • Results

    Figure: Average recognition rate (%) over 4 actors with audio, visual, and audio-visual features for 7 emotion classes.


  • Figure: Average recognition rate (%) over 4 actors with audio, visual, and audio-visual features for 4 emotion classes.


  • Results: 7 Emotion Classes


    Recognition rate (%) per actor (KL, JE, JK, DC):

    Human:
                  KL    JE    JK    DC    Mean (CI)
    Audio         53.2  67.7  71.2  73.7  66.5 ± 2.5
    Visual        89.0  89.8  88.6  84.7  88.0 ± 0.6
    Audio-visual  92.1  92.1  91.3  91.7  91.8 ± 0.1

    LDA 6:
                  KL    JE    JK    DC    Mean (CI)
    Audio         35.0  62.5  56.7  70.8  56.3 ± 6.7
    Visual        96.7  92.5  92.5  100   95.4 ± 1.6
    Audio-visual  97.5  96.7  95.8  100   97.5 ± 0.8

    PCA 10:
                  KL    JE    JK    DC    Mean (CI)
    Audio         30.8  51.7  55.0  62.5  50.0 ± 6.0
    Visual        91.7  87.5  89.2  98.3  91.7 ± 2.1
    Audio-visual  92.5  86.7  94.2  98.3  92.9 ± 2.1

  • Results


    4 Emotion Classes

    Recognition rate (%) per actor (KL, JE, JK, DC):

    Human:
                  KL    JE    JK    DC    Mean (CI)
    Audio         63.2  80.9  79.2  82.0  76.3 ± 2.4
    Visual        90.6  97.2  90.0  87.4  91.3 ± 1.1
    Audio-visual  94.4  98.3  93.5  94.5  95.2 ± 0.6

    PCA 7:
                  KL    JE    JK    DC    Mean (CI)
    Audio         50.0  58.3  65.8  72.5  61.7 ± 4.3
    Visual        80.8  98.3  93.3  88.3  90.2 ± 3.3
    Audio-visual  83.3  97.5  95.0  93.3  92.3 ± 2.7

    LDA 3:
                  KL    JE    JK    DC    Mean (CI)
    Audio         60.0  75.8  58.3  80.0  68.5 ± 4.8
    Visual        97.5  100   94.2  100   97.9 ± 1.2
    Audio-visual  97.5  100   94.2  100   97.9 ± 1.2

  • Summary

    Conclusions:
    For the British English emotional database, a speaker-dependent recognition rate comparable to human performance was achieved.
    LDA outperformed PCA with the top 40 features.
    Both audio and visual information were useful for emotion recognition.
    Important audio features: energy and MFCCs.
    Important visual features: mean of the marker y-coordinates (vertical movement of the face).
    A reason for the system's higher performance: human evaluators were not given an equal opportunity to exploit the data.

    Future work: perform speaker-independent experiments, using more data and other classifiers such as GMM and SVM.


  • References

    Ashraf, A.B., Lucey, S., Cohn, J.F., Chen, T., Ambadar, Z., Prkachin, K., Solomon, P. & Theobald, B.J. (2007). The Painful Face: Pain Expression Recognition Using Active Appearance Models, Proc. Ninth ACM Int'l Conf. Multimodal Interfaces, pp. 9-14.

    Bartlett, M.S., Littlewort, G., Frank, M. et al. (2005). Recognizing Facial Expression: Machine Learning and Application to Spontaneous Behavior, Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, pp. 568-573.

    Borchert, M. & Düsterhöft, A. (2005). Emotions in Speech: Experiments with Prosody and Quality Features in Speech for Use in Categorical and Dimensional Emotion Recognition Environments, Proc. NLPKE, pp. 147-151.

    Busso, C. & Narayanan, S.S. (2007). Interrelation between Speech and Facial Gestures in Emotional Utterances: A Single Subject Study. IEEE Trans. on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2331-2347.

    Busso, C., Deng, Z., Yildirim, S., Bulut, M., Lee, C.M., Kazemzadeh, A., Lee, S., Neumann, U. & Narayanan, S. (2004). Analysis of Emotion Recognition Using Facial Expressions, Speech and Multimodal Information, Proc. ACM Int'l Conf. Multimodal Interfaces, pp. 205-211.

    Chen, C. (1978). Pattern Recognition and Signal Processing. Sijthoff & Noordhoff, The Netherlands.

    Edwards, A.L. (1962). Experimental Design in Psychological Research. New York: Holt, Rinehart and Winston.

    3dMD 4D Capture System. Online: http://www.3dmd.com, accessed 3 May 2009.


  • Haq, S., Jackson, P.J.B. & Edge, J. (2008). Audio-visual feature selection and reduction for emotion classification, Proc. Auditory-Visual Speech Processing, pp. 185-190.

    Lin, Y. & Wei, G. (2005). Speech Emotion Recognition Based on HMM and SVM, Proc. 4th Int'l Conf. on Mach. Learn. and Cybernetics, pp. 4898-4901.

    Luengo, I., Navas, E. et al. (2005). Automatic Emotion Recognition using Prosodic Parameters, Proc. Interspeech, pp. 493-496.

    Pal, P., Iyer, A.N. & Yantorno, R.E. (2006). Emotion Detection from Infant Facial Expressions and Cries, Proc. IEEE Int'l Conf. Acoustics, Speech and Signal Processing, vol. 2, pp.721-724.

    Schuller, B., Muller, R., Hornler, B. et al. (2007). Audiovisual Recognition of Spontaneous Interest within Conversations, Proc. ACM Int'l Conf. Multimodal Interfaces , pp. 30-37.

    Song, M., Bu, J., Chen, C. & Li, N. (2004). Audio-Visual-Based Emotion Recognition: A New Approach, Proc. Int'l Conf. Computer Vision and Pattern Recognition, pp. 1020-1025.

    Valstar, M.F., Gunes, H. & Pantic, M. (2007). How to Distinguish Posed from Spontaneous Smiles Using Geometric Features, Proc. ACM Int'l Conf. Multimodal Interfaces, pp. 38-45.

    Wang, Y. & Guan, L. (2005). Recognizing Human Emotion from Audiovisual Information, Proc. Int'l Conf. Acoustics, Speech, and Signal Processing, pp. 1125-1128.

    Zeng, Z., Tu, J., Liu, M., Huang, T.S., Pianfetti, B., Roth, D. & Levinson, S. (2007). Audio-Visual Affect Recognition. IEEE Trans. Multimedia, vol. 9, no. 2, pp. 424-428.


  • Thank you for your attention!


    International Conference on Auditory-Visual Speech Processing, UEA, Norwich, UK