Speech Based Emotion Classification Framework for Driver Assistance System

Ashish Tawari and Mohan Trivedi
LISA: Laboratory for Intelligent and Safe Automobiles

University of California San Diego, Dept. of ECE
[email protected], [email protected]

Abstract— Automated analysis of human affective behavior has attracted increasing attention in recent years. A driver's emotion often influences driving performance, which can be improved if the car actively responds to the emotional state of the driver. It is important for an intelligent driver support system to accurately monitor the driver's state in an unobtrusive and robust manner. The ever-changing environment while driving poses a serious challenge to existing techniques for speech emotion recognition. In this paper, we utilize contextual information about the outside environment as well as the user inside the car to improve emotion recognition accuracy. In particular, a noise cancellation technique is used to suppress noise adaptively based on the driving context, and gender-based context information is analyzed for developing the classifier. Experimental analyses show promising results.

Index Terms— Emotion recognition; vocal expression; affective computing; affect analysis; context analysis

I. INTRODUCTION

The automotive industry is integrating more and more technologies into modern cars. They are equipped with numerous interactive technologies including high quality audio/video systems, satellite navigation, hands-free mobile communication, and control over climate and car behavior. With the increasing use of such on-board electronics and in-vehicle information systems, the evaluation of driver task demand is becoming an area of increasing importance to both government agencies and industry [1]. With potentially more complex devices for the driver to control, the risk of distracting the driver's attention increases. Current research and attention theory have suggested that speech-based interactions are less distracting than interaction with a visual display [2]. Therefore, to improve both comfort and safety in the car, driver assistance technologies need an effective speech interface between the driver and the infotainment system. The introduction of speech-based interactions and conversation into the car brings out the potential importance of linguistic cues (like word choice and sentence structure) and paralinguistic cues (like pitch, intensity, and speech rate). Such cues play a fundamental role in enriching human-human interaction and incorporate, among other things, personality and emotion [3]. Driving in particular presents a context in which a user's emotional state plays a significant role. Emotions have been found to affect cognitive style and performance. Even mildly positive feelings can have a profound effect on the flexibility and efficiency of thinking and problem solving [4].

The road-rage phenomenon [5] is a well recognized situation where emotions can directly impact safety. Such a phenomenon can be mitigated and driving performance can be improved if the car actively responds to the emotional state of the driver. Research studies show that matching the in-car voice to the driver's emotional state has a great impact on driving performance [6]. While a number of studies have shown the potential of emotion monitoring, automatic recognition of emotion is still a very challenging task, especially when it comes to real-world driving scenarios. The car setting is far from the ideal environment for speech acquisition, and processing needs to deal with reverberation and noisy conditions inside the car cockpit. Speech recognition and rendering technology needs to make sure not to negatively influence the user; e.g., the user can become frustrated by poor voice recognition. In this paper, we show the potential of utilizing contextual information about the outside car environment and the driver inside the car towards improved emotion classification.

II. DRIVER AFFECT ANALYSIS: BRIEF OVERVIEW

Automated analysis of human affective behavior has attracted increasing attention in recent years. With the research shift towards spontaneous behavior, real-world database collection strategies [7], new feature sets [8], the use of context information (e.g., subject, gender) and the development of application-specific systems [9] are encouraged for improved system performance. In [10], a good overview of recent advancement towards spontaneous behavior analysis in the audio, visual and audiovisual domains is presented. We focus our attention on driving-specific applications.

[11] performs stress level recognition analysis in real-world driving tasks by monitoring physiological changes using biometric measurements. The analysis shows that three stress levels could be recognized with an overall accuracy of 97.4%. [7] estimates the driver's irritation level using physiological signals and Bayesian networks. No quantitative analysis is presented; however, qualitative analysis suggests the possibility of irritation estimation to a certain extent. In [12], a driving simulator along with other peripherals (microphones and speakers) is employed for the analysis of emotion recognition. The system uses 10 acoustic features including pitch, volume, rate of speech and other spectral coefficients. The recognition performance is assessed by comparing the human emotion transcript against the output from the automatic emotion recognition system.

The qualitative analysis suggests that the system is capable of detecting the driver's emotion with sufficient accuracy in order to support and improve driving. This study, however, ignores the presence of background noise. [9] shows the feasibility of detecting the driver's emotional state via speech in the presence of various noise scenarios. The speech signal is superimposed with car noise of several car types and various road surfaces. The emotion conveyed in the noisy speech signal was automatically classified using 20 acoustic features. Training of classifiers is performed using clean speech and noisy speech. The latter provides matched training and test conditions and shows better performance. Next, it shows the usefulness of high-pass filtering to reduce the noise level for improved performance in the case of training with clean speech. In this work, we extend this idea and incorporate contextual information about the outside and inside car environment. In particular, we show the potential of using an adaptive filtering technique to reduce the noise level. Further, we demonstrate the use of gender information (user context) for improved emotion classification.

III. SPEECH TO EMOTION: COMPUTATIONAL FRAMEWORK

Our focus in this paper is to analyze the performance of a system which can adapt to the continually changing outside environment as well as to the user inside the car. While it is only a matter of time before cars are fitted with camera technology, we have adopted a speech based emotion recognition system since speech interaction systems are commonplace in today's cars. The major challenge in speech processing technology is the presence of non-stationary noise, specifically during driving, due to the ever-changing outside car environment. An adaptive noise cancellation technique seems an obvious choice to enhance the speech quality based on the context of the outside environment. On the other hand, use of user context information (e.g. gender) can improve the emotion recognition [13], [14], [15]. We explore the usefulness of such context information in the driving task.

Fig. 1 shows the block diagram representing the three fundamental steps of information acquisition and preprocessing, extraction and processing of parameters, and classification of semantic units. The pre-processing stage consists of a speech enhancement module which adapts to the context of the outside environment. Time series of acoustic features are extracted from the enhanced speech, which in turn are post-processed to provide statistical information over the segment to be analyzed. The classification system has three different phases: feature extraction and selection (phase 1) to identify features to be used during classification; the model training phase (phase 2); and finally, the testing phase (phase 3) to evaluate the performance of the system in terms of classification accuracy. The remainder of the paper is organized based on these phases.

Fig. 1. Emotion classification system.

IV. COMPUTATIONAL MODULES

A. Adaptive speech enhancement

During the last decades, several speech enhancement techniques have been proposed, ranging from beamforming through microphone arrays to adaptive noise filtering approaches. In this work, we utilize a speech enhancement technique based on adaptive thresholding in the wavelet domain [16]. The objective criterion for parameter selection used in our experiments, however, was the classification performance as opposed to the Signal-to-Noise Ratio (SNR). Fig. 2 shows the visualization of the output of the speech enhancement technique.
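
The exact adaptive wavelet-packet thresholding scheme is the one described in [16]; as a rough, non-authoritative illustration of the underlying wavelet-denoising idea only, a minimal sketch is given below. It assumes NumPy and the PyWavelets package and uses a fixed universal threshold, whereas [16] adapts the threshold and, in our case, the parameters are selected by classification performance rather than SNR.

    import numpy as np
    import pywt

    def denoise_speech(noisy, wavelet="db8", level=5):
        # Decompose the noisy signal into approximation + detail coefficients.
        coeffs = pywt.wavedec(noisy, wavelet, level=level)
        # Robust noise estimate from the finest detail band (median absolute deviation).
        sigma = np.median(np.abs(coeffs[-1])) / 0.6745
        # Fixed universal threshold; an adaptive scheme would vary this per band.
        thr = sigma * np.sqrt(2.0 * np.log(len(noisy)))
        # Soft-threshold the detail coefficients, keep the approximation untouched.
        coeffs[1:] = [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
        return pywt.waverec(coeffs, wavelet)[: len(noisy)]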

B. Feature extraction

A variety of features have been proposed to recognize emotional states from the speech signal. These features can be categorized as acoustic features and linguistic features. We avoid using the latter since they demand robust recognition of speech in the first place and are also a drawback for multi-language emotion recognition. Hence we only consider acoustic features. Within the acoustic category, we focus on prosodic features like speech intensity, pitch and speaking rate, and spectral features like mel frequency cepstral coefficients (MFCC) to model emotional states. For pitch calculation, we used an auto-correlation algorithm similar to [17]. Speech intensity is represented by log-energy coefficients calculated using 30 ms frames with a shift interval of 10 ms.
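
As a simple illustration of this frame-level processing, the sketch below computes the log-energy contour with 30 ms frames and a 10 ms shift. A 16 kHz sampling rate is assumed (matching our resampled audio); the function and parameter names are illustrative only.

    import numpy as np

    def log_energy_contour(signal, fs=16000, frame_ms=30, shift_ms=10, eps=1e-10):
        frame_len = int(fs * frame_ms / 1000)   # 480 samples at 16 kHz
        shift = int(fs * shift_ms / 1000)       # 160 samples at 16 kHz
        contour = []
        for start in range(0, len(signal) - frame_len + 1, shift):
            frame = np.asarray(signal[start:start + frame_len], dtype=float)
            contour.append(np.log(np.sum(frame ** 2) + eps))  # frame log-energy
        return np.array(contour)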

The value of framewise parameters extracted from a few milliseconds of audio is of little significance in determining an emotional state. It is, on the contrary, of interest to capture the time trend of these features. In order to capture the characteristics of the contours, we perform cepstrum analysis over the contour. For this, we first interpolate the contour

Fig. 2. Speech enhancement performance. Top: clean speech, middle: noisy speech (clean speech corrupted by WGN) and bottom: enhanced speech. Signal-to-Noise-Ratio (SNR) in noisy speech is (a) 10 dB and (b) 0 dB.

to obtain samples separated by the sampling period of the speech signal, which are then used to calculate the cepstrum coefficients as follows:

$$c(m) = \frac{1}{N}\sum_{k=0}^{N-1} \log |X(k)|\, e^{2\pi k m j / N}, \qquad m = 0, 1, \cdots, N-1$$

where X(k) denotes the N-point discrete Fourier transform of the windowed signal x(n). Cepstrum analysis is a source-filter separation process commonly used in speech processing. Cepstrum coefficients c(0) to c(13) and their time derivatives (first and second order), calculated from 480 samples, are utilized to obtain the spectral characteristics of the contours. For pitch contour analysis, only the voiced portion is utilized. Other features that we utilized are 13 MFCCs C1–C13 with their delta and acceleration components. The input signal is processed using a 30 ms Hamming window with a frame shift interval of 10 ms.
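
Since the equation above is the inverse DFT of the log-magnitude spectrum, the contour cepstrum computation could be sketched as follows. The 480-sample segment length follows the text; the Hamming windowing of the contour and the small epsilon guarding against log(0) are assumptions for illustration.

    import numpy as np

    def contour_cepstrum(contour, n_coeffs=14, n_fft=480, eps=1e-10):
        x = np.asarray(contour, dtype=float)[:n_fft]
        x = x * np.hamming(len(x))                      # windowing (assumption)
        X = np.fft.fft(x, n=n_fft)                      # N-point DFT
        c = np.fft.ifft(np.log(np.abs(X) + eps)).real   # inverse DFT of log|X(k)|
        return c[:n_coeffs]                             # c(0) ... c(13)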

For all these sequences, the following statistical information is calculated: mean, standard deviation, relative maximum/minimum, position of relative maximum/minimum, 1st quartile, 2nd quartile (median) and 3rd quartile. Speaking rate is modeled as the fraction of voiced segments. Thus, the total feature vector per segment contains 3 · (13 + 13 + 13) · 9 + 1 = 1054 attributes.
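
A sketch of the nine per-sequence statistics (functionals) is given below; interpreting the "position of relative maximum/minimum" as the frame index normalized by the sequence length is our assumption for illustration.

    import numpy as np

    def segment_statistics(seq):
        seq = np.asarray(seq, dtype=float)
        n = len(seq)
        return np.array([
            seq.mean(),                # mean
            seq.std(),                 # standard deviation
            seq.max(),                 # relative maximum
            seq.min(),                 # relative minimum
            np.argmax(seq) / n,        # position of relative maximum
            np.argmin(seq) / n,        # position of relative minimum
            np.percentile(seq, 25),    # 1st quartile
            np.percentile(seq, 50),    # 2nd quartile (median)
            np.percentile(seq, 75),    # 3rd quartile
        ])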

C. Feature Selection

Intuitively, a large number of features would improve the classification performance; in practice, however, a large feature space suffers from the 'curse of dimensionality'. Therefore, in order to improve the classification performance, a feature selection technique is utilized.

One such method to eliminate redundant and irrelevant features is to identify features with high correlation with the class but low correlation among themselves. We used the CfsSubsetEval feature selection technique with a best-first search strategy provided by WEKA [18]. We used a stratified 10-fold selection procedure, where one fold is removed and the rest

Fig. 3. Classification accuracy over 3 classes in LISA-AVDB using different aggregates of features and an SVM classifier for 'combined' (male+female), 'male-only', and 'female-only' contexts.

is used for feature selection. Thus, we have 10 different sets of selected attributes. An attribute may get selected n times across the different sets. We group the attributes which are selected at least n times and call them the 'n-10' aggregate. Fig. 3 shows the performance results for different aggregates with an SVM classifier for gender-dependent and gender-independent analysis. The '2-10' aggregate provided the best results for the 'combined' (male+female) and 'male-only' contexts, and the '4-10' aggregate provided the best results for the 'female-only' context.
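
The per-fold selection itself is run in WEKA; the aggregation step alone can be sketched as below, where selected_per_fold (an assumed name) holds the ten per-fold attribute subsets returned by the selector.

    from collections import Counter

    def n_10_aggregate(selected_per_fold, n):
        # selected_per_fold: list of 10 attribute subsets, one per selection fold
        counts = Counter(attr for fold in selected_per_fold for attr in set(fold))
        # Keep every attribute chosen in at least n of the 10 folds.
        return sorted(attr for attr, c in counts.items() if c >= n)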

V. IN-VEHICLE DATA COLLECTION

Collecting real-world driving data is certainly useful for researchers interested in the design and development of intelligent vehicular systems capable of dealing with different situations in traffic. Real-world data, as opposed to driving simulator recordings, are much more closely related to day-to-day situations and can provide valuable information on drivers' behavior. In our ongoing research on driver assistance systems [19] and audiovisual scene understanding [20], emotion recognition using multiple modalities will certainly

Fig. 4. Data acquisition and online recognition setting. Numbered boxes are: 1. Camera, 2. Microphone, 3. Online recognition of emotional speech interface, 4. User.

help us improve the interface. With this motivation, we have put in significant effort on the collection of an audiovisual affect database in the challenging ambience of car settings.

A. LISA audio-visual affect database

The user at the driver's seat was prompted by a computer program specifying the emotion to be expressed. It also provides an example of an utterance that can be used by the driver. We also recorded free conversation between passenger and driver. The database is collected in stationary and moving car environments. The cockpit of an automobile does not provide the comfort of a noiseless anechoic environment. In fact, a moving automobile with a lot of road noise has a drastic effect on the SNR of the audio channel, as well as challenging illumination conditions for the video channel. In this study, we analyze emotional speech data in the stationary car setting, which gives the effect of the car cockpit with a relatively high SNR value.

The database is collected with the use of an analog video camera facing the driver and a directional microphone beneath the steering wheel. Fig. 4 shows the settings of the camera and microphone. Video frames were acquired at approximately 30 frames per second, and the captured audio signal is resampled to a 16 kHz sampling rate. A software tool for synchronizing as well as labeling the data was developed. Fig. 5 shows a snippet of the tool. The emotional speech has been labeled into 3 groups, 'pos', 'neg' and 'neu', for positive, negative and neutral expressions respectively. The data has been acquired with 4 different subjects: 2 male and 2 female. The distribution of data for the different categories is: 82 pos, 82 neg, and 60 neu.

VI. EXPERIMENTAL ANALYSIS

In this section, we provide a series of experiments using context information based on the outside and inside of the car. The recognition system is required to incorporate different noise conditions and the changing environment while driving. To study the feasibility of emotion detection while driving, we aim to emulate the vehicle conditions. To reflect

Fig. 5. Development platform for audiovisual scene synchronization, cropping and labeling. It has the capability of voice activity detection (VAD) to decide the precise onset of the speech signal.

Fig. 6. SNR distribution in two extreme driving scenarios: freeway (left) and parking lot (right).

the reality adequately, several noise scenarios of approximately 60 seconds were recorded in the car while driving. Interior noise depends on several factors such as vehicle type, road surfaces, outside environment, etc. In this study, an instrumented Infiniti Q45 on-road testbed is used to record the interior noise in the following scenarios: highway (FWY), parking lot (PRK) and city street (CST). To cover a wide range of car conversations, speech should be superimposed with the recorded noise at the correct Signal-to-Noise Ratio (SNR). To estimate the SNR level, we analyze SNR levels based on real-world data collection. Fig. 6 shows the distribution of SNR levels in the two extreme cases of the freeway driving and parking lot scenarios. Hence the different noise models as explained in Section III are added to clean speech at three SNR levels: 15 dB, 10 dB and 5 dB.
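
A minimal sketch of superimposing a recorded noise segment on clean speech at a target SNR is shown below. The scaling follows the standard power-ratio definition of SNR; the function name and the looping/trimming of the noise to the speech length are illustrative assumptions.

    import numpy as np

    def add_noise_at_snr(speech, noise, snr_db):
        speech = np.asarray(speech, dtype=float)
        noise = np.resize(np.asarray(noise, dtype=float), len(speech))  # loop/trim noise
        p_speech = np.mean(speech ** 2)
        p_noise = np.mean(noise ** 2)
        # Choose gain g so that 10*log10(p_speech / (g^2 * p_noise)) == snr_db.
        gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
        return speech + gain * noise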

All results are calculated from 10-fold cross validation. For classification we use Support Vector Machines (SVM) with a linear kernel and 1-vs-1 multiclass discrimination. The SVMs are trained with the Sequential Minimal Optimization (SMO) algorithm. Table I provides the performance measurement on clean speech. An overall recognition accuracy of over 87% is achieved for three-class emotion detection. These baseline results show the usefulness of our feature set and classification technique. Further, the use of gender information increases the overall accuracy to 94% (Table II). Overall recognition accuracies with and without front-end processing (speech enhancement) are reported in Table III.
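
For reference, a sketch of an equivalent classification setup is given below, using scikit-learn's SVC (libsvm, an SMO-type solver with one-vs-one multiclass handling) as a stand-in for WEKA's SMO; the feature matrix X and label vector y (pos/neu/neg) are assumed to be precomputed as described above.

    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.svm import SVC

    def overall_accuracy(X, y):
        # Linear-kernel SVM with one-vs-one multiclass discrimination.
        clf = SVC(kernel="linear", decision_function_shape="ovo")
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
        # Mean accuracy over 10-fold cross validation.
        return cross_val_score(clf, X, y, cv=cv).mean()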

TABLE I
GENDER INDEPENDENT CONFUSION TABLE USING CLEAN SPEECH FOR LISA-AVDB DATABASE WITH 88.1% OVERALL RECOGNITION RATE

Reference    Recognized Emotion (%)
Emotion      pos      neu      neg
pos          92.7     7.3      0
neu          9.7      80.7     9.6
neg          1.7      8.3      90.0

TABLE II
GENDER DEPENDENT CONFUSION TABLE USING CLEAN SPEECH FOR LISA-AVDB DATABASE WITH 93.9% OVERALL RECOGNITION RATE

Reference    Recognized Emotion (%)
Emotion      pos      neu      neg
pos          96.4     1.2      2.4
neu          8.5      90.3     1.2
neg          0        5.0      95.0

It is evident that noise cancellation greatly improves the performance. In particular, an improvement of approximately 20% is achieved at 5 dB SNR for the different noise models. It can also be noted that the system performs better for stationary noise (WGN). The high recognition accuracy of noisy speech at higher SNR can be attributed to relatively matched training and testing conditions, as the database is collected in car settings. A similar trend can also be observed for a database collected in controlled studio settings [21].

VII. CONCLUSION AND DISCUSSION

Recognition of emotions in the speech signal is highly sensitive to the acoustic environment. We have shown that adapting to the context of the outside environment and the users of the car greatly improves the system's performance. In particular, the adaptive noise cancellation technique and the use of gender information provide better overall recognition accuracy. It would be interesting to see the performance of a system which combines the two techniques; we propose to investigate that as part of our future work. We would also like to evaluate the performance of the overall system when automatic gender detection is added as a preceding stage. The analysis also suggested that at higher SNR levels even the noisy speech can provide good recognition due to the relatively matched training and testing environment. This strengthens our determination to collect more real-world driving data. In our ongoing effort, we have been acquiring data of natural conversations between driver and passenger in both the audio and video domains. In the future, we will analyze the video modality and extend the framework to emotion analysis from audiovisual cues.

ACKNOWLEDGMENTS

The authors gratefully acknowledge reviewers' comments. We are thankful to our colleagues at the CVRR lab for useful discussions and assistance.

REFERENCES

[1] Y. Noy, “International harmonized research activities report of working group on intelligent transportation systems (ITS),” in Proc. 17th Int. Tech. Conf. Enhanced Safety of Vehicles, 2001.

TABLE III
EMOTION RECOGNITION PERFORMANCE USING DIFFERENT NOISE MODELS AT THREE SNR LEVELS FOR LISA-AVDB DATABASE

Overall recognition accuracy (%) for noisy and enhanced speech

             SNR = 15 dB            SNR = 10 dB            SNR = 5 dB
Test Cases   Noisy     Enhanced     Noisy     Enhanced     Noisy     Enhanced
WGN          81.7      82.1         73.6      78.5         56.6      75.4
FWY          75.8      76.8         75.0      76.8         55.3      62.5
PRK          78.1      77.57        71.5      73.2         53.6      65.1
CST          78.5      78.1         71.0      75.8         56.7      69.1

[2] H. Lunenfeld, “Human factor considerations of motorist navigation and information systems,” in Vehicle Navigation and Information Systems Conference, 1989.

[3] D. L. Strayer and W. A. Johnston, “Driven to distraction: Dual-task studies of simulated driving and conversing on a cellular telephone,” Psychological Science, vol. 12, no. 6, pp. 462–466, 2001.

[4] N. Murray, H. Sujan, E. R. Hirt, and M. Sujan, “The influence of mood on categorization: A cognitive flexibility interpretation,” Journal of Personality and Social Psychology, vol. 59, no. 3, pp. 411–425, 1990.

[5] T. E. Galovski and E. B. Blanchard, “Road rage: a domain for psychological intervention?” Aggression and Violent Behavior, vol. 9, no. 2, pp. 105–127, 2004.

[6] I.-M. Jonsson, C. Nass, H. Harris, and L. Takayama, “Matching in-car voice with driver state: Impact on attitude and driving performance,” in Proc. of the International Driving Symposium on Human Factors in Driver Assessment, Training and Vehicle Design, 2005, pp. 173–181.

[7] L. Malta, P. Angkititrakul, C. Miyajima, and K. Takeda, “Multi-modal real-world driving data collection, transcription, and integration using Bayesian network,” in Intelligent Vehicles Symposium, 2008 IEEE, June 2008, pp. 150–155.

[8] L. Devillers and L. Vidrascu, “Real-life emotions detection with lexical and paralinguistic cues on human-human call center dialogs,” in Proc. Ninth Int’l Conf. Spoken Language Processing (ICSLP), 2006.

[9] M. G. et al., “On the necessity and feasibility of detecting a driver’s emotional state while driving,” in ACII, 2007, pp. 126–138.

[10] Z. Zeng, M. Pantic, G. Roisman, and T. Huang, “A survey of affect recognition methods: Audio, visual, and spontaneous expressions,” IEEE Transactions on PAMI, vol. 31, no. 1, pp. 39–58, Jan. 2009.

[11] J. A. Healey and R. W. Picard, “Detecting stress during real-world driving tasks using physiological sensors,” IEEE Transactions on Intelligent Transportation Systems, vol. 6, pp. 156–166, 2005.

[12] C. M. Jones and I.-M. Jonsson, “Performance analysis of acoustic emotion recognition for in-car conversational interfaces,” in HCI (6), 2007, pp. 411–420.

[13] C. M. Lee and S. S. Narayanan, “Toward detecting emotions in spoken dialogs,” IEEE Transactions on Speech and Audio Processing, vol. 13, no. 2, pp. 293–303, 2005.

[14] D. Ververidis and C. Kotropoulos, “Automatic speech classification to five emotional states based on gender information,” in Proc. European Signal Processing Conference (EUSIPCO), June 2004, pp. 341–344.

[15] A. Tawari and M. M. Trivedi, “Context analysis in speech emotion recognition,” IEEE Transactions on Multimedia, 2010.

[16] Y. Ghanbari and M. R. Karami-Mollaei, “A new approach for speech enhancement based on the adaptive thresholding of the wavelet packets,” Speech Communication, vol. 48, no. 8, pp. 927–940, 2006.

[17] P. Boersma, “Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound,” in Proc. Inst. of Phonetic Sciences 17, 1993, pp. 97–110.

[18] S. R. Garner, “Weka: The Waikato environment for knowledge analysis,” in Proc. of the New Zealand Computer Science Research Students Conference, 1995, pp. 57–64.

[19] M. M. Trivedi, T. Gandhi, and J. McCall, “Looking-in and looking-out of a vehicle: Computer-vision-based enhanced vehicle safety,” IEEE Transactions on Intelligent Transportation Systems, March 2007.

[20] S. T. Shivappa, B. D. Rao, and M. M. Trivedi, “Role of head pose estimation in speech acquisition from distant microphones,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2009, pp. 3557–3560.

[21] A. Tawari and M. M. Trivedi, “Speech emotion analysis in noisy real world environment,” in Proceedings of the 20th International Conference on Pattern Recognition (ICPR), 2010.
