
Novel Multimodel Approach for Marathi Speech Emotion Detection

Vaijanath V. Yerigeri and L. K. Ragha

Abstract Speech is human vocal communication using the vocal tract. While speaking, the speaker may perform a variety of intentional speech acts. Voice tones of different emotions are easily understood even by primitive peoples. Voice is the first and foremost reactive action by Homo sapiens to express intended emotions. In today's fast-paced world, every working professional is under an enormous amount of stress. Stress is a serious threat to human health and is proving to be a silent killer. The remedy is early detection of symptoms, which will reduce such cases. This paper focuses on stress-emotion detection in a regional language, i.e., Marathi, and presents a multimodel approach. Features such as the Mel-Frequency Cepstral Coefficients (MFCC), energy, pitch, and vocal tract frequency are extracted from the given speech signal, and an Artificial Neural Network (ANN) is trained on them. The implementation is done in MATLAB. The combination of the multimodel feature set with the ANN gives satisfactory results of up to 83–84% for identifying the negative emotions, i.e., sadness and anger, in speech.

Keywords Speech emotion recognition system · Feature extraction · Mel-frequency cepstral coefficient · Vocal tract frequency · Pitch · Energy · Artificial neural network

1 Introduction

Charles Darwin wrote a monograph on the expression of emotion by Homo sapiens [1]. This milestone work generated a new wave of interest among psychologists, who gradually collected information and expanded the knowledge base. Human emotions have served a long evolutionary purpose in our survival as a species. Psychologists observed that Homo sapiens use many nonverbal cues, such as facial expressions, gestures, body

V. V. Yerigeri (B)
M.B.E.S. College of Engineering, Ambajogai 431 517, M.S., India

L. K. Ragha
Terna Engineering College, Nerul, Navi-Mumbai, M.S., India

© Springer Nature Singapore Pte Ltd. 2020
V. Bhateja et al. (eds.), Intelligent Computing and Communication, Advances in Intelligent Systems and Computing 1034, https://doi.org/10.1007/978-981-15-1084-7_20


language, or tone of voice to convey their emotions. Emotions are expressions of an internal thought process, or perhaps a reaction to an external stimulus.

Cognitive science is the interdisciplinary branch that examines human cognition. This science deals with emotion, reasoning, attention, memory, perception, and language. Understanding these functions requires technological support from Artificial Intelligence (AI), linguistics, and neuroscience. Cognitive science works on the fundamental premise that the "process of thinking may be best understood by representational structures of the mind and the computational procedures that can be applied to those structures". Identifying emotions during social interaction is paramount and is done subconsciously by humans. Building and maintaining human relationships depends upon a person's ability to understand the emotions communicated by other person(s). Misidentification of emotion(s) may result in the breakdown of relationships and hence social detachment.

Voice emotions are clearly recognized by all people [2–4]. Even the most primitive peoples can recognize the tones of different emotions. Animals can also recognize the meaning of the human voice. Thus, the language of tone is universal across all means of communication and can convey a physical state that includes the personality, appearance, intelligence [5], age group, and gender of a speaker [6].

The authors work on Speech Emotion Recognition (SER) because, in the case of facial emotion recognition, the speakers must either meet in person or be on a video call for distant communication. Even though technology has advanced to the point where video calls are feasible, they still have limitations due to video streaming speed and quality. A voice call is therefore the preferred mode of communication, as it does not require the speakers to meet physically. A phone call or Voice over Internet Protocol (VoIP) is the most convenient option.

SER is a sub-branch of speech perception research. It uses AI technology and is therefore also referred to as Emotion AI (EAI), Artificial Emotion Recognition (AER), or Affective Computing (AC) [7]. AC deals with recognizing, interpreting, understanding, and replicating human emotions [8]. The success of a Human–Computer Interface (HCI) depends crucially upon the accuracy of the SER system [9, 10].

Cowie and Cornelius [11] proposed a 2D model for empirical analysis which isshown below.

Dimension     Description
Valence       Negative and positive emotions
Activation    Related to the strength of a person's inherent qualities (character and mind)

Valence indicates whether an emotion is positive (happy, joy, trust) or negative (sad, disgust, fear, anger). Activation reflects the inherent qualities of a person, such as character and mindset; it represents the strength of a person's reaction in a particular situation. Speech


emotions can be verified statistically as well as empirically. A two-dimensional model is necessary and useful for deriving empirical results. A hearer can rate the emotions presented by the speaker with respect to the activation and valence dimensions. After that, statistical operations are performed on the speech signal to determine the emotion(s). Empirical results are compared with the statistical ones to calculate the ground truth. The author of [12] presented a model having different emotion states, e.g., joy, trust, fear, surprise, sadness, anger, disgust, and anticipation.

SER, being multimodal, is a complex system, but owing to its applications in medical diagnosis it has attracted increasing attention. SER can discern mental disorders [13]. It can also help diagnose Alzheimer's and Parkinson's disease [14]. Increasing suicide rates, catastrophic weather, a nasty political climate, and everyday annoyances such as unpleasant colleagues, train delays, and traffic are external stressors. Looking for a job, managing one's own chronic condition, parenting an anxious, depressed, autistic, disabled, or learning-impaired child, and caring for a loved one are home stressors. People working in the military and the Information Technology (IT) sector are highly stressed due to workload. Children are stressed due to cut-throat competition in education. Thus, in the modern world, stress is constant. Stress becomes chronic if it is paired with uncontrollable circumstances. It may lead to physical problems (alcoholism, asthma, fatigue) or even impulsive behavior such as suicide. Therefore, stress has been defined as a Silent Killer (SK) by medical experts [15]. A solution to this problem is Computer Voice Stress Analysis (CVSA). SER is a sub-branch of CVSA. SER enables realistic and natural HCI [16].

The objective of this research is to detect stress by analyzing the speech of a person spoken in the Marathi language. The escalating growth of the IT sector in Maharashtra is a motivational factor for selecting this language. Owing to late working hours and immense mental stress caused by cut-throat competition, approximately 68% of Indian IT professionals are heavily stressed or depressed [17]. Overstress leads to health issues and attracts a variety of diseases such as acid peptic disease, asthma, fatigue, and diabetes [18–20]. Even alcohol intake increases substantially. The situation is vicious in itself: stress leads to physical problems, and physical issues lead to more depression. Human emotions have a high correlation with fatigue and depression. One can figure out emotional prosody, i.e., the speech tone of an individual conveyed through changes in speech rate, timbre, loudness, pitch, and pauses, which is distinct from semantic and linguistic information. Thus, speech is a natural outlet through which stress is demonstrated.

Abdelwahab and Busso [21] used artificial intelligence. Different databases were trained and tested using a multi-corpus framework. An Adaptive Supervised Model (ASM) was considered to improve system performance. To evaluate the robustness of the system, they used mismatched training and testing conditions; the concept was Adaptive Supervised Learning (ASL). Jin et al. [22] worked on low-level features such as jitter, shimmer, spectral contour, fundamental frequency (F0), and intensity. Based on the low-level features, they generated divergent acoustic feature representations, including statistical features. The new representation was a combination of Gaussian supervectors and a set of low-level acoustic code words. Gammatone Cepstral Coefficients of the speech signal were proposed by Garg and Bahl [23]. Variance


Modeling and Discriminant Scale Frequency Maps (VM-DSFM) was proposed byWang et al. [24].

2 Database Creation

There are six basic emotions, namely happy, sad, disgust, anger, surprise, and fear. Emotions may also be detected using facial expressions and physiological methods, but these have drawbacks [25]. Databases for different languages are available, such as Danish, KISMET, BabyEars, SUSAS, and MPEG-4 [26]. CMU INDIC is the only source that provides a Marathi database. Waghmare et al. [27] recorded 25 audio clips of male as well as female speakers of the age group ranging from 21 to 41 years. They considered three emotions, and each speaker recorded 24 words per emotion. This database is not available to the public, and no benchmark database exists for the Marathi language [28]. Therefore, creating a database was a desideratum.

A spoken language varies with gender, age group, accent, and geographical region. While recording audio clips, parameters such as sampling rate, duration, type of recording (mono/stereo), number of bits, etc., should be set appropriately to get the maximum benefit. As the system is intended for use by common people, recordings by professionals should be avoided, since professional recordings may add superlative emotions [29, 30]. Based on these criteria, the authors created a Marathi database considering the following variations.

1. Age group: young (7–10 years), adolescent (15–18), youngster (18–35), middle age (35–55), and old age (above 60).

2. Gender: male/female.
3. Emotions: happy, anger, sad, surprise, fear, and neutral.
4. Sentence size: 5–7 words.
5. Repetition rate: 3.
6. Types of statement per speaker: 10.
7. Geographical zones covered: Pune, Ambajogai, Jalgaon, Kolhapur, Nagpur.

Spoken Marathi varies in accent and tone by district and locality. The authors considered points 1–7 and created 6000 recordings in .WAV format. The recording was done on a PC, with a sampling rate of 8000 Hz, 8 bits, mono, using the sound recorder of the Windows 7 operating system. The sound recorder normally produces the .WMA format; the files were then converted to .WAV format.

3 Formulating Problem Statement: Block Diagram of SER

Emotions are strongly associated with speech features, e.g., energy, pitch, jitter, shimmer, eloquence, timing, etc. [20, 31]. A variety of models of emotions in different dimensions has been presented by researchers [30]. Arousal is correlated with the intensity


or energy with which the speaker conveys his/her emotion. High activation is related to the emotions of anger and happiness. Thus, for a robust system, one should consider a number of features.

Refer to Fig. 1. In SER, two blocks are major, i.e., feature extraction and classification. Phonetic and prosodic are the two major feature sets; the features extracted for SER are described in Sect. 5. The SER system uses a supervised learning system to classify the features extracted from the speech signal. There are two phases, training and testing.

MATLAB is used for Monte Carlo simulation. The software can handle audio file formats such as .wav/.wma/.mp3. The input speech signal is first preprocessed. The speech signal should have a mono track; if the input given by the user is a stereo track, it is converted to a mono track. Normalizing the speech signal, removing noise, and segregating unvoiced and voiced sound are the next stages in preprocessing [32].

The features proposed to be extracted from the audio wave are energy, vocal tract frequency, Mel-Frequency Cepstral Coefficients (MFCC), and pitch. These features are extracted for the variety of audio signals described in Sect. 2. After operating upon a certain percentage of the dataset, the feature vector is generated. The generated vector is trained by a Supervised Machine Learning (SML) algorithm [33]. The SML produces a network file ('.net') based on the number of datasets. The file is later used in the testing phase.

In the testing phase, real-time audio, or an audio file that was not part of the training phase, is input to the SER. The next two stages, i.e., preprocessing and feature extraction, are the same as in the training phase. The new feature set generated from the new input file is input to the classifier and juxtaposed with the trained model. The Euclidean distance is computed; the minimum distance indicates the correlation of the input audio with a particular emotion. The confusion matrix presented in Sect. 7 depicts the accuracy of the system.
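As an illustration of this matching step, the following MATLAB sketch selects the emotion whose reference feature vector is closest in Euclidean distance to the test vector. The reference vectors, feature dimension, and emotion ordering are placeholders for illustration; the paper does not publish its trained feature values.

% Hypothetical matching step: choose the emotion whose reference
% vector lies at minimum Euclidean distance from the test vector.
refVecs = rand(5, 4);    % placeholder: 5 emotions x 4 features
testVec = rand(1, 4);    % placeholder: features of one test utterance
d = sqrt(sum(bsxfun(@minus, refVecs, testVec).^2, 2));  % Euclidean distances
[~, idx] = min(d);       % minimum distance indicates the detected emotion
emotions = {'sad', 'angry', 'surprise', 'happy', 'neutral'};
disp(emotions{idx});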

[Fig. 1 (block diagram): the training path takes the training speech files through speech preprocessing, feature extraction, and training of the extracted features, generating a NET file; the testing path takes the test speech input through the same speech preprocessing and feature extraction stages into the classifier, which uses the trained network (NET) to produce the result.]

Fig. 1 Block diagram of SER system


4 Proposed Work – Preprocessing

For stereo-recorded audio files, conversion to a mono track is necessary. This can be done by performing element-wise addition across the channels and averaging the speech values. If the number of recording channels is N and the length of the recording is M, then

y(m) = (1/N) · Σ_{n=1}^{N} S_n(m),   m = 0, 1, …, M

The speech signal value varies from 0 to 255; a normalization operation brings it within the range 0–1. This makes the algorithm independent of the absolute level of the speech signal. Noise removal is achieved using a thresholding technique: any signal value below a certain level is considered to be noise.
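A minimal MATLAB sketch of these preprocessing steps is shown below. The file name and the noise threshold are illustrative assumptions, not values reported in the paper.

% Preprocessing sketch: channel averaging, normalization, noise gating.
[s, fs] = audioread('input.wav');  % placeholder file; samples in [-1, 1]
y = mean(s, 2);                    % element-wise average over N channels (mono)
y = y / max(abs(y));               % normalize, analogous to mapping 0-255 to 0-1
noiseTh = 0.05;                    % assumed noise threshold
y(abs(y) < noiseTh) = 0;           % samples below the threshold treated as noise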

Algorithm for separating voiced/unvoiced sound:
1. Frame the signal into short, non-overlapping frames.
2. Apply a Hamming window to all the frames.
3. Calculate the short-time energy of the output generated in step 2.
4. Calculate the zero-crossing rate of the frames generated in step 1.
5. If (zero-crossing rate < threshold) and (short-time energy > threshold), the frame is a voiced segment; otherwise, it is an unvoiced segment.
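A MATLAB sketch of this procedure (saved as voiced_frames.m) is given below. The frame length and the two thresholds are illustrative assumptions; the paper does not report its settings. The Hamming window is computed explicitly so no toolbox is required.

% Voiced/unvoiced separation sketch following steps 1-5 above.
function voiced = voiced_frames(x, fs)
N = round(0.025 * fs);                       % 25 ms non-overlapping frames (step 1)
w = 0.54 - 0.46 * cos(2*pi*(0:N-1)'/(N-1));  % Hamming window (step 2)
numFrames = floor(length(x) / N);
voiced = false(numFrames, 1);
eTh = 0.01; zTh = 0.1;                       % assumed energy/ZCR thresholds
for k = 1:numFrames
    f = x((k-1)*N+1 : k*N) .* w;             % windowed frame
    ste = sum(f.^2) / N;                     % short-time energy (step 3)
    zcr = sum(abs(diff(sign(f)))) / (2*N);   % zero-crossing rate (step 4)
    voiced(k) = (zcr < zTh) && (ste > eTh);  % decision rule (step 5)
end
end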

5 Proposed Work – Feature Extraction

The proposed work extracts four features of the speech signal and uses them for training, thus presenting a multimodel approach to emotion detection.

a. Mel-Frequency Cepstral Coefficients (MFCC):

MFCC provides a short-term power spectrum representation. MFCC uses the nonlinear Mel scale: it computes a (linear) cosine transform of the log power spectrum. The advantage of using MFCC is a better representation of the speech wave that matches the response of the human auditory system [34].


MFCC Algorithm:
1. Frame the signal into short frames.
2. Split the speech buffer into separate segments.
3. Calculate the power spectrum by periodogram estimation of each and every segment.
4. To calculate the mel frequency, use Eq. (1) [16].
5. mel(f) = 2595 · log10(1 + f/700)    (1)
6. Apply the mel filter bank to the power spectrum.
7. Calculate the energy of each filter and sum it up.
8. Calculate the log of the energies of all the filter banks.
9. Use the output of step 8 to calculate the Discrete Cosine Transform (DCT).
10. Retain coefficients 2 to 13 of the DCT.
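The sketch below illustrates these steps in MATLAB for a single frame. The FFT size, the number of mel filters, and the simplified triangular filter bank are assumptions for illustration; the paper does not publish its implementation.

% Single-frame MFCC sketch following the algorithm above.
function c = mfcc_frame(frame, fs)
NFFT = 512; numFilt = 26;                    % assumed analysis parameters
P = abs(fft(frame, NFFT)).^2 / NFFT;         % periodogram estimate (step 3)
P = P(1:NFFT/2+1);
mel  = @(f) 2595 * log10(1 + f/700);         % Eq. (1), forward mel mapping
imel = @(m) 700 * (10.^(m/2595) - 1);        % inverse mel mapping
edges = imel(linspace(mel(0), mel(fs/2), numFilt+2));  % mel-spaced band edges
bins  = floor((NFFT+1) * edges / fs) + 1;
E = zeros(numFilt, 1);
for k = 1:numFilt                            % triangular mel filter bank (step 6)
    for j = bins(k):bins(k+2)
        if j <= bins(k+1)
            wgt = (j - bins(k)) / max(bins(k+1) - bins(k), 1);
        else
            wgt = (bins(k+2) - j) / max(bins(k+2) - bins(k+1), 1);
        end
        E(k) = E(k) + wgt * P(j);            % accumulate filter energies (step 7)
    end
end
logE = log(E + eps);                         % log filter-bank energies (step 8)
n = (0:numFilt-1)'; m = 0:numFilt-1;
D = cos(pi * (n + 0.5) * m / numFilt);       % DCT-II basis (step 9)
c = (logE' * D)';
c = c(2:13);                                 % retain coefficients 2-13 (step 10)
end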

In the proposed scheme, the MFCC features of all the different speakers with a variety of emotions are extracted.

b. Vocal Tract Frequency:

The vocal tract (VT) of humans is typically 17–18 cm long. Considering the VT as a closed cylinder, the typical frequency generated is 500 Hz. This prediction leads to a set of formant frequencies, i.e., 0.5, 1.5, and 2.5 kHz. These frequencies change as the articulators produce vowel sounds. Male-voice formant frequencies are: first formant 0.15–0.85 kHz, second formant 0.5–2.5 kHz, third formant 1.5–3.5 kHz, and fourth formant 2.5–4.8 kHz.
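The paper does not specify how the vocal tract (formant) frequencies are estimated; a common approach is linear-predictive (LPC) root-finding, sketched below under the assumption that MATLAB's Signal Processing Toolbox (lpc, hamming) is available.

% Formant-frequency sketch via LPC root-finding (one voiced frame).
function F = formants(frame, fs)
a = lpc(frame(:) .* hamming(length(frame)), 2 + round(fs/1000));  % LPC model
r = roots(a);
r = r(imag(r) > 0.01);            % keep one root per complex-conjugate pair
F = sort(angle(r) * fs / (2*pi)); % convert pole angles to frequencies in Hz
F = F(F > 90);                    % discard spurious very-low-frequency poles
end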

c. Pitch:

Pitch is the main acoustic correlate of tone and intonation. Pitch corresponds to the number of vibrations produced by the vocal folds per second. The ear perceives pitch as the highness or lowness of a tone.
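A simple autocorrelation-based pitch estimate is sketched below. The search range of 60–400 Hz is an assumed band covering typical speech, and xcorr assumes the Signal Processing Toolbox.

% Pitch (F0) sketch via the normalized autocorrelation of one voiced frame.
function f0 = pitch_autocorr(frame, fs)
frame = frame(:) - mean(frame(:));
r = xcorr(frame, 'coeff');               % normalized autocorrelation
r = r(length(frame):end);                % keep non-negative lags only
lo = round(fs/400); hi = round(fs/60);   % assumed 60-400 Hz pitch range
[~, idx] = max(r(lo:hi));                % strongest peak within the range
f0 = fs / (lo + idx - 1);                % convert lag back to Hz
end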

d. Energy

The Teager energy operator (Eq. 2) [30] is used to calculate the energy of the speech signal.

Ψ[S(n)] = S²(n) − S(n − 1) · S(n + 1)    (2)
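A minimal, toolbox-free MATLAB sketch of Eq. (2), applied sample-wise to a speech vector, follows.

% Teager energy operator of Eq. (2) for a speech vector s.
function psi = teager_energy(s)
s = s(:);
psi = s(2:end-1).^2 - s(1:end-2) .* s(3:end);  % S^2(n) - S(n-1)*S(n+1)
end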


Table 1 ANN classifier pseudocode

6 Proposed Work – Artificial Neural Network (ANN)

After feature extraction, a matrix is created for the different types of emotions expressed by the different speakers. These features are trained using an Artificial Neural Network (ANN), a pattern recognition technique; the ANN is also a machine learning algorithm [35]. Neural network learning is modeled on lines similar to biological neurons. The structure of an ANN comprises input, hidden layers, and output, and each stage has a number of interconnected nodes. The training phase generates a .net structure, which is then used for testing system performance. Pseudocode for the ANN is shown in Table 1.
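Since Table 1 gives only pseudocode, a hedged MATLAB sketch of the training stage is shown below. It assumes the Neural Network Toolbox (the paper reports using nftool); patternnet is used here for the classification framing, and the layer size, split ratios, and variable names are illustrative assumptions.

% ANN training sketch: features (4 x 1800) and one-hot labels (5 x 1800).
features = rand(4, 1800);                    % placeholder feature matrix
labels   = full(ind2vec(randi(5, 1, 1800))); % placeholder emotion targets
net = patternnet(10);                        % one hidden layer of 10 neurons
net.divideParam.trainRatio = 0.70;           % train/validation/test split
net.divideParam.valRatio   = 0.15;
net.divideParam.testRatio  = 0.15;
net = train(net, features, labels);          % backpropagation training
save('emotion_net.mat', 'net');              % reused later in the testing phase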

7 Results

For implementation purposes, MATLAB (2014b) is used. Figure 2 shows the actual speech signal plot. Figure 3 shows the output after removing the unvoiced or silent parts; only the voiced part is considered for the feature extraction process.

Feature extraction is the base for the classifier. The multimodel approach is used to make the system robust and reliable. 30% of the database is used for training


Fig. 2 Speech signal

Fig. 3 Voiced part of speech signal

purposes. Before initiating training, it was observed that the extracted features show variations based on emotion. Figure 4 presents 3D graphs in which the X-axis represents the average number of speakers, the Y-axis the different features (MFCC, vocal tract, pitch, and energy), and the Z-axis the different emotions.

The feature database set was trained using 'nftool' in MATLAB. Out of 6000 speech files, 30%, i.e., 1800 speech files (360 files per emotion), were used for training. The trained model is then used for testing on the database. Figure 5 depicts the performance graph.

Refer to Fig. 4. The plots show that the MFCC features of the emotions have very little overlap and are hence distinguishable. The stress-related emotions, viz. sad and angry, are distinct compared to happy, neutral, and surprise in all the features. In the sad condition, pitch and energy are both low; in the angry emotion, VT, pitch, and energy are all at their maximum levels.


Fig. 4 a MFCC plot. b Vocal tract frequency plot. c Pitch plot. d Energy plot

Fig. 5 Performance plot


Table 2 ANN classifier performance analysis (emotion recognition %)

Emotion    Sad   Angry   Surprise   Happy   Neutral
Sad         87      0         0        0        13
Angry        0     80        12        8         0
Surprise     0      6        85        9         0
Happy        0      7        10       75         8
Neutral      6      0         0        4        90

As shown, the validation and test curves are close to the best performance. Accuracy is calculated using the following equation:

Accuracy = (Number of correct emotions detected / Total number of samples of that emotion) × 100

Table 2 depicts the overall results.
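The headline 83–84% figure for the negative emotions follows directly from Table 2, as the short sketch below shows. Since each row of Table 2 already sums to 100, the diagonal entries are the per-emotion accuracies.

% Reproducing the negative-emotion average from the Table 2 percentages.
C = [87  0  0  0 13;   % sad
      0 80 12  8  0;   % angry
      0  6 85  9  0;   % surprise
      0  7 10 75  8;   % happy
      6  0  0  4 90];  % neutral
acc = diag(C);              % per-emotion recognition accuracy (%)
negAvg = mean(acc(1:2));    % sad/angry average = 83.5, i.e., the 83-84% result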

8 Conclusion

Database creation is the base of the Marathi-language emotion detection system. Diversity in database creation is the need of the day: in India, pronunciation of the same language differs from zone to zone, and covering all these zones was a great challenge. A multimodel approach to feature extraction is suggested in this research work, with an ANN as the classifier. Stress typically introduces negative emotions in the human mind, and depression and anxiety may lead to sad/angry emotions. If the average performance accuracy over these two emotions is considered, it is around 83–84%.

The present multimodel approach deals with four sets of features. It may be further extended by adding voice-quality features such as jitter and shimmer.

References

1. Darwin, C.: The Expression of Emotion in Man and Animals. John Murray, London (1872)
2. Blanton, S.: The voice and the emotions. Q. J. Speech 1(2), 154–172 (1915)
3. Fairbanks, G., Pronovost, W.: Vocal pitch during simulated emotion. Science 88(2286), 382–383 (1938)
4. Soskin, W.F., Kauffman, P.E.: Judgment of emotion in word-free voice samples. J. Commun. 11(2), 73–80 (1961)
5. Kramer, E.: Elimination of verbal cues in judgments of emotion from voice. J. Abnorm. Soc. Psychol. 68(4), 390 (1964)
6. Bozkurt, O.O., Taygi, Z.C.: Audio-based gender and age identification. In: 22nd Signal Processing and Communications Applications Conference (2014). https://doi.org/10.1109/siu.2014.6830493
7. McCarthy, J.: What is artificial intelligence? (2007). http://wwwformal.stanford.edu/jmc/whatisai.html
8. Higginbotham, A.: Welcome to Rosalind Picard's touchy-feely world of empathic tech (2011). http://www.wired.co.uklmagazine/archive/2012/11(features/emotion-machines)
9. Calvo, R.A., D'Mello, S.: IEEE Trans. Affect. Comput. 1(1), 18–37. IEEE Computer Society Press, Los Alamitos, CA, USA (2010)
10. El Ayadi, M., Kamel, M.S., Karray, F.: Survey on speech emotion recognition: features, classification schemes, and databases. Pattern Recognit. (2011). https://doi.org/10.1016/j.patcog.2010.09.020
11. Cowie, R., Cornelius, R.R.: Describing the emotional states that are expressed in speech. Speech Commun. 40, 5–32 (2003)
12. Ekman, P.: An argument for basic emotions, pp. 169–200 (2008). https://doi.org/10.1080/02699939208411068
13. Kostoulas: Emotion recognition from speech using digital signal processing (2012)
14. López-de-Ipiña, K., Alonso, J.-B., Travieso, C.M., Solé-Casals, J., Egiraun, H., Faundez-Zanuy, M., Ezeiza, A., Barroso, N., Ecay-Torres, M., Martinez-Lage, P., Martinez, U.: On the selection of non-invasive methods based on speech analysis oriented to automatic Alzheimer disease diagnosis. Sensors. ISSN 1424-8220. www.mdpi.com/journal/sensors
15. https://www.everydayhealth.com/wellness/united-states-of-stress/
16. Guo, B.: Creating Personal, Social, and Urban Awareness Through Pervasive Computing. Information Science Reference, Hershey, PA (2014)
17. Schuller, B., Rigoll, G., Lang, M.: Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture. In: Proceedings of ICASSP 2004, vol. 1 (2004)
18. Williams, Stevens: Robust Emotion Recognition Using Spectral and Prosodic Features. Springer (2013)
19. Benesty, J., Sondhi, M.M., Huang, Y.: Handbook of Speech Processing. Springer (2008)
20. Rao, K.S., Koolagudi, S.G.: Emotion Recognition Using Speech Features. Springer (2013)
21. Abdelwahab, M., Busso, C.: Supervised domain adaptation for emotion recognition from speech. IEEE (2015). ISSN 1520-6149
22. Jin, Q., Li, C., Chen, S., Wu, H.: Speech emotion recognition with acoustic and lexical features. IEEE (2015). ISSN 1520-6149
23. Garg, E., Bahl, M.: Emotion recognition in speech using gammatone cepstral coefficients. Int. J. Appl. Innov. Eng. Manag. (IJAIEM) 3(10) (2014). ISSN 2319-4847
24. Wang, J.-C., Chin, Y.-H., Chen, B.-W., Lin, C.-H., Wu, C.-H.: Speech emotion verification using emotion variance modeling and discriminant scale frequency maps. IEEE Trans. Speech Lang. Process. 23(10), 1552–1562 (2015)
25. Hasrul, M.: Human affective (emotion) behaviour analysis using speech signals: a review. In: 2012 International Conference on Biomedical Engineering (ICoBE 2012), pp. 27–28 (2012)
26. Shrishrimal, P.P., Deshmukh, R.R., Waghmare, V.B.: Indian language speech database: a review. Int. J. Comput. Appl. 47(5), 17–21 (2012)
27. Waghmare, V.B., Deshmukh, R.R., Shrishrimal, P.P., Janvale, G.B.: Development of isolated Marathi words emotional speech database. Int. J. Comput. Appl. 94(4), 19–22 (2014). ISSN 0975-8887
28. Pahune, S., Mishra, N.: Emotion recognition through combination of speech and image processing. Int. J. Recent Innov. Trends Comput. Commun. (2015). ISSN 2321-8169
29. Nayak, B., Madhusmita, M., Sahu, D.K.: Speech emotion recognition using different centred GMM. 3(9) (2013). ISSN 2277-128X
30. El Ayadi, M., Kamel, M.S., Karray, F.: Survey on speech emotion recognition: features, classification schemes, and databases. Pattern Recogn. 44, 572–587 (2011)
31. http://www.phon.ucl.ac.uk/courses/spsci/expphon/week9.php
32. Elissa, K.: Input processing for cross language information access (1991–92). ISSN 0972-645
33. Eyben, F., Scherer, K., Schuller, B., Sundberg, J., Andre, E., Busso, C., Devillers, L., Epps, J., Laukka, P., Narayanan, S., Truong, K.: The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Trans. Affect. Comput. (2015)
34. Waghmare, V.B., Deshmukh, R.R., Shrishrimal, P.P., Janvale, G.B.: Emotion recognition system from artificial Marathi speech using MFCC and LDA techniques. Elsevier (2014)
35. http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/