
Global Journal of Advanced Engineering Technologies, Vol. 3, Issue 4, 2014. ISSN (Online): 2277-6370 & ISSN (Print): 2394-0921


GFCC BASED FEATURE EXTRACTION FOR THE CLASSIFICATION OF EMOTIONS IN SPEECH

Gurpreet Kaur1, Abhilash Sharma2

1 Research Scholar, Department of CSE, RIMT-IET, Mandi Gobindgarh, Fatehgarh Sahib, Punjab, India.
2 Assistant Professor, Department of CSE, RIMT-IET, Mandi Gobindgarh, Fatehgarh Sahib, Punjab, India.

Abstract: An emotion recognition system automatically identifies the emotional state of a speaker. Many classifiers exist for the classification of emotions in speech; this paper focuses on the method used to extract features from the speech signal. GFCCs (Gammatone Frequency Cepstral Coefficients) are extracted as features, and the BFO (Bacterial Foraging Optimization) algorithm is used to optimize the feature values. On the basis of the GFCC features, the emotions (sad, joy, aggressive, fear, surprise) are classified using a Back Propagation Neural Network. The whole simulation is carried out in the MATLAB environment. The results of the proposed technique are compared with the MFCC-and-BPNN technique and show a higher recognition rate.

Keywords: Emotion Recognition, GFCC, BFO, Back Propagation, Neural Networks, Human Computer Interface.

I. INTRODUCTION
Automatic emotion recognition from the speech signal is a difficult task for a computer: humans understand emotions naturally, but machines do not. Building automatic emotion recognition systems was a major challenge for researchers, and making these systems more and more accurate remains one; real-time emotion detection has therefore been an active research area for the last two decades. Various pieces of data, commonly known as features, are extracted from the speech signal, and emotions can be recognized on the basis of these features. Speech emotion recognition has several applications, such as call-center management, commercial products, life-support systems, virtual guides, customer service, lie detectors, conference-room research, emotional speech synthesis, art, and entertainment. In a generalized way, a speech emotion recognition system is an application of speech processing in which the patterns of derived speech features (e.g., MFCC, pitch) are mapped by a classifier (e.g., HMM) during training and testing sessions using pattern recognition algorithms, so that emotions can be detected from their corresponding patterns. The accuracy of a speech emotion recognition system depends on both the feature extraction and the classification; a neural classifier's accuracy depends on the number of inputs, i.e., the features given to it. A typical emotion recognition system has four main stages: signal preprocessing, feature extraction, feature preparation or selection, and classification. The

audio speech signals are the input to the system. These may also contain noise, which must be removed for better feature extraction; noise removal is done in the preprocessing stage. After this, the features that differentiate between the emotions are extracted from the input speech. Feature optimization algorithms are then applied to obtain feature vectors, and these feature vectors are presented to the classifier in the training and testing sessions. The main aim of this research is to bring humans and computers closer together by recognizing mood swings. It has various applications such as lie detectors, military systems, call centers, human-robot interfaces, psychiatric aid, etc. [1]. This paper is organized as follows: Section II discusses related work, Section III presents the proposed methodology, Section IV presents the results, and finally Section V contains the conclusion.

II. RELATED WORK
Much research has been done on extracting different features from speech and then classifying them with different classifiers; the most commonly used features are MFCCs. This section surveys various classification techniques in combination with feature extraction techniques. A. Milton et al. [1] presented an SVM-based technique for the classification of emotions in speech. They use a 3-stage Support Vector Machine classifier to classify seven different emotions present in the Berlin Emotional Database, extracting MFCC features for the purpose of classification. Balaji Vasan Srinivasan et al. [2] extract 57 mel-frequency cepstral coefficient (MFCC) features and classify them with kernel partial least squares (KPLS) used for discriminative training in the i-vector space; they gain an 8.4% relative performance improvement in terms of EER. Martinez, J. et al. [3] extract mel-frequency cepstral coefficient (MFCC) features and classify them with a vector quantization technique, obtaining 100% precision on a database of 10 speakers. Vipul Garg et al. [4] proposed speech-based emotion recognition based on a hierarchical decision tree with SVM, BLG and SVR classifiers, achieving 83% recognition for a closed set. Sujata Pathak et al. [5] proposed a linear prediction coefficients (LPC) and neural network (NN) based technique for emotion recognition; the accuracy rate was 46%. N. Murali Krishna et al. [6] proposed emotion recognition using a dynamic time warping technique for


isolated words. They use MFCC, delta coefficients (ΔMFCC) and delta-delta coefficients (ΔΔMFCC) with dynamic time warping (DTW) and SVM classifiers; the recognition rate was above 78%. Emily Mower et al. proposed a framework for automatic human emotion classification using emotion profiles. They use mel filterbank cepstral coefficients (MFCCs) as features and an EP-SVM for classification; the overall accuracy rate was 68.2%. J. Sirisha Devi et al. [7] present text-dependent speaker recognition with an enhancement of first detecting the emotion of the speaker, using hybrid FFBNN and GMM methods. Firoz Shah et al. [8] proposed a novel feature extraction method based on Linear Predictive Coefficients (LPC) and Mel Frequency Cepstral Coefficients (MFCC) for emotion recognition from speech. Classification and recognition of the features was done by an ANN; Malayalam (one of the South Indian languages) words were used for the experiment, and a recognition accuracy of 79% was achieved. Marius Vasile Ghiurcau, Corneliu Rusu and Jaakko Astola presented a system to study the effect of emotional state on text-independent speaker identification using MFCC and GMM. Wouter Gevaert et al. investigated classification performance for speech recognition using two standard neural network structures as classifiers: a Feed-Forward Neural Network (FFNN) trained with the back-propagation algorithm, and Radial Basis Function Neural Networks. Jia-Ming Liu et al. [9] classify cough signals of patients, taking GTCC as the features of the cough signal. They compare the GTCC features with MFCC features of the cough signal, using a weighted SVM as the classifier, and showed that the GTCC features surpassed the MFCC features. Jagvir Kaur et al. [10] presented a speaker emotion recognition system. They extract MFCC features and classify them with a back-propagation neural network, classifying four emotions: sad, happy, angry and fear.

III. METHODOLOGY

Our proposed research methodology has two phases, training and testing, with different sequential steps performed under each. Speech signals of different speakers with 5 emotions are collected. The flow chart of the research methodology (Figure 1) describes the two sections, training and testing. In the training section the system is trained for the classification of the emotions; after training, the system is tested to validate it, which determines the accuracy of the emotion recognition system. Training section: the following steps are performed in this section.

Figure 1: Flow Chart of the Research Methodology

Step 1: Upload the wave file of the speech signal from the database.
Step 2: Perform preprocessing. Preprocessing is the process of removing noise from the speech signals: since the speech signals in our database were recorded with a standard microphone, they may also contain noise.
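As a minimal MATLAB sketch of Steps 1 and 2: the file name emotion01.wav is purely illustrative, and since the paper does not specify the exact denoising method, a simple pre-emphasis filter plus peak normalization stands in for the preprocessing stage here.

% Step 1: load a speech file from the database (file name is illustrative)
[x, fs] = audioread('emotion01.wav');   % x: samples, fs: sampling rate in Hz
x = mean(x, 2);                         % collapse stereo to mono if needed

% Step 2: simple preprocessing -- pre-emphasis boosts the high frequencies
% that carry formant detail, and normalization removes level differences.
x = filter([1 -0.97], 1, x);            % pre-emphasis, coefficient 0.97
x = x / (max(abs(x)) + eps);            % peak-normalize to [-1, 1]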

Step 3: GFCC Feature Extraction
Windowing: The original audio signals are windowed into frames. The default frame length is 25 ms and the default overlap is 10 ms. A Hamming window is used as the window shape, considering the next block in the feature-extraction processing chain and integrating all the closest frequency lines.
Fast Fourier Transform (FFT): After windowing, an FFT (Fast Fourier Transform) is applied to each frame to analyze its spectrum. The tool for extracting spectral information in discrete frequency bands from a discrete-time signal is the Discrete Fourier Transform (DFT).
Gammatone Filter Bank: A gammatone filter bank, a group of filters, is then applied to each Fourier-transformed frame. The impulse response of a gammatone filter is similar to the magnitude characteristics of a human auditory filter; the auditory filter's bandwidth is the ERB (Equivalent Rectangular Bandwidth) value centered at the filter's center frequency.
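For reference, the standard gammatone impulse response and the Glasberg-Moore ERB bandwidth it is built on can be written as

g(t) = a\, t^{n-1} e^{-2\pi b t} \cos(2\pi f_c t + \phi), \qquad b = 1.019\, \mathrm{ERB}(f_c), \qquad \mathrm{ERB}(f) = 24.7 \left( 4.37\, \frac{f}{1000} + 1 \right),

where n is the filter order (typically 4) and f_c is the center frequency.

The following MATLAB sketch covers the windowing and FFT steps, continuing from the preprocessed signal x above. The 10 ms value is read here as the frame shift, and nfft and the explicit Hamming formula are implementation choices rather than settings stated in the paper.

frameLen = round(0.025 * fs);                        % 25 ms frames
hop      = round(0.010 * fs);                        % 10 ms frame shift
n        = (0:frameLen-1)';
win      = 0.54 - 0.46 * cos(2*pi*n/(frameLen-1));   % Hamming window (no toolbox needed)
nFrames  = 1 + floor((length(x) - frameLen) / hop);
nfft     = 2^nextpow2(frameLen);
spec     = zeros(nfft/2 + 1, nFrames);               % one magnitude spectrum per frame
for i = 1:nFrames
    idx = (i-1)*hop + (1:frameLen);
    seg = x(idx) .* win;                             % windowed frame
    X   = fft(seg, nfft);                            % short-time spectrum of the frame
    spec(:, i) = abs(X(1:nfft/2 + 1));               % keep non-negative frequencies only
end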


Discrete Cosine Transform: Finally, a discrete cosine transform is applied to model human loudness perception and to decorrelate the logarithmically compressed filter outputs. After applying the DCT we obtain the GFCC (Gammatone Frequency Cepstral Coefficient) features.
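A hedged sketch of the remaining GFCC steps follows, applying a gammatone-style filter bank to the spectra computed above and then log compression and a DCT. The channel count (32), coefficient count (13), frequency range, and the 4th-order magnitude approximation are illustrative assumptions, not settings reported in the paper.

nChan = 32;  nCep = 13;                          % illustrative sizes
fLow  = 50;  fHigh = fs/2;                       % analysis band
erb        = @(f) 24.7 * (4.37*f/1000 + 1);      % ERB bandwidth (Glasberg-Moore)
erbRate    = @(f) 21.4 * log10(4.37*f/1000 + 1); % ERB-rate scale
invErbRate = @(e) (10.^(e/21.4) - 1) * 1000/4.37;
fc = invErbRate(linspace(erbRate(fLow), erbRate(fHigh), nChan));  % ERB-spaced centers

f = (0:nfft/2) * fs / nfft;                      % FFT bin frequencies
W = zeros(nChan, numel(f));
for c = 1:nChan
    b = 1.019 * erb(fc(c));                      % channel bandwidth
    W(c, :) = (1 + ((f - fc(c)) / b).^2) .^ (-2);  % approx. 4th-order gammatone magnitude
end

fbEnergy = W * spec;                             % filter-bank energies, one column per frame
% DCT-II written out explicitly so the sketch needs no toolbox dct():
k = (0:nCep-1)';  m = 0:nChan-1;
D = cos(pi * k * (m + 0.5) / nChan);
gfcc = D * log(fbEnergy + eps);                  % log compression + DCT -> GFCC matrix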

Figure 2: GFCC feature extraction

The features are stored in vectors, and these vectors are stored in the MATLAB database.

Step 4: Classification. We designed the network with three layers: an input layer, a hidden layer, and an output layer. A back-propagation neural network works in the reverse direction, i.e., it propagates errors backward and updates the weights accordingly. The pre-processed feature vectors are presented at the input layer, and the error is computed at the output layer; any error found at the output layer is propagated back toward the input layer. This process repeats, forming an iterative procedure. At the end of every iteration, test patterns are presented to the neural network and its prediction performance is evaluated.
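As a sketch of this training loop, the following uses MATLAB's patternnet from the Neural Network Toolbox. The hidden-layer size, the training function, and the features/labels variables are illustrative assumptions, since the paper does not report the exact network configuration.

% features: nCep-by-N matrix of (optimized) GFCC vectors, one column per utterance
% labels:   1-by-N vector of class indices 1..5 (sad, joy, aggressive, fear, surprise)
targets = full(ind2vec(labels));        % 5-by-N one-hot target matrix

net = patternnet(20);                   % single hidden layer of 20 neurons (illustrative)
net.trainFcn = 'traingdx';              % gradient-descent back-propagation variant
net = train(net, features, targets);    % back-propagates errors, updates weights per epoch

pred = net(features);                   % forward pass (here on the training data)
[~, predClass] = max(pred, [], 1);      % predicted emotion index per utterance
fprintf('Training accuracy: %.2f%%\n', 100 * mean(predClass == labels));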

Testing section:
Step 1: The wave file is uploaded.
Step 2: Features are extracted from the wave signal using GFCC. After extracting the GFCC features, the optimal features are selected. For this purpose the Bacterial Foraging Optimization (BFO) algorithm is applied, as it increases the efficiency of the neural network; we have applied BFO for feature optimization in the emotion recognition system.
Step 3: The emotion is classified using the BPNN classifier.
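The paper does not give its BFO settings, so the following is a minimal chemotaxis-only sketch of Bacterial Foraging Optimization over a weighting of the GFCC features (the reproduction and elimination-dispersal steps of full BFO are omitted). The population size, step size, and the cost-function handle costFn, e.g. the BPNN's validation error on re-weighted features, are all illustrative assumptions.

S = 10;  D = nCep;  Nc = 20;  Ns = 4;  step = 0.05;   % illustrative parameters
P = rand(S, D);                            % initial population of feature-weight vectors
J = zeros(S, 1);
for i = 1:S, J(i) = costFn(P(i, :)); end   % initial fitness (lower is better)

for c = 1:Nc                               % chemotactic steps
    for i = 1:S
        dir = randn(1, D);  dir = dir / norm(dir);   % random tumble direction
        m = 0;
        while m < Ns                       % swim while the move keeps improving fitness
            cand = P(i, :) + step * dir;
            Jc = costFn(cand);
            if Jc < J(i)
                P(i, :) = cand;  J(i) = Jc;  m = m + 1;
            else
                break;                     % stop swimming, move to the next bacterium
            end
        end
    end
end
[~, best] = min(J);
wOpt = P(best, :);                         % best weighting found for the GFCC features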

IV. RESULTS

Emotion       SNR       MSE        Accuracy (%)
SAD           46.348    0.404226   98.328
JOY           40.370    0.45305    98.310
AGGRESSIVE    48.33     0.490486   98.4428
FEAR          46.07     0.40367    99.0572
SURPRISE      49.83968  0.58236    97.389

Figure 3: Table showing the average SNR, MSE, and accuracy of the emotions over 10 tested speech signals for each category

Figure 4: Graph showing the SNR w.r.t. number of files

Figure 5: Graph showing the average MSE w.r.t. number of files of speech signals

[Figure 4 plot: average signal-to-noise ratio (0-50) vs. number of files (0-10); series: Aggressive, Fear, Joy, Sad, Surprise.]

[Figure 5 plot: average mean square error (0-0.7) vs. number of files (0-10); series: Aggressive, Fear, Joy, Sad, Surprise.]


Figure 6: Graph showing the average accuracy for each emotion over 10 speech signals per category

Category      SAD   JOY   AGGRESSIVE   FEAR   SURPRISE
SAD             8     0        0          2       0
JOY             0     8        1          0       1
AGGRESSIVE      0     0        9          0       1
FEAR            0     0        0         10       0
SURPRISE        0     1        1          1       7

Figure 7: Confusion matrix for the recognition rate of the tested signals

Emotion          Sad      Joy      Aggressive   Fear
MFCC and BPNN    97       98       98           99
GFCC and BPNN    98.328   98.810   98.5428      99.0572

Figure 8: Table showing the comparison of the proposed system with the MFCC/BPNN technique

Figure 9: Graph showing the comparison between the two systems

V. CONCLUSIONS
In this paper we have focused on emotion detection in speech. We applied the GFCC algorithm to extract the features of the speech signal, and BFO was applied for feature reduction and optimization. Five emotions have been recognized, and the proposed system recognizes emotion in speech at a higher rate than the MFCC/BPNN technique. Classification of the emotions in the proposed system was done by a back-propagation neural network, which shows very good accuracy.

VI. ACKNOWLEDGEMENT
I express my sincere gratitude to my guide, Mr. Abhilash Sharma, and to the Head of the CSE department, Dr. Anuj Gupta, for their valuable guidance and advice. I would also like to thank all the people who gave their heartfelt support in making the completion of this work a magnificent experience.

REFERENCES
1) A. Milton, S. Sharmy Roy, S. Tamil Selvi, "SVM Scheme for Speech Emotion Recognition using MFCC Feature", International Journal of Computer Applications (0975-8887), Vol. 69, No. 9, May 2013, pp. 34-39.
2) Balaji Vasan Srinivasan, Yuancheng Luo, Daniel Garcia-Romero, Dmitry N. Zotkin, and Ramani Duraiswami, "A Symmetric Kernel Partial Least Squares Framework for Speaker Recognition", IEEE Transactions on Audio, Speech, and Language Processing, Vol. 21, No. 7, July 2013.
3) J. Martinez, H. Perez, E. Escamilla, M.M. Suzuki, "Speaker Recognition using Mel Frequency Cepstral Coefficients (MFCC) and Vector Quantization (VQ) Techniques", 22nd International Conference on Electrical Communications and Computers (CONIELECOMP), 2012.
4) Vipul Garg, Harsh Kumar, Rohit Sinha, "Speech based Emotion Recognition based on Hierarchical Decision Tree with SVM, BLG and SVR Classifiers", National Conference on Communications (NCC), 2013.
5) Sujata Pathak, Arun Kulkarni, "Recognizing Emotions from Speech", IEEE, 2011.
6) N. Murali Krishna, P.V. Lakshmi, Y. Srinivas, J. Sirisha Devi, "Emotion Recognition using Dynamic Time Warping Technique for Isolated Words", IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 5, No. 1, September 2011.
7) J. Sirisha Devi, Y. Srinivas, Siva Prasad Nandyala, "Automatic Speech Emotion and Speaker Recognition based on Hybrid GMM and FFBNN", International Journal on Computational Sciences & Applications (IJCSA), Vol. 4, No. 1, February 2014.


8) Firoz Shah A., Vimal Krishnan V.R., Babu Anto P., "Emotion Recognition from Malayalam Words Using Artificial Neural Networks", 2009 IEEE International Advance Computing Conference (IACC 2009), Patiala, India, 6-7 March 2009, pp. 3717-3719.
9) Jia-Ming Liu, Mingyu You, Guo-Zheng Li, Zheng Wang, Xianghuai Xu, Zhongmin Qiu, Wenjia Xie, Chao An, Sili Chen, "Cough Signal Recognition with Gammatone Cepstral Coefficients", 2013 IEEE China Summit and International Conference on Signal and Information Processing, 6-7 July 2013.
10) Jagvir Kaur, Abhilash Sharma, "A Review of Automatic Speech Emotion Recognition", International Journal of Advanced and Innovative Research (2278-7844), Volume 3, Issue 4, p. 308.
11) N. Murali Krishna, P.V. Lakshmi, Y. Srinivas, J. Sirisha Devi, "Emotion Recognition using Dynamic Time Warping Technique for Isolated Words", IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 5, No. 1, September 2011.
12) Krishna Mohan Kudiri, Gyanendra K. Verma, Bakul Gohel, "Relative Amplitude Based Feature for Emotion Detection from Speech", IEEE, 2010.
13) David A. van Leeuwen, Rahim Saeidi, "Knowing the Non-Target Speakers: The Effect of the I-Vector Population for PLDA Training in Speaker Recognition", ICASSP 2013.
14) Akshay S. Utane, S.L. Nalbalwar, "Emotion Recognition through Speech Using Gaussian Mixture Model and Support Vector Machine", International Journal of Scientific & Engineering Research, Volume 4, Issue 5, May 2013.
15) Lawrence Rabiner, Biing-Hwang Juang, B. Yegnanarayana, "Fundamentals of Speech Recognition", Prentice-Hall, Englewood Cliffs, 2009.

AUTHORS PROFILE

Gurpreet Kaur is a research scholar in the Department of Computer Science and Engineering. She is currently pursuing her M.Tech in Computer Science and Engineering at RIMT-IET, Mandi Gobindgarh, Punjab.

She did her B.Tech in Computer Science and Engineering in 2012 from BBSBEC, Fatehgarh Sahib. She has written two research papers which are published in international journals. Her main research interests are in Human Computer Interfaces.

Abhilash Sharma, a highly qualified Assistant Professor with 7 years of teaching experience, holds a master's degree in computer science. He is currently serving as an Assistant Professor in the Computer Science department of RIMT-IET, Mandi Gobindgarh, Punjab.

He has written many technical papers and is a regular contributor to industry periodicals. His current research focuses primarily on the application of mobile ad-hoc networks.