
MODIFIED MEL-FREQUENCY CEPSTRAL COEFFICIENT

Goutam Saha
Department of Electronics and Electrical Communication Engineering
Indian Institute of Technology, Kharagpur
Kharagpur-721302, West Bengal, India.

email:[email protected]

Ulla S. Yadhunandan
Department of Electronics and Electrical Communication Engineering
Indian Institute of Technology, Kharagpur
Kharagpur-721302, West Bengal, India.

email:[email protected]

ABSTRACT

A new modification of the Mel-Frequency Cepstral Coefficient (MFCC) feature is proposed for the extraction of speech features for Speaker Verification (SV) applications. It is compared with the original MFCC-based feature extraction method and also with one of its recent modifications. The work uses the multi-dimensional F-ratio as a performance measure in Speaker Recognition (SR) applications to compare the discriminative ability of different multi-parameter methods.

KEY WORDS

Speaker Verification, MFCC, F-Ratio, City block distance, Speaker Recognition, Combined F-Ratio

1 Introduction

The goal of the Automatic Speaker Verification (ASV) process is to discriminate among speakers. The effectiveness of speaker verification depends mainly on the accuracy of discrimination of the speaker models developed from speech features. The features extracted for the verification process must possess high discriminative power. Earlier feature extraction methods used Linear Predictive Coefficients, but these were found not efficient enough for the ASV process [1]. Furui suggested the use of the cepstrum [2][4], the inverse Fourier transform of the logarithm of the magnitude spectrum. The use of the cepstrum allows the similarity between two cepstral feature vectors to be computed as a simple Euclidean distance. Cepstral features were found to separate intraspeaker variability, arising from the age and emotional status of an individual, from interspeaker variability [10]. Further improvement was made with the Mel-Frequency Cepstral Coefficient (MFCC) approach [8], in which a psychoacoustic model of the human auditory system is emulated. MFCC uses a mel-scaled log filter bank of energies to extract speech features.

The filter bank analysis method uses a band of overlapped filters that approximate the frequency response of the basilar membrane in the cochlea of the human auditory system. Here, the spectral envelope is computed from the output energies of the bank of filters. The spectral envelope estimated from the filter banks tends to be influenced by speaker characteristics, background noise, channel distortion, vocal effects, etc. To suppress this undesired influence, and to make spectral comparison more reliable, this paper proposes a Modified MFCC technique.

Modification of the MFCC technique has been attempted before, by providing different weights in different energy bands, with some success [7]. The present work proposes a conceptually different modification technique with further performance improvement. The uni-dimensional F-Ratio has been used before [11] for parameter-level comparison in SR applications, but the present work shows the use of the multi-dimensional F-Ratio to compare different feature extraction methods, each generating multiple coefficients.

It is to be noted that 'Speaker Verification (SV)' is a subset of the Speaker Recognition (SR) problem, in which the identity claim of a person is accepted or rejected based on the degree of match of the features extracted from the speech samples. The other class of problem under SR is 'Speaker Identification (SI)', in which the speaker is identified from a group of speakers based on the highest match score of the features. In both problems the features extracted from the speech sample should have high discriminative ability to correctly verify (SV) or identify (SI) a speaker. The present work is thus useful for both Speaker Verification and Speaker Identification, i.e. Speaker Recognition as a whole.

The organization of this paper is as follows: Section 2 briefly describes the original MFCC algorithm and the concept of the F-Ratio. Section 3 describes the proposed method. Section 4 gives the experimental results, followed by the conclusion.

2 Theoretical Background

2.1 Pre-Processing of Speech Signals

The speech signal is a slowly time-varying signal; over short intervals it is often considered quasi-stationary. Therefore, short-time spectral analysis is the most common way to characterize the speech signal. This means the continuous speech signal is divided into short fixed-length frames of N samples each; very often successive frames overlap each other by M samples [1]. For our experiments we used frames of size N = 256 with an overlap of 50%, i.e. M = 128.

Window functions used for speaker recognition are generally the Hamming or Hanning window [2]. The concept here is to minimize spectral distortion by using the window to taper the signal at both ends, thus reducing the side effects caused by the signal discontinuity at the beginning and end of each frame due to framing. In our experiment each frame is multiplied by the Hamming window function.

y_i(n) = x_i(n) · w(n),   0 ≤ n ≤ N − 1    (1)

where

w(n) = 0.54 − 0.46 cos(2πn / (N − 1)),   0 ≤ n ≤ N − 1    (2)
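As a concrete illustration of the framing and windowing step of equations (1) and (2), the sketch below frames a signal into N = 256-sample frames with M = 128 samples of overlap and applies the Hamming window. NumPy is assumed, and the function name is ours.

```python
import numpy as np

def frame_and_window(signal, N=256, M=128):
    """Split a 1-D signal into frames of N samples overlapping by M
    samples (here 50%), then taper each frame with the Hamming window
    of equation (2)."""
    step = N - M                                          # frame advance
    num_frames = 1 + (len(signal) - N) // step
    n = np.arange(N)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))     # equation (2)
    return np.stack([signal[k * step : k * step + N] * w  # equation (1)
                     for k in range(num_frames)])

# Example: one second of a synthetic 440 Hz tone sampled at 8 kHz
x = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
frames = frame_and_window(x)
print(frames.shape)   # (61, 256)
```

Note that the Hamming window endpoints fall to 0.08 rather than zero, the usual compromise between main-lobe width and side-lobe level.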

2.2 MFCC Method

The MFCC front end is a popular choice for state-of-the-art SV systems, as it is based on the human peripheral auditory system [9]. Here we explain how MFCC features are extracted.

• Frequency Spectrum: Once we have the framed and windowed speech samples, we convert each frame of N samples from the time domain to the frequency domain. We use the FFT algorithm for this conversion.

• Mel-Frequency Domain: Psychoacoustical studies have shown that human perception of the frequency content of speech sounds does not follow a linear scale. Thus for each tone with an actual frequency f measured in Hz, a subjective pitch is measured on a scale called the 'Mel scale'. The Mel frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1 kHz. As a reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 Mels. We can therefore use the following approximate formula to compute the Mels for a given frequency f in Hz:

F_mel = (1000 / log 2) · log(1 + F_hertz / 1000)    (3)
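Equation (3) can be coded directly; the sketch below (function name ours) maps Hz to Mels and reproduces the 1 kHz → 1000 Mel reference point.

```python
import math

def hz_to_mel(f_hz):
    """Equation (3): F_mel = (1000 / log 2) * log(1 + F_hz / 1000)."""
    return (1000.0 / math.log(2.0)) * math.log(1.0 + f_hz / 1000.0)

print(hz_to_mel(1000.0))   # ~1000.0: the 1 kHz reference tone maps to 1000 Mels
print(hz_to_mel(500.0))    # ~585: the scale is near-linear below 1 kHz
```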

• Critical Band Filters: These filters are triangular in shape, overlapping in nature, and arranged linearly in the Mel frequency domain. When the filter arrangement is viewed in the ordinary frequency domain, however, the filters are arranged as shown in Figure 1. According to Davis & Mermelstein [3], for an 8 kHz sampling rate 20 filters are used over the frequency range 0−4 kHz. Of these 20 filters, the first ten are spaced linearly over the range 0−1 kHz, the next 5 logarithmically over 1−2 kHz, and the last 5 logarithmically over 2−4 kHz. The base of each triangular filter is defined by the center frequencies of the neighboring filters. Applying the filter bank in the frequency domain amounts to applying these triangular filters to the frequency spectrum. A useful way of thinking about this Mel-warping filter bank is to view each filter as a histogram bin: when we apply the filters to the spectrum we take a weighted sum of the spectrum for each filter. As we get one weighted output per filter, we end up with 20 filter bank outputs.

Figure 1. MFCC Filterbanks

As we want the coefficients to carry the speaker-specific 'vocal tract' characteristics, we use the cepstral representation, which also provides a good representation of the local spectral properties of the signal for the given frame. We include the cepstral character in the features by taking the log of the filter bank outputs followed by a Discrete Cosine Transform (DCT) to convert the spectrum back to the time domain. The 20 outputs are the 'Mel-Frequency Cepstral Coefficients (MFCC)'. Of these 20 coefficients we exclude the first, since it represents the mean value of the input signal, which carries little speaker-specific information. This leaves 19 MFCC coefficients, of which generally the first 12 are used for SR applications.
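The log-plus-DCT step described above can be sketched as follows, assuming the 20 filter bank energies of a frame are already available; the DCT is written out in the form used later in equation (11), and the function name is ours.

```python
import numpy as np

def mfcc_from_filterbank(S, num_coeffs=12):
    """Take logs of the filter bank outputs S[i] and apply a DCT to get
    cepstral coefficients; drop the first coefficient (the frame mean)
    and keep the next num_coeffs, i.e. coefficients 2-13."""
    M = len(S)
    log_S = np.log(S)
    i = np.arange(1, M + 1)
    C = np.array([np.sum(log_S * np.cos(n * (i - 0.5) * np.pi / M))
                  for n in range(1, M + 1)])
    return C[1:1 + num_coeffs]

S = np.geomspace(1.0, 100.0, 20)       # arbitrary positive energies
print(mfcc_from_filterbank(S).shape)   # (12,)
```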

2.3 F-Ratio

F-ratio is a statistical measure used in the analysis of variance when multi-cluster data are available [8]. The calculation of the F-ratio over N speakers, with R utterances per speaker, can be formulated as follows.

F = R · [ ∑_{i=1}^{N} (m_i − m)² ] / [ ∑_{i=1}^{N} ∑_{j=1}^{R} (μ_i^j − m_i)² ]    (4)

In the above equation, m is the estimated overall mean value of the parameter, m_i is the estimated mean value of the parameter for the ith speaker, and μ_i^j is the parameter value from the jth of the R utterances of the ith speaker. The calculations of m and m_i are given in equations (5) and (6) respectively.

m = (1/N) ∑_{i=1}^{N} m_i    (5)


m_i = (1/R) ∑_{j=1}^{R} μ_i^j    (6)

The feature extraction process on an utterance may be viewed as a mapping of the utterance into a multidimensional parameter space. It might also be said that the discrimination of speakers increases if the statistical distributions of different speakers are concentrated at widely different locations in the parameter space. In a one-dimensional parameter space, the ratio of interspeaker to intraspeaker variance, which is called the F-Ratio, gives a good measure of the discriminative performance of the evaluated feature. The higher the value of the F-Ratio, the better the discriminative ability.
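Under the definitions of equations (4)-(6), a minimal sketch of the one-dimensional F-Ratio is given below, assuming the parameter values are arranged as an N × R array (one row per speaker); the names are ours.

```python
import numpy as np

def f_ratio(p):
    """One-dimensional F-Ratio: interspeaker variance of the speaker
    means over intraspeaker variance, for p of shape (N speakers,
    R utterances), following equations (4)-(6)."""
    m_i = p.mean(axis=1)                     # per-speaker means, eq. (6)
    m = m_i.mean()                           # overall mean, eq. (5)
    inter = np.sum((m_i - m) ** 2)
    intra = np.sum((p - m_i[:, None]) ** 2)
    return p.shape[1] * inter / intra        # eq. (4)

# Two speakers, two utterances each; well-separated means score high
p = np.array([[0.0, 2.0], [10.0, 12.0]])
print(f_ratio(p))   # 25.0
```

Moving the speaker means further apart (or shrinking the per-speaker scatter) raises the score, matching the interpretation above.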

2.4 City Block Distance

City block distance gives a measure of the similarity between two parameter vectors. This distance measure has already been used in SR applications [6]. In a Euclidean distance measure the effect of outliers due to spurious disturbances is enhanced by the squaring; the city block distance dampens the effect of outliers. Figure 2 shows the city block distance calculation of the coefficients from their average along the magnitude scale. The modification technique proposed here performs frame-based compensation that needs a distance calculation, and we use the city block distance for that.

Figure 2. City Block distance measurement
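The dampening effect mentioned above can be seen numerically. In this illustrative sketch (values and names ours), a single spurious outlier inflates a squared-distance measure far more than the city block measure.

```python
import numpy as np

def city_block(x, ref):
    """City block (L1) distance of the coefficients from a reference value."""
    return np.sum(np.abs(x - ref))

def euclidean_sq(x, ref):
    """Squared Euclidean distance, which squares each deviation."""
    return np.sum((x - ref) ** 2)

clean = np.array([1.0, 1.2, 0.9, 1.1])
spiked = np.array([1.0, 1.2, 0.9, 5.0])    # one spurious outlier
ref = clean.mean()                          # 1.05

# The outlier grows the L1 measure ~11x but the squared measure ~313x
print(city_block(spiked, ref) / city_block(clean, ref))
print(euclidean_sq(spiked, ref) / euclidean_sq(clean, ref))
```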

3 Proposed Method

While analyzing, the speech signal is usually divided into frames, and we calculate the first 12 MFCC coefficients (coefficients 2−13) for each frame. The weighting function we introduce is unique to each frame of an utterance of each speaker. The F-Ratio is used as the performance index to measure the effectiveness of the new method.

3.1 Frame Based Compensation

As explained in Section 2.2, in the original MFCC method, for every window (frame) of the speech samples of an utterance of every user, we calculate the 20 filter bank outputs, take the log of the filter outputs, and take their DCT to get 20 MFCC coefficients. In our method, once we get the 20 filter bank outputs, we proceed to calculate their average. Next, we calculate the city block distance of each filter bank output from the average and then sum the log of these distances over all filters. We call this sum the 'Sweep'; it is unique for each frame (window) since it is calculated from the filter bank outputs of that frame.

The sweep represents the total variation in magnitude of the filter outputs for each frame and gives a measure of the magnitude spread of the coefficients, analogous to the variance in a Euclidean distance measure. We propose a compensation based on this magnitude of spread, through a frame-based weighting function, to preserve the speaker-dependent information in different frames. The variation of intensity/loudness at different segments of a spoken word may influence the magnitude of the coefficients, affecting cluster formation in the parameter space for a speaker. The proposed method is a frame-based technique to reduce these effects through normalization of the coefficients in each frame by their total spread, so that the coefficients of all frames are brought to the same level of spread. This also minimizes the effect of changes in background noise level in SR applications where the speaker is moving from one environment to another while speaking.

Thus, for each frame, the calculation is as follows:

If S_k[i], i = 1, 2, . . . , 20, are the filter bank outputs, then

avg = (1/20) ∑_{i=1}^{20} S_k[i]    (7)

and,

M_k[i] = | S_k[i] − avg |    (8)

therefore,

Sweep = ∑_{i=1}^{20} log M_k[i]    (9)

So finally the weighting function is defined as

W[i] = log( S_k[i] / Sweep )    (10)

The original MFCC parameters, as explained in Section 2.2, are given by

C_n = ∑_{i=1}^{M} log S_k[i] · cos( n (i − 1/2) π / M )    (11)

where n = 1, 2, . . . , M. The modification of the above through the weighting function we propose gives the Modified MFCC coefficients as:

C_n = ∑_{i=1}^{M} log S_k[i] · W[i] · cos( n (i − 1/2) π / M )    (12)
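Putting equations (7)-(12) together, a minimal sketch of the proposed frame-based compensation for one frame's filter bank outputs might look like the following (function name ours; note that log M_k[i] in equation (9) requires every output to differ from the frame average):

```python
import numpy as np

def modified_mfcc(S, num_coeffs=12):
    """Modified MFCC for one frame's 20 filter bank outputs S[i],
    following equations (7)-(12)."""
    M = len(S)
    avg = np.mean(S)                         # eq. (7)
    Mk = np.abs(S - avg)                     # eq. (8): city block deviation
    sweep = np.sum(np.log(Mk))               # eq. (9): total magnitude spread
    W = np.log(S / sweep)                    # eq. (10): frame weighting
    i = np.arange(1, M + 1)
    C = np.array([np.sum(np.log(S) * W * np.cos(n * (i - 0.5) * np.pi / M))
                  for n in range(1, M + 1)]) # eq. (12)
    return C[1:1 + num_coeffs]               # coefficients 2-13

S = 2.0 * np.arange(1, 21)                   # arbitrary positive energies
print(modified_mfcc(S).shape)                # (12,)
```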

Earlier, S. Y. Park and H. S. Kim [7] proposed a modification of MFCC which emphasizes the high-energy parts of the log filter bank energies in order to reduce the effect of noise. We have included the results of that method along with the original MFCC method for comparison with our proposed method in the results section.

3.2 Combined F-Ratio

The combined F-Ratio that we have used as the performance index is based on the multidimensional F-Ratio. The F-Ratio as explained in Section 2.3 considers one coefficient at a time. For our case we use the extended formula for the multidimensional case, so that we can get a single performance score for all the features of a method that generates multiple coefficients. Modifying the earlier formula in equation (4), we get:

F = R · [ ∑_{k=1}^{M} ∑_{i=1}^{N} (m_i^k − m_k)² ] / [ ∑_{k=1}^{M} ∑_{i=1}^{N} ∑_{j=1}^{R} (μ_i^{j,k} − m_i^k)² ]    (13)

where M is the number of coefficients. As k varies from 1 to M, the performance of all the coefficients is taken into account in the combined F-Ratio.
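A minimal sketch of equation (13), assuming the parameter values form an N × R × M array (speakers × utterances × coefficients); names are ours.

```python
import numpy as np

def combined_f_ratio(p):
    """Combined (multi-dimensional) F-Ratio of equation (13) for p of
    shape (N speakers, R utterances, M coefficients): the inter- and
    intraspeaker terms are pooled over all M coefficients."""
    m_ik = p.mean(axis=1)                        # (N, M) per-speaker means
    m_k = m_ik.mean(axis=0)                      # (M,) overall means
    inter = np.sum((m_ik - m_k) ** 2)            # summed over k and i
    intra = np.sum((p - m_ik[:, None, :]) ** 2)  # summed over k, i and j
    return p.shape[1] * inter / intra

# 2 speakers, 2 utterances each, 1 coefficient
p = np.array([[[-1.0], [1.0]], [[9.0], [11.0]]])
print(combined_f_ratio(p))   # 25.0, matching the one-dimensional case
```

With M = 1 this reduces to the one-dimensional F-Ratio of Section 2.3, which is a useful sanity check.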

The combined F-Ratios of the MFCC and the modified methods are given in the results section.

4 Results

We have used a database of 25 speakers, of which 17 are male and the rest are female. Each user has uttered a combinational lock phrase '24-32-75' and a code phrase 'indian institute of technology' (iit) 10 times each.

Figures 3 and 4 show the typical variation of the sweep over the frames for the two phrases, '24-32-75' and 'indian institute of technology', taken from one utterance of a speaker.

Figure 3. Sweep spread for the phrase '24-32-75'

Figure 4. Sweep spread for the phrase 'indian institute of technology' (iit)

We present the discriminative scores for the first 12 coefficients (2−13) for the original MFCC, Park and Kim's method [7], and our proposed method in Tables 1 and 2, where each coefficient is compared column-wise by the unidimensional F-Ratio. Table 1 presents the result for the combinational lock phrase, while Table 2 presents the result for the code phrase. Table 3 presents the combined F-Ratio scores, where all three methods as a whole are compared by the multidimensional F-Ratio for both phrases, to determine the superiority of a method as a whole.

Table 1. Uni-Dimensional F-Ratios for '24-32-75'

Coefficient   MFCC      Park & Kim's   Proposed
No.                     Method         Method
2             15.2666   14.4738        22.6352
3             20.8341   22.0373        25.8932
4              8.1255    8.1942         7.9165
5             11.4428   11.8604        10.8163
6              8.0833    9.3430        10.5811
7             16.0001   16.5587        19.1370
8             19.2568   23.7924        24.8740
9             17.3713   20.5866        23.2887
10            22.2010   22.2402        28.0263
11            10.2145   11.0794         9.8420
12            10.1094   12.0702        10.2165
13             9.5971   11.2017        11.3753

Table 2. Uni-Dimensional F-Ratios for 'indian institute of technology (iit)'

Coefficient   MFCC      Park & Kim's   Proposed
No.                     Method         Method
2              6.8917    6.9818         6.7746
3              6.2051    5.7592         5.9661
4              2.5705    2.9950         3.8287
5              6.5425    6.9949         6.9407
6              6.5271    6.4786         8.2943
7             11.3389   11.1448        14.2585
8             31.8086   33.5240        35.6389
9              9.0726   10.1154        10.8427
10            16.5513   19.2648        19.8984
11            19.4532   21.4549        23.2898
12            16.7778   19.8581        21.9478
13            10.2897   12.6341        12.6898

The combined results of Table 3 show the proposed method performing much better than the original MFCC as well as Park and Kim's modification. Park and Kim's method [7], in turn, exhibits better performance than the original MFCC.

The results presented in Tables 1 and 2 can be used to grade the MFCC coefficients in order of their discriminative abilities; e.g. coefficient 8 shows a higher F-Ratio in both tables (i.e. for both phrases) for the original as well


Table 3. Combined F-Ratio

MFCC Park & Kim’s ProposedMethod Method

24-32-75 12.5850 13.2618 14.1792indian institute of 7.6155 8.0849 9.0238technology

as the modified methods, whereas coefficient 6 shows relatively low discriminative ability, given its lower F-Ratio score. Based on this criterion the coefficients can be graded according to their discriminative ability, and the better-performing coefficients can be chosen if there is a need to reduce computational cost, time, and memory.

5 Conclusion

The proposed modification of MFCC features shows enhanced discriminative ability of the coefficients, which is important in Speaker Verification applications. The work also shows the usefulness of the multi-dimensional F-Ratio as a comparative performance measure of different feature extraction algorithms that generate multiple coefficients. The ability to grade the coefficients of a particular method in order of their discriminative performance helps in choosing the better-performing ones where there are constraints on the number of coefficients that can be included.

Acknowledgement

The work is partly supported by the Department of Science & Technology, Government of India.

References

[1] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition (Prentice-Hall, Englewood Cliffs, N.J., 1993).

[2] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals (Prentice-Hall, Englewood Cliffs, N.J., 1978).

[3] S. B. Davis and P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-28(4), August 1980, pp. 357-366.

[4] Minh N. Do, An Automatic Speaker Recognition, Audio Visual Communications Laboratory, Swiss Federal Institute of Technology, Lausanne, Switzerland.

[5] W. W. Hung and H. C. Wang, On the use of weighted filter bank analysis for the derivation of robust MFCCs, IEEE Signal Processing Letters, Vol. 8, March 2001, pp. 70-73.

[6] S. Ong and C. H. Yang, A comparative study of text-independent speaker identification using statistical features, International Journal on Computer Engineering Management, Vol. 6(1), 1998.

[7] Sun Young Park and Hyung Soon Kim, Modified MFCC features for speech recognition, Proceedings of ICSP-2001, Vol. 2, pp. 659-662, August 2001.

[8] J. P. Campbell, Speaker recognition: A tutorial, Proceedings of the IEEE, Vol. 85(9), pp. 1437-1462, 1997.

[9] S. Molau, M. Pitz, R. Schluter, and H. Ney, Computing Mel-frequency cepstral coefficients on the power spectrum, Proceedings of IEEE ICASSP-2001, Vol. 1, pp. 73-76, May 2001.

[10] P. Premakanthan and W. B. Mikhael, Speaker verification/recognition and the importance of selective feature extraction: Review, Proceedings of the 44th IEEE Midwest Symposium, Vol. 1, pp. 14-17, Aug. 2001.

[11] Goutam Saha and Malyaban Das, On use of singular value ratio spectrum as feature extraction tool in speaker recognition application, CIT-2003, pp. 345-350, Bhubaneswar, Orissa, India, 2003.