A DTW-based probability model for speaker feature analysis and data mining


Jingwei Liu *, Qiansheng Cheng, Zhongguo Zheng, Minping Qian

Department of Information Science, School of Mathematical Science, Peking University, Beijing 100871, China

Received 14 June 2001

Abstract

This paper is a contribution to probabilistic data mining and pattern recognition. A DTW-based statistical model is proposed to explore the subspace structures of speaker feature space for feature evaluation, dimension reduction and inter-class information discovery in pattern space. We demonstrate its usefulness in isolated-digit speaker identification, and the performance of the statistical model is compared with the standard DTW recognition rate in the experiment. We argue that the probability model can serve as a data mining tool. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Feature selection; Speaker identification; DTW; Data mining; Statistical learning

1. Introduction

Feature selection and dimension reduction have long been key problems in pattern recognition (Duda and Hart, 1973; Fukunaga, 1990), especially in acoustic processing by virtue of large-scale sample data (Sambur, 1975, 1976; Charlet and Jouvet, 1997; Kanedera et al., 1997; van Vuuren and Hermansky, 1998). Many techniques have been proposed to solve this problem, such as F-ratio, divergence, EER, HMM, genetic algorithms, neural networks and fuzzy methods (Sambur, 1975, 1976; Furui, 1986; Goldberg, 1989; Fukunaga, 1990; Campbell, 1997; Charlet and Jouvet, 1997; Kanedera et al., 1997; Basak et al., 1998; van Vuuren and Hermansky, 1998). The present paper proposes a novel general statistical model for probabilistic data mining of feature component importance and dissimilarity of classes in pattern space.

The model is motivated by the problem of describing the similarity and dissimilarity of classes for feature selection (Fukunaga, 1990) and by the exploration of data structure in speaker recognition (Liu, 1999). The traditional criteria of feature selection, F-ratio and divergence, have two deficiencies in acoustic feature evaluation: first, they treat the feature vectors as independent vectors in Euclidean distance space and ignore the acoustic or speaker information carried by the sequence of feature vectors within one utterance; second, they do not work in the multi-cluster case (Fukunaga, 1990; Campbell, 1997). To overcome these shortcomings, we introduce the dynamic time warping

Pattern Recognition Letters 23 (2002) 1271–1276


*Corresponding author.

E-mail address: [email protected] (J. Liu).



(DTW) distance into text-dependent feature evaluation and define the ratio of the max–min intra-class distance to the minimum inter-class distance to describe the dissimilarity of any two classes (Wilpon and Rabiner, 1985; Rahman and Fairhurst, 1997; Rabiner and Juang, 1993). Since the available data points are limited in practical pattern recognition tasks, and sometimes local and disjoint, this ratio measures the conjunction of the intra-class samples in comparison with the inter-class distance.

Based on sampling theory, we construct two discrete-valued random variables, statistic-1 and statistic-2, and two corresponding criteria on the feature subspaces, such that statistic-1 represents the dissimilarity of any r classes and statistic-2 supports the comparison of different features. Theoretically, the model has a twofold function: for fixed r, statistic-1 evaluates the performance of different features; conversely, for a given feature and different sampling sizes r, statistic-1 reveals the statistical information of the different r-class subsets in the given feature space.

We then investigate the dimension importance of 16-order LPCC and 16-order MFCC with the statistical model and the (l−r) algorithm (Fukunaga, 1990; Pandit and Kittler, 1998) in a text-dependent speaker identification simulation; the corresponding DTW recognition rates are also given in the experiment.

This paper is organized as follows: Section 2 gives the probability model of the statistics and criteria. Section 3 gives the experimental results of the statistics for feature selection; the recognition rates of DTW are also given in Section 3. Discussions and conclusions are given in Section 4.

2. Construction of probability model

2.1. DTW measure and standard template

The most effective measure of the distance between any two utterances is the DTW technique (Furui, 1986, 1997; Rabiner and Juang, 1993; Yang and Chi, 1995; Campbell, 1997). For any two utterances $R = (R(1), \ldots, R(N))$ and $T = (T(1), \ldots, T(M))$, where $R(i)$ and $T(j)$ are both $K$-dimensional vectors, $R(i) = (r_i(1), \ldots, r_i(K))$ and $T(j) = (t_j(1), \ldots, t_j(K))$, the DTW distance of $R$ and $T$ is

$$d_{\mathrm{DTW}}(R, T) = \min_{\Phi(\cdot)} \sum_{n_i = 1,\; m_i = \Phi(n_i)}^{N} d(n_i, m_i), \qquad d(n_i, m_i) = \sum_{j=1}^{K} w_j^2 \bigl(r_{n_i}(j) - t_{m_i}(j)\bigr)^2, \qquad (1)$$

where $m_i = \Phi(n_i)$ is the lattice path from $(1, 1)$ to $(N, M)$, with the path's slope ranging from $1/2$ to $2$ and two frames relaxed at the final point (Yang and Chi, 1995), and $w_j \geq 0$, $j = 1, \ldots, K$. Here we choose $\{w_j = j\}_{j=1}^{K}$ for LPCC (Juang et al., 1987) and MFCC, and $\{w_j = 1\}_{j=1}^{K}$ for PARCOR.

The reference template is calculated as the average (along the DTW path) of one person's utterances (Rabiner and Juang, 1993; Pandit and Kittler, 1998).
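For concreteness, the following Python sketch computes the weighted DTW distance of Eq. (1) by dynamic programming. It is a minimal sketch under simplifying assumptions rather than the implementation used in the paper: the slope constraint (1/2 to 2) and the two-frame endpoint relaxation are omitted, and the function name and the unconstrained step pattern are our own choices.

```python
import numpy as np

def dtw_distance(R, T, w):
    """Weighted DTW distance between utterances R (N x K) and T (M x K).

    Sketch of Eq. (1): the frame distance is the weighted squared Euclidean
    distance d(n, m) = sum_j w_j^2 (r_n(j) - t_m(j))^2. The slope constraint
    and endpoint relaxation described in the paper are omitted for clarity.
    """
    N, M = len(R), len(T)
    w2 = np.asarray(w, dtype=float) ** 2
    # Frame-level local distances d(n, m).
    d = np.array([[np.sum(w2 * (R[n] - T[m]) ** 2) for m in range(M)]
                  for n in range(N)])
    # Cumulative cost with the usual symmetric step pattern.
    D = np.full((N, M), np.inf)
    D[0, 0] = d[0, 0]
    for n in range(N):
        for m in range(M):
            if n == 0 and m == 0:
                continue
            prev = min(D[n - 1, m] if n > 0 else np.inf,
                       D[n, m - 1] if m > 0 else np.inf,
                       D[n - 1, m - 1] if n > 0 and m > 0 else np.inf)
            D[n, m] = d[n, m] + prev
    return D[-1, -1]

# Example: two random 16-dimensional cepstral sequences, liftering weights w_j = j.
rng = np.random.default_rng(0)
R = rng.normal(size=(40, 16))
T = rng.normal(size=(35, 16))
print(dtw_distance(R, T, w=np.arange(1, 17)))
```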

2.2. Sampling on feature space

Suppose there are $L$ classes in a $K$-dimensional feature space. The standard approach is to define a criterion on the feature space for feature evaluation, such as F-ratio or divergence. Obviously, the number of feature component combinations is $2^K - 1$, so exhaustive search is a typical NP-hard problem (De Backer et al., 1998). On the assumption that the components are independent, Sambur, Furui and others proposed to study each dimension individually; Fukunaga studied the optimized (l−r) algorithm, which selects dimensions by interleaving forward selection and backward elimination.

If the class information is considered, the problem is also complex, since there are $2^L - 1$ possibilities. We therefore consider any $r$ fixed classes, where $1 \leq r \leq L$. The distribution over the given $r$ classes reveals the effect of any fixed feature components.

Suppose the $L$ classes are $\Omega = \{C_1, \ldots, C_L\}$. Given $r$ ($1 \leq r \leq L$), denote $\Omega_r = \{\omega_1, \omega_2, \ldots, \omega_{C_L^r}\}$, where $\omega_i$, $1 \leq i \leq C_L^r$, denotes a random sampling of $r$ classes from $\Omega$. Since $r = 1$ is the traditional sampling method and $r = 2$ is the traditional two-class case, the sampling method presented above is, for general $r$, a generalized sampling style. Therefore, for given $r$,


we obtain all the subspaces of any $r$ classes in the feature space.

Let $M_i$ denote the sample number of class $C_i$, where $\sum_{i=1}^{L} M_i = M$. For any class, in order to guarantee the conjunction of its samples, we define the intra-class measure

$$d_{\mathrm{Intra}}(C_i) = \max_{1 \leq k \leq M_i} \; \min_{1 \leq j \neq k \leq M_i} d_{\mathrm{DTW}}(x^{(k)}, x^{(j)}), \qquad (2)$$

where $x^{(k)}, x^{(j)} \in C_i$, $i = 1, \ldots, L$; the inter-class metric of any two classes

$$d_{\mathrm{DTW}}(C_i, C_j) = \min_{1 \leq k \leq M_i} \; \min_{1 \leq l \leq M_j} d_{\mathrm{DTW}}(x^{(k)}, x'^{(l)}), \qquad (3)$$

where $x^{(k)} \in C_i$, $x'^{(l)} \in C_j$, $i, j = 1, \ldots, L$; and the inter-class metric separating one class from the other $L - 1$ classes

$$d_{\mathrm{Inter}}(C_i) = \min_{1 \leq j \neq i \leq L} d_{\mathrm{DTW}}(C_i, C_j), \qquad i = 1, \ldots, L. \qquad (4)$$

For $\omega_i \in \Omega_r$ with classes $C_{i_1}, \ldots, C_{i_r}$, we define the intra-class and inter-class distances

$$\mathrm{Intrascalar}(\omega_i) = \frac{1}{r} \sum_{j=1}^{r} d_{\mathrm{Intra}}(C_{i_j}), \qquad (5)$$

$$\mathrm{Interscalar}(\omega_i) = \max_{1 \leq j \leq r} d_{\mathrm{Inter}}(C_{i_j}), \qquad (6)$$

$$T_i^r = \frac{\mathrm{Intrascalar}(\omega_i)}{\mathrm{Interscalar}(\omega_i)}, \qquad 1 \leq i \leq C_L^r. \qquad (7)$$

Let $T^r$ be a discrete-valued variable defined on $\{T_1^r, T_2^r, \ldots, T_{C_L^r}^r\}$; in short,

$$T^r = \{T_1^r, T_2^r, \ldots, T_{C_L^r}^r\}. \qquad (8)$$

Then the distribution of $T^r$ reflects the divergence of any $r$ classes.
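For concreteness, the statistics of Eqs. (2)–(7) can be computed from a precomputed matrix of pairwise DTW distances; the sketch below is illustrative only (the function name and the distance-matrix interface are our assumptions, with distances taken, for instance, from the dtw_distance sketch above).

```python
import numpy as np
from itertools import combinations

def t_r_values(D, labels, r):
    """Empirical values T_i^r of Eq. (7) over all C(L, r) class subsets.

    D is an (n x n) matrix of pairwise DTW distances between all samples;
    labels[i] is the class label of sample i. Illustrative sketch only.
    """
    labels = np.asarray(labels)
    classes = sorted(set(labels.tolist()))
    idx = {c: np.flatnonzero(labels == c) for c in classes}

    def d_intra(c):
        # Eq. (2): max over samples of the distance to the nearest intra-class sample.
        sub = D[np.ix_(idx[c], idx[c])].astype(float)
        np.fill_diagonal(sub, np.inf)
        return sub.min(axis=1).max()

    def d_inter(c):
        # Eqs. (3)-(4): smallest distance from class c to any sample of another class.
        others = np.concatenate([idx[o] for o in classes if o != c])
        return D[np.ix_(idx[c], others)].min()

    intra = {c: d_intra(c) for c in classes}
    inter = {c: d_inter(c) for c in classes}
    values = []
    for omega in combinations(classes, r):                # each omega_i in Omega_r
        intrascalar = np.mean([intra[c] for c in omega])  # Eq. (5)
        interscalar = max(inter[c] for c in omega)        # Eq. (6)
        values.append(intrascalar / interscalar)          # Eq. (7)
    return np.array(values)
```

From the returned sample, the quantities used in Criterion 1 below are estimated as `values.mean()` for $E(T^r)$ and `(values <= 1).mean()` for $F_{T^r}(1)$.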

Criterion 1.

1. The smaller $E(T^r)$, the better the feature.
2. The larger $F_{T^r}(1) = P(T^r \leq 1)$, the better the feature. In the graph, the higher the intersection of the distribution with the line $y = 1.0$, the better the feature.

Criterion 1 evaluates the feature set by the values of $E(T^r)$ and $F_{T^r}(1)$ on $\Omega_r$. Moreover, we propose the following statistic to describe the detailed structure of two corresponding features on the same $\Omega_r$:

$$R_i^r = \frac{T_i^r}{T_i'^r}, \qquad 1 \leq i \leq C_L^r, \qquad (9)$$

where $T_i^r$ and $T_i'^r$ are the values of the statistic for two different features on the same random event $\omega_i$. Similar to the definition of $T^r$, let

$$R^r = \{R_1^r, R_2^r, \ldots, R_{C_L^r}^r\}. \qquad (10)$$

Then the distribution of $R^r$ reflects the local and dynamic inter-class divergence effect of the two features.

Criterion 2.

1.
$$\frac{P(R^r > 1)}{P(R^r \leq 1)} \begin{cases} < 1, & \text{the first feature is better than the second;} \\ > 1, & \text{the second feature is better than the first;} \\ = 1, & \text{no distinction.} \end{cases} \qquad (11)$$
2. The larger $F_{R^r}(1) = P(R^r \leq 1)$, the better the first feature relative to the second.

The above criterion can be used to distinguish different features on the same subspace. From the statistical learning viewpoint, $E(T^1)$ is equivalent to the F-ratio under the DTW distance. Therefore, $E(T^1)$ is taken as a criterion for feature selection, and $E(T^2)$ is a dissimilarity measure of any two classes, similar to the dissimilarity defined in Rahman and Fairhurst (1997).
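On the same $\Omega_r$, Criterion 2 reduces to elementwise ratios of the two features' $T^r$ values; a short illustrative sketch follows (the helper name is hypothetical, not the authors' code).

```python
import numpy as np

def compare_features(T_feature1, T_feature2):
    """Apply Criterion 2 to the T^r values of two features on the same Omega_r.

    Returns the ratio P(R^r > 1) / P(R^r <= 1) of Eq. (11) and F_{R^r}(1).
    Illustrative helper only.
    """
    R = np.asarray(T_feature1) / np.asarray(T_feature2)   # Eqs. (9)-(10)
    p_gt = (R > 1).mean()
    p_le = (R <= 1).mean()
    # A ratio < 1 (and a larger F_{R^r}(1)) favours the first feature.
    return p_gt / p_le, p_le

# Example usage with T^2 samples from t_r_values above:
# T1 = t_r_values(D_lpcc, labels, 2); T2 = t_r_values(D_mfcc, labels, 2)
# ratio, F1 = compare_features(T1, T2)
```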

3. Experiments and conclusions

3.1. Database

The experiment involves a total of 4116 utterances of the 10 isolated English digits $(0, 1, \ldots, 9)$ from TI46, belonging to 16 speakers (8 male and 8 female), with 1584 utterances for training and 2532 utterances for testing. The speech signal was sampled at 12,500 Hz and quantized with 16 bits. After pre-emphasis with $H(z) = 1 - 0.95 z^{-1}$, each utterance was analyzed over contiguous speech frames of 20.48 ms (256 points) with a 10.24 ms (128-point) overlap. After multiplication by a Hamming window, 12-order LPC, 12-order PARCOR and 16-order MFCC features are extracted from each frame; the 12-order LPC is then used to calculate 16-order LPCC. Features extracted from the training set are used to build the statistical model, and features extracted from the test set are used to calculate the recognition rate.
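As an illustration of this front end, the following sketch performs the pre-emphasis, framing and windowing steps; the function name and framing arithmetic are our own illustrative choices, and the cepstral analysis itself is omitted.

```python
import numpy as np

def frame_signal(signal, frame_len=256, shift=128, alpha=0.95):
    """Pre-emphasis H(z) = 1 - alpha z^{-1}, then overlapping Hamming-windowed frames.

    Sketch of the front end described above: 20.48 ms frames (256 points)
    with a 10.24 ms (128-point) shift at 12,500 Hz. Assumes
    len(signal) >= frame_len.
    """
    signal = np.asarray(signal, dtype=float)
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    window = np.hamming(frame_len)
    n_frames = 1 + (len(emphasized) - frame_len) // shift
    return np.stack([emphasized[i * shift : i * shift + frame_len] * window
                     for i in range(n_frames)])

# Each row of the returned array then feeds the LPC / PARCOR / MFCC analysis.
```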

3.2. Model application to feature extraction

In the experiment, Criterion 1 with $E(T^2)$ is applied to compare the effects of 16-order LPCC and 16-order MFCC in speaker identification, with weights $\{w_i = 1\}_{i=1}^{12}$ for 12-order PARCOR and $\{w_i = i\}_{i=1}^{16}$ (Juang et al., 1987) for both 16-order LPCC and 16-order MFCC.

First, the statistical models of 12-order PARCOR, 16-order LPCC and 16-order MFCC are constructed, respectively. Then we calculate the distributions and mathematical expectations of their $T^r$. Since the sampling scheme is identical for each digit, we calculate the average $E(T^r)$ and $F_{T^r}(1)$ over the 10 digits. Finally, the reference templates are calculated in each person's feature space on the training set for each digit, and the recognition rate is obtained by matching the test set against the training set.

Table 1 shows that $E(T^2)$ and $F_{T^2}(1)$ coincide with the recognition rate.

To exploit the subtle effects of different features on the same subspace, we take the subspace of 12-order PARCOR as a reference and compare the effects of 16-order LPCC and 16-order MFCC relative to it (see Table 2). Relative to the 12-order PARCOR reference, 16-order LPCC is better than 16-order MFCC.

The experiment demonstrates that Criteria 1 and 2 agree with the recognition rate. For a fixed feature, Criterion 1 shows inter-class information, while for a fixed sampling style, Criterion 1 compares the information of different features.

3.3. Feature selection

The feature component importance, namely the dimension importance, of MFCC and LPCC is discussed in this section.

In feature component research, the standard method is to define a criterion and select the feature set by that criterion. The best-known algorithm is the (l−r) algorithm, which combines forward and backward selection steps; while the selected set is considered, the rejected set ought to be considered simultaneously. A sketch of this selection loop is given below. Figs. 1 and 2 show that for 16-order MFCC and 16-order LPCC with the weights $\{w_i = i\}_{i=1}^{16}$, the dimension importance is similar, and the F-ratio cannot reflect the dimension importance for speaker recognition.
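As promised above, here is a hedged sketch of the plus-l, take-away-r selection loop, with the criterion to minimise (for example, the average $E(T^1)$ on the training set) supplied as a callable. The `score` parameter and the whole function are our illustration of the Fukunaga (1990) scheme, not the authors' code.

```python
def plus_l_minus_r(dims, score, l=2, r=1, target=6):
    """Plus-l, take-away-r feature selection (after Fukunaga, 1990).

    dims: iterable of feature component indices, e.g. range(16).
    score: hypothetical callable mapping a frozenset of components to the
           criterion to minimise (e.g. average E(T^1) on the training set).
    Requires l > r so the selected set grows each round.
    """
    dims = list(dims)
    selected = frozenset()
    while len(selected) < target:
        for _ in range(l):                      # forward: add the best component
            candidates = [d for d in dims if d not in selected]
            if not candidates:
                break
            selected |= {min(candidates, key=lambda d: score(selected | {d}))}
        for _ in range(r):                      # backward: drop the least useful one
            if len(selected) <= 1:
                break
            selected -= {min(selected, key=lambda d: score(selected - {d}))}
    return sorted(selected)

# Example with a toy criterion (smaller is better); replace with E(T^1) in practice.
# best6 = plus_l_minus_r(range(16), score=lambda s: -len(s) + 0.01 * sum(s), target=6)
```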

Table 1
Average $E(T^2)$, $F_{T^2}(1)$ and DTW recognition rate

Feature   Average $E(T^2)$   Average $F_{T^2}(1)$   Average recognition rate (%)
MFCC      0.7372             0.8900                 89.31
LPCC      0.7216             0.9108                 92.66
PAR       0.7615             0.8817                 82.90

Fig. 1. F-ratio of 16-order MFCC and the recognition rate of one dimension and one dimension omitted.


The optimized dimension selection is carried out according to $E(T^1)$ and the (l−r) algorithm, and the results (Figs. 3 and 4) show the following.

When the optimized number of feature components is 6, the recognition rates of MFCC and LPCC are 84.85% and 86.09%, respectively; when the optimized number is 11, the MFCC recognition rate is 89.50%, better than the 89.31% of the full 16-order MFCC; and when the optimized number is 13, the MFCC recognition rate reaches 90.36%, better than the full 16-order MFCC throughout.

Since the statistic criterion is constructed on the training-set features alone, the train–test mismatch means that $E(T^1)$ cannot decrease strictly (Figs. 3 and 4); nevertheless, with the (l−r) optimization algorithm, $E(T^1)$ selects feature sets that coincide with the recognition rate.

The results above show that in a clean environment, with the weights $\{w_i = i\}_{i=1}^{16}$ and the (l−r) optimization algorithm, MFCC can enhance the recognition rate with fewer than 13 dimensions, but LPCC cannot enhance the recognition rate with fewer dimensions. Statistically, within a 2% enhancement, the dimension importance is equal.

4. Conclusion

In this paper, a probability model for DTW-based feature analysis and inter-class information discovery has been presented, which overcomes the deficiency of the traditional F-ratio in acoustic feature analysis. The simulation shows the performance of the model in speaker feature analysis and its data mining ability for feature component importance. Further research will focus on analysis with different weights and different combined features, and on combining neural networks and optimization algorithms into the statistical framework to enhance the model's efficiency. Another direction is to use the discovered

Fig. 3. The optimized dimension number of 16-order MFCC according to $E(T^1)$ and the (l−r) algorithm, and the corresponding recognition rate.

Fig. 4. The optimized dimension number of 16-order LPCC according to $E(T^1)$ and the (l−r) algorithm, and the corresponding recognition rate.

Fig. 2. F-ratio of 16-order LPCC and the recognition rate of one dimension and one dimension omitted.


knowledge to construct a multi-feature, multi-template recognition system and so enhance the whole DTW-based system performance. Clearly, the theory and methodology can also be applied to DTW-based speech feature analysis and DTW-based image feature extraction.

Acknowledgements

The authors would like to thank Professor Huisheng Chi for his warm guidance and for providing working conditions in the Center for Information Science, as well as Professor Yingshan Zhang for his valuable discussions.

References

Basak, J., De, R.K., Pal, S.K., 1998. Unsupervised feature selection using a neuro-fuzzy approach. Pattern Recognition Lett. 19, 997–1006.
Campbell, J.P., 1997. Speaker recognition: a tutorial. Proc. IEEE 85 (9), 1437–1462.
Charlet, D., Jouvet, D., 1997. Optimizing feature set for speaker verification. Pattern Recognition Lett. 18, 873–879.
De Backer, S., Naud, A., Scheunders, P., 1998. Non-linear dimensionality reduction techniques for unsupervised feature extraction. Pattern Recognition Lett. 19, 711–720.
Duda, R., Hart, P., 1973. Pattern Classification and Scene Analysis. Wiley, New York.
Fukunaga, K., 1990. Introduction to Statistical Pattern Recognition, second ed. Academic Press, London.
Furui, S., 1986. Research on individuality features in speech waves and automatic speaker recognition techniques. Speech Commun. 5 (2), 9–22.
Furui, S., 1997. Recent advances in speaker recognition. Pattern Recognition Lett. 18 (9), 859–872.
Goldberg, D.E., 1989. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley.
Juang, B.H., Rabiner, L.R., Wilpon, J.G., 1987. On the use of bandpass liftering in speech recognition. IEEE Trans. Acoust. Speech Signal Process. ASSP-35 (7), 947–954.
Kanedera, N., Arai, T., Hermansky, H., Pavel, M., 1997. On the importance of components of various modulation frequencies for speech recognition. In: Proc. EUROSPEECH'97, Rhodes, Greece.
Liu, J.W., 1999. Exploratory data analysis and speaker's feature analysis. M.S. thesis, Peking University.
Pandit, M., Kittler, J., 1998. Feature selection for a DTW-based speaker verification system. In: Proc. ICASSP'98.
Rabiner, L.R., Juang, B.H., 1993. Fundamentals of Speech Recognition. Prentice-Hall, Englewood Cliffs, NJ.
Rahman, A.F.R., Fairhurst, M.C., 1997. Selective partition algorithm for finding regions of maximum pairwise dissimilarity among statistical class models. Pattern Recognition Lett. 18, 605–611.
Sambur, M.R., 1975. Selection of acoustic features for speaker identification. IEEE Trans. Acoust. Speech Signal Process. ASSP-23 (2), 176–182.
Sambur, M.R., 1976. Speaker recognition using orthogonal linear prediction. IEEE Trans. Acoust. Speech Signal Process. ASSP-24 (4), 283–289.
van Vuuren, S., Hermansky, H., 1998. On the importance of components of the modulation spectrum for speaker verification. In: Proc. ICSLP, Sydney, Australia, November.
Wilpon, J.G., Rabiner, L.R., 1985. A modified K-means clustering algorithm for use in isolated word recognition. IEEE Trans. Acoust. Speech Signal Process. ASSP-33 (3), 587–594.
Yang, X.J., Chi, H.Sh., 1995. Digital Processing of Speech Signal. Press of Electronic Industry, P.R. China.
