
Text-Independent Speaker Identification Using Vowel Formants

Noor Almaadeed 1 & Amar Aggoun 2 & Abbes Amira 1,3

Received: 16 February 2014 / Revised: 22 March 2015 / Accepted: 17 April 2015 / Published online: 5 May 2015
© The Author(s) 2015. This article is published with open access at Springerlink.com

Abstract Automatic speaker identification has become a challenging research problem due to its wide variety of applications. Neural networks and audio-visual identification systems can be very powerful, but they have limitations related to the number of speakers. The performance drops gradually as more and more users are registered with the system. This paper proposes a scalable algorithm for real-time text-independent speaker identification based on vowel recognition. Vowel formants are unique across different speakers and reflect the vocal tract information of a particular speaker. The contribution of this paper is the design of a scalable system based on vowel formant filters and a scoring scheme for classification of an unseen instance. Mel-Frequency Cepstral Coefficients (MFCC) and Linear Predictive Coding (LPC) have both been analysed for comparison to extract vowel formants by windowing the given signal. All formants are filtered by known formant frequencies to separate the vowel formants for further processing. The formant frequencies of each speaker are collected during the training phase. A test signal is also processed in the same way to find vowel formants and compare them with the saved vowel formants to identify the speaker of the current signal. A score-based scheme allows the speaker with the highest number of matching formants to own the current signal. This model requires less than 100 bytes of data to be saved for each speaker to be identified, and can identify the speaker within a second. Tests conducted on multiple databases show that this score-based scheme outperforms the back propagation neural network and Gaussian mixture models. Usually, the longer the speech files, the more significant the improvements in accuracy.

Keywords Vowel formants · Speaker identification · Vowel recognition · Linear predictive coding · Mel-frequency cepstral coefficients

1 Introduction

The term Speaker Recognition [1] consists of Speaker Identification – the identification of the speaker speaking the current utterance – and Speaker Verification – the verification from the utterance of whether the speaker is who he claims to be. There are two types of speaker recognition, Text-dependent – in which the speaker is given a specific set of words to be uttered – and Text-independent – in which the speaker is recognised irrespective of what one is saying [2]. The current approach is aimed at Text-independent Speaker Identification.

A digital speech signal is a discrete-time signal sampled from a continuous-time signal that has been quantised upon analog-to-digital conversion. Each sample is represented by one or more bytes (e.g. one byte for a 256-level quantisation). This digitised discrete-time signal consists of different frequency values which represent the audio signal. It must be pre-processed to extract feature vectors that represent individual information for a particular speaker regardless of the content of the speech itself. A learning algorithm generalises these feature vectors for various speakers during training and verifies the speaker identity for a test signal during the test phase.

* Noor Almaadeed
  [email protected]

1 Department of Computer Science and Engineering, College of Engineering, Qatar University, Doha, Qatar

2 Department of Computer Science and Technology, University of Bedfordshire, University Square, Luton LU1 3JU, UK

3 Department of Engineering and Computer Science, University of the West of Scotland, Paisley, UK

J Sign Process Syst (2016) 82:345–356
DOI 10.1007/s11265-015-1005-5


No two digital signals are the same, even with the same speaker and the same set of words, due to the variation in amplitude and pitch in a speaker's voice over different recordings. The environmental noise, the recording equipment, the speed at which the speaker speaks, and the varying psychological and physical states of the speaker increase the complexity of this task. Text-independent identification further requires that the speaker is free to speak any set of words during testing. Therefore, the need arises for a generalised feature extraction strategy to extract text-independent features from a speech signal.

An audio formant refers to the frequency peaks in a speech signal. These peaks appear at different frequencies in a speech signal and are also called resonant frequencies. These frequencies resonate according to the vocal tract of the speaker. Vowel formants refer to the frequencies associated with vowel sounds in a language. It is well known that two or three of the lowest vowel formants are sufficient to distinguish between vowels in most cases [3]. Fundamental frequency estimation is an essential requirement in systems for pitch-synchronous analysis, speech analysis/synthesis and speech coding. It has been reported that fundamental frequency can improve the performance of a speech recognition system for a tonal language [4] and of a speaker identification system [5]. These formants correspond closely to the acoustic resonance frequencies created by a speaker's vocal tract and carry unique information specific to the speaker [6]. The relevance of the individual formants and resonances has been widely studied [7–10]. Vocal resonances are altered by the articulators to form distinguishable vowel sounds, and the peaks in the vowel spectra are the vocal formants. The term formant refers to peaks in the harmonic spectrum of a complex sound. They are usually associated with the formations of the speaker's vocal tract and they are essential components in the intelligibility of speech. The distinguishability of the vowel sounds across vowels as well as across users can be attributed to differences mainly in their first three formant frequencies [11]. An illustration is shown in Fig. 1 for two different speakers (left and right). The tracheal air pressure from the lungs passes through the glottis to create sound with the source spectrum as shown in the topmost plot. The vocal tract response specific to the speaker (the middle plot) attenuates this spectrum and results in the output sound (bottom plot) from which the formants can be extracted.

Figure 1 Resonating frequencies for different speakers. (For each of two speakers, the panels show the source spectrum, the vocal tract response with formants F1, F2 and F3, and the radiated pressure wave, each plotted against frequency.)


The use of the classical Mel-frequency Cepstral Coefficients (MFCC) [12, 13] is likely the most popular feature extraction strategy extensively used thus far for speaker identification systems. However, MFCC does not contain much pitch information. Speech or speaker recognition systems transmit MFCC feature vectors directly to the speech recogniser. Fundamental frequency and the spectral envelope derived from MFCC vectors are two necessary components for speech reconstruction [14]. Because feature vectors are a compact representation, optimised for discriminating between different speech sounds, they contain insufficient information to enable reconstruction of the original speech signal [15]. In particular, valuable speaker information, such as pitch, is lost in the transformation. It is therefore not possible to simply invert the stages involved in MFCC extraction to re-create the acoustic speech signal [12].
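
For reference, a minimal sketch of the classical MFCC front end discussed above is given below. It uses the librosa library as a stand-in (an assumption, not the toolchain of the paper); the file name, sampling rate and frame settings are illustrative.

    import librosa

    # Load a hypothetical 16 kHz mono recording (file name is illustrative).
    signal, sr = librosa.load("speech.wav", sr=16000)

    # 13 cepstral coefficients per ~25 ms frame with a 10 ms hop.
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr),
                                hop_length=int(0.010 * sr))
    print(mfcc.shape)  # (13, number_of_frames)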

Therefore, in order to extract the vowel formants, the standard Linear Predictive Coding (LPC) scheme is used [16]. With LPC, all of the formants are extracted, with each formant characterised by three or more formant frequencies. These formants are filtered with a vowel formant filter that separates the vowel formants from the consonant formants.

During the training phase of the system, after processing the training signals, a vowel formant database is created that stores unique vowel formants for each speaker. To distinguish one speaker from another, vowel formants are tracked in the test file and are compared with the vowel formant database. A score-based scheme is employed that assigns the current signal to the speaker with the highest number of matching formants for the current test signal. This scoring scheme also follows a penalty rule, according to which, if a formant does not match the current vowel in hand from the test file, the speaker of that vowel formant is penalised with a negative score.

MFCC and LPC have both been analysed for comparison in the extraction of vowel formants. It is observed that LPC is more efficient in this task. This method is the most powerful way of estimating formants and is computationally the most efficient [17]. The reasons lie in the close resemblance of this strategy to the human vocal tract. LPC gives a recognition rate higher than MFCC and needs much less computational time. The algorithm to perform LPC on a speech signal is much simpler than that for MFCC, which has many parameters to be adjusted to smooth the spectrum, performing a processing that is similar to that executed by the human ear. But LPC is easily performed by the least squares method using a set of recursive formulae [18].

For identification, the proposed score-based strategy has been compared with the Back Propagation Neural Network (BPNN) [19] and the Gaussian Mixture Model (GMM) [20]. GMM is a robust model for text-independent speaker identification, as reported in [20]. A GMM is a parametric learning model and it assumes that the process being modeled has the characteristics of a Gaussian process whose parameters do not change over time. When employing GMM over a segmented stream of speech signal, it is important to assume that the frames are independent. This is a reasonable assumption since, generally, text-independent systems are modeled as statistical speech parameter distribution models, which use a GMM as the model of each speaker as well as the universal background model (UBM) [20–22].

Although speech is non-stationary, it can be assumed quasi-stationary and be processed through short-time Fourier analysis. In speech processing, the short-time magnitude spectrum is believed to contain most of the information about speech intelligibility. The duration of the Hamming window function is an important choice. When making the quasi-stationarity assumption, we want the speech analysis segment to be stationary. We cannot make the speech analysis window too large, because the signal within the window will become non-stationary. On the other hand, making the window duration too small also has its disadvantages. If it is too small, then the frame shift decreases and thus the frame rate increases. This means we will be processing a lot more information than necessary, thus increasing the computational complexity. Also, making the window duration small will cause the spectral estimates to become less reliable due to the stochastic nature of the speech signal. Finally, a pitch pulse (typically with a frequency between 80 and 500 Hz) usually occurs every 2 to 12 ms. If the duration of the analysis window is smaller than the pitch period, then the pitch pulse may or may not be present. Hence, the shape and duration of the Hamming window is an important design criterion.
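
A minimal sketch of this short-time framing step is given below; the 25 ms Hamming window, 10 ms shift and 16 kHz sampling rate are illustrative assumptions rather than values taken from the paper.

    import numpy as np

    def frame_signal(signal, sr=16000, frame_ms=25, shift_ms=10):
        """Split a speech signal into overlapping Hamming-windowed frames."""
        frame_len = int(sr * frame_ms / 1000)
        shift = int(sr * shift_ms / 1000)
        window = np.hamming(frame_len)
        n_frames = max(0, 1 + (len(signal) - frame_len) // shift)
        frames = np.empty((n_frames, frame_len))
        for i in range(n_frames):
            start = i * shift
            frames[i] = signal[start:start + frame_len] * window
        return frames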

One of the most common classes of neural networks is the feed-forward network [19]. Back propagation refers to a common method by which these networks can be trained. Training is the process by which the weight matrix of a neural network is adjusted automatically to produce desirable results. Though back propagation is commonly used with feed-forward neural networks, it is by no means the only training method available for the feed-forward neural network. Back propagation works by calculating the overall error rate of a neural network. The output layer is then analysed to see the contribution of each of the neurons to that error. The neurons' weights and threshold values are then adjusted, according to how much each neuron contributed to the error, to minimise the error next time. There are mainly two training parameters, the learning rate and the momentum, that can be passed to the back propagation algorithm to customise its output. The weights in a feed-forward neural network are adjusted according to the square errors between the actual outputs and the desired outputs. For a BPNN, these errors are propagated layer by layer into the input layer in the backward direction. The training input is passed through the network a number of times to adjust the weights accordingly. The iterative process of training the network requires multiple passes through the network to train it correctly. A BPNN lets us recognise complex patterns and supports any number of training epochs to produce a learnt classifier for the unseen data. But this procedure has a cost associated with how much learning is required to perform classification within a predictable accuracy on the unseen data. Learning, although it can be a great way of improving performance, requires computational resources. BPNN not only requires a long training time but also a huge number of instances/patterns to become fully trained for classification of the unseen data. It is observed that our score-based strategy outperforms both BPNN and GMM.
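
As an illustration of such a back-propagation-trained feed-forward classifier, the sketch below uses scikit-learn's MLPClassifier as a stand-in (an assumption; it is not the network used in the paper). It exposes the two training parameters mentioned above, the learning rate and the momentum; the feature vectors and speaker labels are random placeholders.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(200, 3))   # placeholder (F1, F2, F3) feature vectors
    y_train = rng.integers(0, 10, 200)    # placeholder speaker labels

    bpnn = MLPClassifier(hidden_layer_sizes=(64,),
                         solver="sgd",            # plain back propagation (gradient descent)
                         learning_rate_init=0.01, # learning rate
                         momentum=0.9,            # momentum term
                         max_iter=500)
    bpnn.fit(X_train, y_train)
    print(bpnn.predict(X_train[:5]))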

This paper is organised into five sections. Section 2 describes the scheme of feature extraction through vowel formants and the steps to create the formant database. Section 3 highlights the steps involved in the score-based scheme for speaker identification. Section 4 compares the performance results of these approaches on different speech databases, and finally, Section 5 concludes with a discussion of the effectiveness of the proposed scheme.

2 Feature Extraction

2.1 Formant Extraction Through LPC

The fundamental idea behind speech formants is the assumption that an audio signal is produced by a buzzer at the end of a tube that closely resembles the actual means of sound production in humans. The glottal portion produces the sound with the help of our breath pressure and acts as the buzzer, whereas the human vocal tract combined with the mouth constitutes the tube.

Audio speech can be fully described by the combination of its frequency graph and its loudness [1]. With this assumption, together with the vocal tract and mouth comprising the tube, the human voice is considered to consist of resonating frequencies called formants [23, 24]. LPC processes a signal in chunks or frames (20–30 ms) to extract these resonating frequencies or formants from the remainder of the noisy signal through inverse filtering [2, 25]. LPC analyses the speech signal by estimating the formants, removing their effects from the speech signal, and estimating their intensity and frequency. The process of removing the formants is called inverse filtering, and the remaining signal after the subtraction of the filtered modeled signal is called the residue.

For a given input sample x[n] and an output sample y[n], the next output sample y'[n] can be predicted with the following equation,

    y'[n] = \sum_{k=0}^{q} a_k x[n-k] + \sum_{k=1}^{q} b_k y[n-k]        (1)

The coefficients a_k and b_k above correspond to the linear predictive coefficients. The difference between the predicted sample and the actual sample is called the prediction error, given as,

    e[n] = y[n] - y'[n]        (2)

and hence, y[n] can be written as,

    y[n] = e[n] - \sum_{k=1}^{q} b_k y[n-k]        (3)

The linear predictive coefficients b_k are estimated using an autocorrelation method that minimises the error using least-squares error reduction [26, 27].
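
A sketch of this autocorrelation-based estimation is shown below, using SciPy's Toeplitz solver; the predictor order of 12 is an assumption, and the sign convention of the returned coefficients follows the usual all-pole formulation rather than the exact notation of Eqs. (1)–(3).

    import numpy as np
    from scipy.linalg import solve_toeplitz
    from scipy.signal import lfilter

    def lpc_autocorrelation(frame, order=12):
        """Estimate LP coefficients of one windowed frame (autocorrelation method)."""
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        # Solve the Toeplitz normal equations R a = [r1 ... rp] by least squares.
        return solve_toeplitz((r[:order], r[:order]), r[1:order + 1])

    def lpc_residual(frame, a):
        """Inverse filtering: e[n] = s[n] - sum_k a_k s[n-k] (the LPC residue)."""
        return lfilter(np.concatenate(([1.0], -a)), [1.0], frame)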

2.2 Vowel Formant Filtering

In human speech, there are consonant formants and vowel formants, and there are noise reverberations. Of all of these sound types, we are only interested in the vowel formants. There are twelve vowel formant sounds in the English language, as concluded by a study at the Dept. of Phonetics and Linguistics, University College London [28]. These vowel formants, together with their first, second, and third formant frequency ranges, are listed in Table 1. The vowel formants are filtered using these ranges.
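
The filtering itself can be sketched as a simple range check against Table 1: a candidate (F1, F2, F3) triple is accepted as a given vowel only if each formant lies within the tabulated mean plus or minus a few standard deviations. The window width (two standard deviations here) is an assumption, and only two table rows are reproduced for brevity.

    # Two rows of Table 1; each vowel maps to (mean, std) in Hz for F1, F2, F3.
    VOWEL_TABLE = {
        "/i/": [(285, 46), (2373, 166), (3088, 217)],
        "/I/": [(356, 54), (2098, 111), (2696, 132)],
    }

    def match_vowel(f1, f2, f3, k=2.0):
        """Return the vowel whose ranges contain all three formants, else None."""
        for vowel, stats in VOWEL_TABLE.items():
            if all(abs(f - mean) <= k * std
                   for f, (mean, std) in zip((f1, f2, f3), stats)):
                return vowel
        return None  # likely a consonant formant or noise

    print(match_vowel(300, 2400, 3000))  # -> "/i/"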

Vowel formants and frequencies were first exhaustively studied and formulated by J. C. Wells in the early sixties [23]. This was one of the few approaches researchers in the speaker identification field started investigating. In speech synthesis [24, 29], digital filters are often used to simulate formant filtering by the vocal tract. It is well known [30] that the different vowel sounds of speech can be simulated by passing a "buzz source" through only two or three formant filters. In principle, the formant filter sections are in series, as found by deriving the transfer function of an acoustic tube [31].

The basic process of vowel formant extraction is shown in Fig. 2. The speech signal s(n) is first modulated by a windowing function w(n). The modulation is typically amplitude modulation (i.e. multiplication) and the most commonly used windowing function is the Hamming window. The resultant signal x(n) is fed to the LPC [32]. After performing linear predictive coding, the LP coefficients α = [α_1, α_2, α_3, …, α_p] are computed such that the error e(n) = s(n) - \sum_{k=1}^{p} α_k s(n-k) is minimised. The vector α is appended with zeros and the spectral envelope is extracted using the discrete Fourier transform (DFT), from which the peak signal is detected. The formants can then be acquired upon analysing the peak signal amplitude and frequency. As described in Section 2.1, LPC performs an inverse filtering to remove the formants and extract the residue signal. It effectively synthesises the speech signal by reversing its process of formation as depicted in Fig. 1. The synthesised speech signal can then be examined to find the resonance peaks from the filter coefficients as well as the prediction polynomial. This is a robust method since it allows the extraction of formant parameters through simple peak detection as shown in Fig. 2.
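
One common way to realise this peak detection is to read the resonances directly from the roots of the prediction polynomial, as sketched below; the bandwidth and low-frequency cut-offs are assumptions used to discard spurious roots.

    import numpy as np

    def formants_from_lpc(a, sr=16000, max_bandwidth=400.0, min_freq=90.0):
        """Estimate F1-F3 (Hz) from the LP coefficients a_1..a_p of one frame."""
        # Prediction polynomial A(z) = 1 - a_1 z^-1 - ... - a_p z^-p.
        roots = np.roots(np.concatenate(([1.0], -a)))
        roots = roots[np.imag(roots) > 0]             # one root per conjugate pair
        freqs = np.angle(roots) * sr / (2 * np.pi)    # resonance frequencies (Hz)
        bws = -np.log(np.abs(roots)) * sr / np.pi     # approximate bandwidths (Hz)
        formants = sorted(f for f, b in zip(freqs, bws)
                          if b < max_bandwidth and f > min_freq)
        return formants[:3]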

There are several methods of filtering vowel formants. It can be done by passing the signal through a series of bandpass filters in the audio frequency domain and systematically varying the filter width and slope [33, 34], by low-pass or high-pass filtering in the temporal modulation domain [35, 36], or by varying the number of audio-frequency channels in the context of cochlear implant simulations [37, 38]. In [36], it has been demonstrated that both low-pass and high-pass filtering in the temporal modulation domain were analogous to a uniform reduction in the spectral modulation domain.

Figure 3 shows a high-level flow chart for vowel extraction from a speech signal. Vowels are highly periodic and have distinctive Fourier representations. We passed the test samples through an auto-regressive filter, and then calculated the formant frequencies from the spectral envelope of the LPC-filtered vowel. The purpose of the auto-regressive model on each window is to get the transfer function of the vocal tract and output the spectral envelope of each voice sample. The next step is to filter out the consonants by checking the frequency response of the LP filter representing the consonant sounds. Consonants usually have significantly lower magnitudes than vowel sounds. A smoother is then used to eliminate anomalies and then output each vowel, from which the formants are extracted.
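
The consonant-rejection and smoothing steps are described only qualitatively; one possible realisation (an assumption, not the authors' exact rule) is to keep frames whose LP spectral-envelope peak is strong relative to the strongest frame and to median-smooth the frame-wise decision:

    import numpy as np
    from scipy.signal import freqz, medfilt

    def vowel_frame_mask(frames, lpc_coeffs, rel_threshold=0.5, smooth=5):
        """Boolean mask of frames kept as vowel candidates."""
        peaks = []
        for frame, a in zip(frames, lpc_coeffs):
            gain = np.sqrt(np.mean(frame ** 2))                 # frame energy as filter gain
            _, h = freqz([gain], np.concatenate(([1.0], -a)), worN=256)
            peaks.append(np.max(np.abs(h)))                     # LP envelope peak
        peaks = np.asarray(peaks)
        mask = peaks > rel_threshold * peaks.max()              # drop weak (consonant-like) frames
        return medfilt(mask.astype(float), smooth) > 0.5        # smooth out isolated anomalies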

2.3 Vowel Database Construction

Vowel formants individually lie in specific frequency ranges, but every speaker has a unique vocal tract and produces vowel formants that are unique. During the training phase, the system is presented with speech files produced by different speakers. The speech files for each speaker are preprocessed with LPC, and subsequently, these formants are filtered to extract only the vowel formants. These vowel formants are saved under each speaker's name in a database. This database is a Matlab [39] file to be used during the testing phase of the system.
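
Since the database is simply a Matlab file of per-speaker vowel formants, it can be sketched with scipy.io.savemat; the file name and array layout below are assumptions.

    import numpy as np
    from scipy.io import savemat, loadmat

    # Each speaker maps to rows of (vowel_index, F1, F2, F3); values are placeholders.
    database = {
        "speaker_01": np.array([[0, 290.0, 2360.0, 3100.0],
                                [3, 750.0, 1740.0, 2450.0]]),
        "speaker_02": np.array([[0, 310.0, 2410.0, 3050.0]]),
    }
    savemat("vowel_formant_db.mat", database)

    # During the testing phase the database is simply reloaded:
    db = loadmat("vowel_formant_db.mat")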

Table 1 Vowel formant frequencies in the English language [28].

Vowel   Formant   Mean frequency (Hz)   Std. dev. (Hz)
/i/     1           285                  46
        2          2373                 166
        3          3088                 217
/I/     1           356                  54
        2          2098                 111
        3          2696                 132
/E/     1           569                  48
        2          1965                 124
        3          2636                 139
/æ/     1           748                 101
        2          1746                 103
        3          2460                 123
/A/     1           677                  95
        2          1083                 118
        3          2340                 187
/Q/     1           599                  67
        2           891                 159
        3          2605                 219
/O/     1           449                  66
        2           737                  85
        3          2635                 183
/U/     1           376                  62
        2           950                 109
        3          2440                 144
/u/     1           309                  37
        2           939                 142
        3          2320                 141
/V/     1           722                 105
        2          1236                  70
        3          2537                 176
/3/     1           581                  46
        2          1381                  76
        3          2436                 231

Figure 2 Process of formant extraction.

Figure 3 High-level flow chart for vowel extraction from a speech signal.


3 Score-Based Scheme for Speaker Identification

The testing phase of the system requires the test signal to be preprocessed with LPC and vowel formant filtering to extract the unique vowel formants along with their first, second and third formant frequencies, to be compared with the vowel formant database constructed in the training phase of the system.

As a preliminary effort, different strategies for comparing the test vowel formants with the known vowel formants in the database were tested. These strategies included the following:

1. Both first and second formants,
2. All first, second and third formants,
3. Both first and third formants,
4. Both second and third formants, and
5. Averaging and comparison with least distance.

Extensive testing of these enumerated schemes against known results revealed that these strategies are not powerful enough to yield a good accuracy, as vowel formants for the same vowel often overlap in different speakers. Sometimes only one of the three formant values overlaps, and sometimes two values overlap, with the only difference being in the frequency of the third formant. This challenging complexity is attributed to the text-independent nature of the system, in which we have a speaker speaking the same vowel but in a different word with a slightly different formant track.

To handle this type of situation, a score-based scheme was conceived that awards a positive score if all three formants are matched and penalises a speaker with a negative score otherwise. For a given speech signal, the three test formants are compared against the vowel formants stored in the database for each speaker. For example, if D_{x,k,1} is the first formant frequency of vowel formant k stored in the database for speaker x and T_{k,1} is the first formant frequency of k in the test speech, we conclude that they are "matched" when the difference between the two, i.e. T_{k,1} - D_{x,k,1}, is below a certain threshold ε_{k,1}. In this case, we say that the difference diff(T_{k,m} - D_{x,k,m}) is zero. It is almost impossible to get an absolute zero difference. Therefore, this thresholding is an indirect form of quantisation. The vowels had to be first recognised before performing this comparison, the method of which is detailed in Section 2.2.

This quantity is computed for all three frequencies D_{x,k,1}, D_{x,k,2}, D_{x,k,3}, and the test score is,

    \sum_{m=1}^{3} diff(T_{k,m} - D_{x,k,m}) = 0  →  Score(S_{k,x}) = 1        (4)

    \sum_{m=1}^{3} diff(T_{k,m} - D_{x,k,m}) > 0  →  Score(S_{k,x}) = -1        (5)

The thresholds ε_{k,m} are selected experimentally. These two scores aid in calculating the net score of each speaker x against the test vowel as,

    Identification(k) = arg max { Score(S_{k,x}) }    for k = 1, …, n        (6)
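
A direct sketch of Eqs. (4)–(6) is given below: every vowel found in the test signal is compared with each speaker's stored entry for that vowel, a score of +1 is awarded when all three formant differences fall under the thresholds and -1 otherwise, and the speaker with the highest net score is returned. The threshold values are illustrative assumptions.

    def identify_speaker(test_formants, database, eps=(50.0, 120.0, 180.0)):
        """Score-based identification of Eqs. (4)-(6).

        test_formants: list of (vowel, (F1, F2, F3)) found in the test signal.
        database: {speaker: {vowel: (F1, F2, F3)}} built during training.
        """
        scores = {speaker: 0 for speaker in database}
        for vowel, t in test_formants:
            for speaker, vowels in database.items():
                if vowel not in vowels:
                    continue
                d = vowels[vowel]
                matched = all(abs(t[m] - d[m]) <= eps[m] for m in range(3))
                scores[speaker] += 1 if matched else -1   # Eqs. (4) and (5)
        return max(scores, key=scores.get)                # Eq. (6)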

The proposed system consists of a formant extraction component coupled with a vowel formant filtering component and the formant database, as shown in Fig. 4.

Our aim is to identify which acoustic parameters of the vowels (formants) depend more on the individual characteristics of the speaker and less on the linguistic variables. According to the literature, the higher formants (F3 and F4) usually convey individual information, while F1 and F2 are dependent on vowel quality [40–42]. Fundamental frequency (F0) is the most complex acoustic cue, being related in many languages to vowel quality. All these formants can play an important role in speaker identification to different extents [43, 44].

A neural network consists of multiple perceptrons combined in multiple layers, beginning with the input layer, followed by one or more hidden layers and ending at the output layer. Each perceptron, which has multiple inputs, has a weight vector associated with it. This weight vector pertains to its set of inputs, and the weights across all perceptrons are adjusted during training to map the training samples to the known target concepts. During the training phase, feature vectors extracted from the training data are fed into each of the networks in parallel over multiple epochs. Once the training is complete, they require only one pass of the input data to get the output. The size (i.e. number of neurons) of the input layer is equal to the number of LPC features. Each neuron takes in streams of data as inputs that arise from the consecutive frames. Some of the advanced neural networks have the size of the input layer enlarged to two or three adjacent frames [45] in order to get a better context dependency for the acoustic feature vectors. The number of input neurons can also be chosen by multiplying the cepstral order with the total frame number [46], leading to an extremely large input layer size. But in both the above cases, the computational times are affected due to the increased number of hidden layers and states. If an inadequate number of neurons is used, the network will be unable to model complex data, and the resulting fit will be poor. If too many neurons are used, the training time may become excessively long [45]. In addition, the network may overfit the data due to the large number of hidden nodes. If overfitting occurs, the network simply starts memorising the training data, thereby causing poor generalisation. We have performed some simulations to test the effect of changing the size of the neural network input layer. A higher number of input layer neurons causes the number of hidden layer neurons (and therefore the number of states) to go up, which eventually increases the identification and training times. As for performance accuracy, there was no significant change observed, since the NNs are designed to utilise the closed set of data in the most optimal manner. Hence, we concluded that the size of the input layer is best set equal to the number of LPC features.

Figure 4 Process flow diagram for the proposed system for speaker identification. (Blocks: Test Signal → LPC-based Formant Extraction → Vowel Formant Filtering → Score-based Comparison with the Vowel Formant Database → Identification Result.)

4 Results and Analysis

In this section, we compare the score-based scheme proposed in this paper against BPNN and GMM using several databases, namely YOHO, NIST, TI_digits1 and TI_digits2.

The YOHO database [47] contains a large-scale, high-quality speech corpus to support text-dependent speaker authentication research, such as is used in secure access technology. The data was collected in 1989 by ITT under a US Government contract. The number of trials is sufficient to permit evaluation testing at high confidence levels. In each session, a speaker was prompted with a series of phrases to be read aloud. Each phrase was a sequence of three two-digit numbers. The NIST speech databases are part of an ongoing series of evaluations conducted by NIST [48]. The telephone speech in this corpus is predominantly English, but also includes other languages. All interview segments are in English. Telephone speech represents approximately 368 h of the data, whereas microphone speech represents the other 574 h. TI_digits1 and TI_digits2 [49] contain speech which was originally designed and collected at Texas Instruments, Inc. for the purpose of designing and evaluating algorithms for speaker-independent recognition of connected digit sequences. There are 326 speakers, each pronouncing 77 digit sequences. Each speaker group is partitioned into test and training subsets. For the training set, we picked a total of 25–30 s of speech per speaker. For some longer speech files, just one file was enough for the training required for one speaker, whereas, for shorter file sizes, multiple files were used (e.g. 20 speech files each 3 s long). We used as many users' data as the database would have, so that we have a good estimate over a large population. Similarly, after separating the training files, we used all the remaining files for testing purposes. The speech segments that we selected for the training phase did not seem to have any noticeable effect on the performance. We tested this by choosing different sets of training samples randomly and obtained somewhat similar results at the end. We also varied the total size of the training samples. Below 25 s, the training was not sufficient, but increasing above this point barely made any difference.

We first present some results comparing GMM and GMM-UBM to support the choice of GMM-UBM in the subsequent experiments. The number of mixtures in the Gaussian mixture is an important parameter when employing GMM or GMM-UBM. Gaussian mixtures are combinations of a finite number of Gaussian distributions. They are used to model complex multi-dimensional distributions upon learning the parameters of the mixture through various methods. A mixture of Gaussians can be written as a weighted sum of Gaussian densities, which increases the number of distributions incorporated [50, 51]. The use of Gaussian mixture models for modeling speaker identity is motivated by the interpretation that the Gaussian components represent some general speaker-dependent spectral shapes and by the capability of Gaussian mixtures to model arbitrary densities [22]. The number of mixtures can dictate the modeling capability of a GMM. Therefore, in Fig. 5, we show how this number affects the accuracy of the speaker identification system. GMM-UBM clearly outperforms GMM when used for analysing vowel formants. However, Table 2 shows that the training and identification times are somewhat higher in the case of GMM-UBM due to the added complexity. Nevertheless, because of its superior performance, we choose GMM-UBM as one of the baseline schemes against which we compare our proposed scheme. We use a single-input implementation for GMM. The above results were obtained using the YOHO database.
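
For comparison, a per-speaker GMM baseline can be sketched with scikit-learn as below; the full GMM-UBM used in the paper additionally trains a universal background model and adapts it to each speaker, which is not reproduced here. The choice of 12 mixtures follows the optimum reported below, while the feature arrays are random placeholders.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    train = {f"spk{i}": rng.normal(loc=i, size=(300, 3)) for i in range(5)}
    test_frames = rng.normal(loc=2, size=(50, 3))      # frames of one test utterance

    models = {spk: GaussianMixture(n_components=12, covariance_type="diag",
                                   reg_covar=1e-3).fit(feats)
              for spk, feats in train.items()}

    # The identified speaker maximises the average log-likelihood of the test frames.
    print(max(models, key=lambda spk: models[spk].score(test_frames)))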

Figure 5 Accuracy of GMM and GMM-UBM with a varying number of Gaussian mixtures (accuracy in % versus the number of Gaussian mixtures).


From the above experiments, we also conclude that increasing the number of Gaussian mixtures indefinitely does not necessarily increase the system accuracy. Upon further investigation, we found that many of the mixtures could be reduced to single points, as they did not have enough values to carry on further computation. The above experiment shows that 12 Gaussian mixtures provide the optimum accuracy for the vowel formants.

We next present the results for all four databases and compare our proposed scheme to both BPNN and GMM-UBM. The performance accuracy results have been averaged and are summarised in Table 3.

The same vowel formants were analysed with BPNN [52] for identification against the same training files and their extracted formants for comparison with the proposed score-based scheme. The same training vowel formants were also supplied as inputs to GMM-UBM, and the test sets were tested against these mixtures. The experiments revealed that the vowel formants with the score-based strategy are not only more accurate in identification but also more scalable, as highlighted next.

An identification algorithm is critically evaluated for its accuracy against the test data for a number of speakers. The context of evaluation becomes more critical if the algorithm aims to be applicable to industry devices for biometric security and identity management [53]. However, as the number of speakers increases, traditional algorithms start declining in accuracy. This trend has been an important consideration during the design and testing of the current system to ensure that it is a scalable model. The performance graph in Fig. 6 shows the performance statistics as the number of speakers is gradually increased from 10 to 110 in increments of 10. It is to be noted that both GMM and BPNN start decreasing in accuracy as the number of speakers increases, whereas the proposed score-based scheme shows a fairly stable accuracy that is barely affected by increasing the number of speakers.

Next, we present the receiver operating characteristic (ROC) curve for the three algorithms using the YOHO database in Fig. 7. It shows the distribution of the area under the curve when plotted with the results of the tests. It clearly shows that the maximum area is covered by the score-based scheme as compared to the other schemes.

During these tests, the identification and training times for the score-based strategy were also observed, as shown in Tables 4 and 5. Although BPNN requires less identification time than the proposed score-based scheme for most databases, it has a much higher training time and poor accuracy.

It is to be observed that the score-based scheme does not require any training other than saving the filtered vowel formants in the database, which in this case is a Matlab file. Note that the identification time does not include the preprocessing time with LPC and vowel formant filtering, as that time is considered to be common for all of the algorithms tested. If the number of speakers is increased, the identification time is expected to increase for any algorithm. As shown in Fig. 8, our proposed scheme has much lower identification times compared to the other schemes.

Table 2 Training and identification times (sec) for GMM and GMM-UBM with different numbers of speakers.

                          GMM             GMM-UBM
Number of speakers        20     100      20     100
Training time             68     251      110    430
Identification time       2.4    7.8      3.6    11.5

Table 3 Performance comparison (percentage accuracy) of the proposed score-based strategy, BPNN and GMM-UBM algorithms for different databases.

Database     Score-based scheme formants   Formants with BPNN   Formants with GMM-UBM
YOHO         94.23 %                       62.14 %              75.84 %
NIST         92.15 %                       54.51 %              72.73 %
TI_digits1   96.87 %                       57.58 %              69.42 %
TI_digits2   97.34 %                       59.12 %              73.59 %

Figure 6 Performance statistics (percentage accuracy) of the three algorithms with a varying number of speakers.

Figure 7 ROC curve for the three algorithms for YOHO (true positive rate / sensitivity versus false positive rate / 1 - specificity).



Next, we present the effect of the length of the speech files. The databases we used have different lengths for speech files. By trimming them down to 3, 2 and 1 s and testing our system using these speech signals, we obtained the accuracy results presented in Table 6. As expected, the longer speech segments provide the best results. All other results presented in this paper are based on 3 s long speech signals.

Finally, we present some results on using different numbers of formants. Using more formants does not necessarily increase the overall accuracy rates. Usually, additional formants provide a tighter threshold for comparison purposes and help reduce the false accept rate (FAR). But they also cause the false reject rate (FRR) to increase, thereby reducing the overall accuracy rate. This phenomenon is demonstrated in Fig. 9, where the FAR and FRR results are presented for 3, 4 and 5 formants. Moreover, using more formants drastically increases the training and identification times of the algorithm, thereby reducing the scope of the speaker identification system for many time-stringent applications.

The results presented in this section clearly show that the proposed scheme outperforms the BPNN and GMM classifiers. Both these classifiers are widely used and regarded as efficient schemes in many speaker identification implementations. However, in a vowel formant based scheme, they fail to perform at the desired level for the following reasons. BPNN has the problem of entrapment in local minima, and the network should be trained with different initial values until the best result is achieved. The number of hidden layers and the number of neurons in each layer also need to be determined. If the number of layers or neurons is inadequate, the network may not converge during the training; if the number of layers or neurons is chosen to be too high, this will diminish the effectiveness of the network operation. This often causes the algorithm to be biased by a specific resonant frequency while completely disregarding the others. The proposed score-based scheme tackles this problem better by exploiting information received from all frequencies: it allows the speaker with the highest number of matching formants to own the current signal. Furthermore, we choose LPC as the accompanying feature extraction strategy of our novel scheme, which is the best strategy due to its resemblance to the functioning of the human vocal tract. Another difficulty of BPNN lies in its use of the back propagation algorithm, which is too slow for practical applications, especially if many hidden layers are employed. The appropriate selection of training parameters in the BP algorithm is sometimes difficult. As for the GMM algorithm, its main limitation is that it can fail to work if the dimensionality of the problem is too high. This causes the GMM to suffer badly when the number of speakers increases. Another disadvantage of the GMM algorithm is that the user must set the number of mixture models that the algorithm will try to fit to the training dataset. In many instances the user will not know how many mixture models should be used and may have to experiment with a number of different mixture models in order to find the most suitable number of models that works for their classification problem.

Table 4 A comparison of the average training time (sec) for different databases.

Database     Score-based scheme   Formants with BPNN   Formants with GMM
YOHO         5.6                  72.5                 110
NIST         7.5                  84.9                 135.1
TI_digits1   4.5                  68.4                 85.6
TI_digits2   4.4                  67.4                 74.5

Table 5 A comparison of the average identification time (sec) for different databases.

Database     Score-based scheme   Formants with BPNN   Formants with GMM
YOHO         0.15                 0.08                 3.6
NIST         0.21                 0.25                 4.3
TI_digits1   0.14                 0.12                 2.8
TI_digits2   0.14                 0.11                 2.9

Figure 8 Identification times (sec) of the three algorithms with a varying number of speakers for YOHO.

Table 6 Accuracy (%) vs. speech file length (across databases and/or trimmed speech versions).

Database     1 s      2 s      3 s
YOHO         83.78    89.46    94.23
NIST         72.54    85.28    92.15
TI_digits1   78.56    91.25    96.87
TI_digits2   80.12    91.89    97.34

Figure 9 FAR and FRR for different numbers of formants (false accept and false reject rates versus normalised threshold, for 3, 4 and 5 formants).

5 Conclusions

Biometric authentication is a multi-disciplinary problem and requires sound knowledge of machine learning, pattern recognition, digital signal processing, image processing and several other overlapping fields such as artificial intelligence and statistics. The objective of this research is to investigate the problem of identifying a speaker from his or her voice regardless of the content (text-independent). This paper investigates the combination of LPC-based vowel formants with a score-based identification strategy. For comparison, two other combinations of LPC-based vowel formants have been tested, with BPNN and with GMM. Comprehensive testing on the YOHO, NIST, TI_digits1 and TI_digits2 databases reveals that the proposed scheme outperforms the BPNN- and GMM-based schemes. It has been observed that the proposed scheme requires very little training time other than creating a small database of vowel formants. Therefore, the proposed scheme is time-wise more efficient as well. The results also show that increasing the number of speakers makes it difficult for BPNN and GMM to sustain their accuracy. Both of these models start losing accuracy, whereas the proposed score-based methodology remains much more stable, making it scalable and suitable for large-scale implementations. In the future, we want to continue further with the current approach to speaker identification and combine it with real-time face recognition to make it more robust and applicable for industry usage. We aim to combine audio and visual features as a feature-level fusion in multi-modal neural networks to further improve the accuracy through the use of two biometric features.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

References

1. Kinsner, W., & Peters, D. (1988). A speech recognition system using linear predictive coding and dynamic time warping. Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 3, 1070–1071. doi:10.1109/IEMBS.1988.94689.

2. Campbell, J. P., Jr., Tremain, T. E., & Welch, V. C. (1990). The proposed federal standard 1016 4800 bps voice coder. Speech Technology, 5, 58–64.

3. Aalto, D., Huhtala, A., Kivela, A., Malinen, J., Palo, P., Saunavaara, J., & Vainio, M. (2012). Vowel formants compared with resonances of the vocal tract.

4. Chang, E., Zhou, J., Di, S., Huang, C., & Lee, K. F. (2000). Large vocabulary mandarin speech recognition with different approaches in modeling tones. Proceedings of ICSLP.

5. Martin, A., & Przybocki, M. (1999). The NIST 1999 speaker recognition evaluation – an overview. Digital Signal Processing, 10(1), 1–18.

6. Kivela, A., Kuortti, J., & Malinen, J. (2013). Resonances and mode shapes of the human vocal tract during vowel production. Proceedings of the 26th Nordic Seminar on Computational Mechanics.

7. Sambur, M. (1975). Selection of acoustic features for speaker identification. IEEE Transactions on Acoustics, Speech, and Signal Processing, 23(2), 176–182.

8. Rose, P. (2006). Technical forensic speaker recognition: evaluation, types and testing of evidence. Computer Speech & Language, 20(2), 159–191. Elsevier.

9. Becker, T., Jessen, M., & Grigoras, C. (2008). Forensic speaker verification using formant features and Gaussian mixture models. Interspeech, 1505–1508.

10. McDougall, K., & Nolan, F. (2007). Discrimination of speakers using the formant dynamics in British English. Proceedings of the 16th International Congress of Phonetic Sciences, (pp. 1825–1828).

11. Grieve, J., Speelman, D., & Geeraerts, D. (2013). A multivariate spatial analysis of vowel formants in American English. Journal of Linguistic Geography, 1(1), 31–51. Cambridge Univ. Press.

12. Shao, X., & Milner, B. (2004). Pitch prediction from MFCC vectors for speech reconstruction. Proc. of the 2004 International Conference on Acoustics, Speech, and Signal Processing, (pp. 97–100). Montreal.

13. Afify, M., & Siohan, O. (2007). Comments on vocal tract length normalisation equals linear transformation in cepstral space. IEEE Transactions on Audio, Speech and Language Processing, 15(5), 1731–1732.

14. Chazan, D., Cohen, G., Hoory, R., & Zibulski, M. (2000). Speech reconstruction from Mel frequency cepstral coefficients and pitch. Proceedings of ICASSP.


15. Shao, X., & Milner, B. (2004). MAP prediction of pitch from MFCC vectors for speech reconstruction. Proc. ICSLP.

16. Atal, B. S. (2006). The history of linear prediction. IEEE Signal Processing Magazine, 23(2), 154–161.

17. Aliyu, A. O., Adewale, E. O., & Adetunmbi, O. S. (2013). Development of a text-dependent speaker recognition system. Journal of Development, 69(16).

18. Das, B. P. (2012). Recognition of isolated words using features based on LPC, MFCC, ZCR and STE, with neural network classifiers. Jadavpur University Kolkata.

19. Yuanyou, X., Yanming, X., & Ruigeng, Z. (1997). An engineering geology evaluation method based on an artificial neural network and its application. Engineering Geology, 47(1), 149–156.

20. Reynolds, D. A., Quatieri, T. F., & Dunn, R. B. (2000). Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, 10(3), 19–41.

21. Wagner, M. (2012). Speaker identification using glottal-source waveforms and support-vector-machine modelling. SST 2012. Sydney: Macquarie University.

22. Reynolds, D. A., & Rose, R. C. (1995). Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Transactions on Speech and Audio Processing, 3(1), 72–83.

23. G. S. University. (2013). http://hyperphysics.phyastr.gsu.edu/hbase/music/vowel2.html. Hyperphysics education website. Accessed 12 Dec 2013.

24. Flanagan, J. L., & Rabiner, L. R. (1973). Speech synthesis. Stroudsburg: Dowden, Hutchinson, and Ross, Inc.

25. Rabiner, L. R., & Schafer, R. W. (1978). Digital processing of speech signals (Prentice-Hall signal processing series). Englewood Cliffs: Prentice-Hall.

26. Benesty, J., Sondhi, M. M., & Huang, Y. (2008). Springer handbook of speech processing. Berlin: Springer.

27. Ramachandran, R., Zilovic, M. S., & Mammone, R. (1995). A comparative study of robust linear predictive analysis with application to speaker identification. IEEE Transactions on Speech and Audio Processing, 3(2), 117–125.

28. Rabiner, L. R., & Juang, B. (1993). Fundamentals of speech recognition. Englewood Cliffs: Prentice Hall.

29. Wells, J. C. (1962). A study of the formants of the pure vowels of British English. (Unpublished master's thesis). University of London.

30. Keller, J. B. (1953). Bowing of violin strings. Communications on Pure and Applied Mathematics, 6, 483–495.

31. Fant, G. (1960). Acoustic theory of speech production. The Hague: Mouton.

32. Markel, J. D., & Gray, A. H. (1976). Linear prediction of speech. New York: Springer Verlag.

33. Keurs, M. T., Festen, J. M., & Plomp, R. (1993). Effect of spectral envelope smearing on speech reception II. The Journal of the Acoustical Society of America, 93, 1547.

34. Baer, T., & Moore, B. C. J. (1994). Effects of spectral smearing on the intelligibility of sentences in the presence of interfering speech. The Journal of the Acoustical Society of America, 95, 2277.

35. Drullman, R., Festen, J. M., & Plomp, R. (1994). Effect of reducing slow temporal modulations on speech reception. The Journal of the Acoustical Society of America, 95, 2670.

36. Drullman, R., Festen, J. M., & Houtgast, T. (1996). Effect of temporal modulation reduction on spectral contrasts in speech. The Journal of the Acoustical Society of America, 99, 2358.

37. Fu, Q. J., & Nogaki, G. (2005). Noise susceptibility of cochlear implant users: the role of spectral resolution and smearing. Journal of the Association for Research in Otolaryngology, 6(1), 19–27.

38. Liu, C., & Fu, Q. J. (2007). Estimation of vowel recognition with cochlear implant simulations. IEEE Transactions on Biomedical Engineering, 54(1), 74–81.

39. Mathworks. (2013). http://www.mathworks.com. Accessed 12 Dec 2013.

40. Stevens, K. (1971). Sources of inter- and intra-speaker variability in the acoustic properties of speech sounds. Proc. 7th Intern. Congr. Phon. Sc., Montreal, (pp. 206–227).

41. Atal, B. S. (1972). Automatic speaker recognition based on pitch contours. JASA, 52, 1687–1697.

42. Nolan, F. (1983). The phonetic bases of speaker recognition. Cambridge: Cambridge University Press.

43. Hollien, H. (1990). The acoustics of crime. The new science of forensic phonetics. New York: Plenum.

44. Kuwabara, H., & Sagisaka, Y. (1995). Acoustic characteristics of speaker individuality: control and conversion. Speech Communication, 16, 165–173.

45. Rottland, J., Neukirchen, C., Willett, D., & Rigoll, G. (1997). Large vocabulary speech recognition with context dependent MMI-connectionist/HMM systems using the WSJ database. EUROSPEECH.

46. Campbell, J., & Higgins, A. (1994). YOHO speaker verification LDC94S16. Web download. Philadelphia: Linguistic Data Consortium.

47. Hamzah, R., Jamil, N., & Seman, N. (2014). Filled pause classification using energy-boosted mel-frequency cepstrum coefficients. In Proc. Int. Conference on Robotic, Vision, Signal Processing & Power Applications.

48. NIST Multimodal Information Group. (2008). NIST speaker recognition evaluation test set LDC2011S08. Web download (p. 2011). Philadelphia: Linguistic Data Consortium.

49. Leonard, R. G., & Doddington, G. (1993). TIDIGITS LDC93S10. Web download. Philadelphia: Linguistic Data Consortium.

50. Lee, D.-S. (2005). Effective Gaussian mixture learning for video background subtraction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(5), 827–832.

51. Lee, D.-S., Hull, J. J., & Erol, B. (2003). A Bayesian framework for Gaussian mixture background modeling. Proceedings of the International Conference on Image Processing, 3.

52. Templetom, P. D., & Guillemin, B. J. (1990). Speaker identification based on vowel sounds using neural networks. Proceedings of the 3rd International Conference on Speech Science and Technology, Australian Association, 280–285.

53. Miles, M. J. (1989). Speaker identification based upon an analysis of vowel sounds and its applications to forensic work. (Unpublished master's thesis). University of Auckland, New Zealand.

Noor Almaadeed received her Ph.D. from Brunel University, London, UK, in 2014. Her areas of research include speech signal detection, speaker identification, audio/visual speaker recognition, etc. She received her Bachelor's in Computer Science from Qatar University in 2000, and her Master of Science in Computer and Information Sciences from City University, UK, in 2005. She is an Assistant Professor in the Department of Computer Science & Engineering at Qatar University. She was awarded the Qatar Education Excellence Day Platinum Award - New PhD Holders Category 2014-2015.


Prof. Amar Aggoun is a Reader in Information and Communication Technologies at Brunel University, London, and the Head of the Computer Science and Technology Department, University of Bedfordshire. He received his Ph.D. from the University of Nottingham in image/video processing in 1991. His areas of research include 3DTV, video signal processing and compression, 3D computer graphics, audio-visual delivery, computer architectures, and computer vision systems.

Prof. Abbes Amira has taken academic and consultancy positions in the UK and overseas, including his current positions as a Professor in Computer Engineering at Qatar University, Qatar, and a full professor in visual communications and leader of the Visual Communication Cluster at the University of the West of Scotland (UWS), UK. He has held other academic positions at the University of Ulster, UK, Qatar University, Qatar, Brunel University, UK, and Queen's University Belfast, UK. He received his Ph.D. in Computer Engineering from Queen's University, Belfast, United Kingdom, in 2001. His areas of research include image and vision systems, embedded systems, image/video processing and analysis, pattern recognition, and connected health and security.
