Signal adaptive spectral envelope estimation for robust speech recognition




Available online at www.sciencedirect.com

www.elsevier.com/locate/specom

Speech Communication 51 (2009) 551–561


Matthias Wölfel

Institut für Theoretische Informatik, Universität Karlsruhe (TH), Am Fasanengarten 5, 76131 Karlsruhe, Germany

Received 29 May 2008; received in revised form 26 January 2009; accepted 24 February 2009

Abstract

This paper describes a novel spectral envelope estimation technique which adapts to the characteristics of the observed signal. This is possible via the introduction of a second bilinear transformation into warped minimum variance distortionless response (MVDR) spectral envelope estimation. As opposed to the first bilinear transformation, however, which is applied in the time domain, the second bilinear transformation must be applied in the frequency domain. This extension enables the resolution of the spectral envelope estimate to be steered to lower or higher frequencies, while keeping the overall resolution of the estimate and the frequency axis fixed. When embedded in the feature extraction process of an automatic speech recognition system, it provides for the emphasis of the characteristics of speech features that are relevant for robust classification, while simultaneously suppressing characteristics that are irrelevant for classification. The change in resolution may be steered, for each observation window, by the normalized first autocorrelation coefficient.

To evaluate the proposed adaptive spectral envelope technique, dubbed warped-twice MVDR, we use two objective functions: class separability and word error rate. Our test set consists of development and evaluation data as provided by NIST for the Rich Transcription 2005 Spring Meeting Recognition Evaluation. For both measures, we observed consistent improvements for several speaker-to-microphone distances. On average, over all distances, the proposed front-end reduces the word error rate by 4% relative compared to the widely used mel-frequency cepstral coefficients as well as perceptual linear prediction.

© 2009 Elsevier B.V. All rights reserved.

Keywords: Adaptive feature extraction; Spectral estimation; Minimum variance distortionless response; Automatic speech recognition; Bilinear transformation; Time vs. frequency domain

1. Introduction

Acoustic modeling in automatic speech recognition (ASR) requires that a windowed speech waveform is reduced to a set of representative features which preserve the information needed to determine the phonetic class while being invariant to other factors. Those factors might include speaker differences such as fundamental frequency, accent, emotional state or speaking rate, as well as distortions due to ambient noise, the channel or reverberation. In the traditional feature extraction process of ASR systems, this is achieved through successive feature transformations (e.g. a spectral envelope and/or filterbank followed by cepstral transformation, cepstral normalization and linear discriminant analysis) whereby all phoneme types are treated equivalently.

0167-6393/$ - see front matter © 2009 Elsevier B.V. All rights reserved.
doi:10.1016/j.specom.2009.02.006
E-mail address: [email protected]

Different phonemes, however, have different properties, such as voicing, where the excitation is due to quasi-periodic opening of the vocal cords, or classification relevant frequency regions (Olive, 1993; Mesgarani et al., 2007; Driaunys et al., 2005). While low frequencies are more relevant for vowels, high frequencies are more relevant for fricatives. It is thus a natural extension to the traditional feature extraction approach to vary the spectral resolution for each observation window according to some characteristics of the observed signal. To improve phoneme classification, the spectral resolution may be adapted such that



characteristics relevant for classification are emphasized while characteristics irrelevant for classification are attenuated.

To achieve these objectives, we have proposed to extend the warped minimum variance distortionless response

(MVDR) through a second bilinear transformation (Wölfel, 2006). This spectral envelope estimate has two free parameters to control spectral resolution: the model order, which changes the number of linear prediction coefficients, and the warp factor. While the model order allows the overall spectral resolution to be changed, the warp factor enables the spectral resolution to be steered to lower or higher frequency regions without changing the frequency axis. Note that this is in contrast to the previously proposed warped MVDR (Wölfel et al., 2003; Wölfel and McDonough, 2005), wherein the warp factor has an influence on both the spectral resolution and the frequency axis.

A note about the differences between the present publication and (Wölfel, 2006) is perhaps now in order. The present publication presents important background information which had been discarded in the conference publication (Wölfel, 2006) because of space limitations, including a comparison of well known and not so well known ASR front-ends on close and distant recordings in terms of word error rate and class separability. The present publication also includes a detailed analysis and discussion of phoneme confusability. In addition, it fosters understanding by highlighting the differences between warping in the time and frequency domain and by investigating the values of the steering function in relation to single phonemes and phoneme classes.

The balance of this paper is organized as follows. A brief review of spectral envelope estimation techniques with a focus on MVDR is given in Section 2. The bilinear transformation is reviewed in Section 3, where its properties in the time and frequency domains are discussed. Section 4 introduces a novel adaptive spectral estimation technique, dubbed warped-twice MVDR, and a fast implementation thereof. A possible steering function, to emphasize phoneme relevant spectral regions, is discussed in Section 5. The proposed signal adaptive feature extraction scheme is

evaluated in Section 6. Our conclusions are presented in the final section of this paper.

Table 1. Properties of spectral estimation methods.

Spectral estimate | Detail | Resolution | Sensitive to pitch
PS | Exact | Linear, static | Very high
Mel-scale PS (Stevens et al., 1937) | Smooth | Mel, static | High
LP (Yule, 1927; Makhoul, 1975) | Approx. | Linear, static | Medium
Perceptual LP (Hermansky, 1990) | Approx. | Mel, static | Medium
Warped LP (Strube, 1980; Matsumoto and Moroto, 2001) | Approx. | Mel, static | Medium
Warped-twice LP (a) (Nakatoh et al., 2004) | Approx. | Mel, adaptive | Medium
MVDR (Murthi and Rao, 1997, 2000; Dharanipragada and Rao, 2001) | Approx. | Linear, static | Low
Warped MVDR (Wölfel and McDonough, 2005) | Approx. | Mel, static | Low
Perceptual MVDR (Dharanipragada et al., 2007) | Approx. | Mel, static | Low
Warped-twice MVDR | Approx. | Mel, adaptive | Low

PS is the power spectrum; LP is the linear prediction; and MVDR is the minimum variance distortionless response. (a) No particular name is given in the work by Nakatoh et al.

2. MVDR spectral envelope

In the feature extraction stage of speech recognition systems, particular characteristics of the spectral estimate are required. To name a few: provide a particular spectral resolution, be robust to noise, and model the frequency response function of the vocal tract during voiced speech. To satisfy these requirements, both non-parametric and parametric methods have been proposed. Non-parametric methods are based on periodograms, such as power spectra, while parametric methods such as linear prediction estimate a small number of parameters from the data. Table 1 summarizes the characteristics of different spectral estimation methods. Two widely used methods in ASR are the mel-scale power spectrum (Davis and Mermelstein, 1980) and warped or perceptual linear prediction (Hermansky, 1990).

In order to overcome the problem associated with (warped or perceptual) linear prediction, namely the overestimation of spectral power at the harmonics of voiced speech, Murthi and Rao (1997, 2000) proposed the use of minimum variance distortionless response (MVDR), which is also known as Capon's method (Capon, 1969) or the maximum-likelihood method (Musicus, 1985), for all-pole modeling of speech in 1997. They demonstrated that MVDR spectral envelopes cope well with the aforementioned problem. Some years later, in 2001, MVDR was applied to speech recognition by Dharanipragada and Rao (2001). To account for the frequency resolution of the human auditory system, we have introduced warped MVDR (Wölfel et al., 2003; Wölfel and McDonough, 2005). It extends the MVDR approach by warping the frequency axis with a bilinear transformation in the time domain.

In this section, we briefly review MVDR spectral estimation. A detailed discussion of speech spectral estimation by MVDR can be found in (Murthi and Rao, 2000), with


focus on speech recognition and warped MVDR in (Wölfel and McDonough, 2005), and with focus on robust feature extraction for recognition in (Dharanipragada et al., 2007).

2.1. MVDR methodology

MVDR spectral estimation can be posed as a problem in filterbank design, wherein the final filterbank is subject to the distortionless constraint (Haykin, 1991): the signal at the frequency of interest ω_foi must pass undistorted with unity gain:

H_{foi}(e^{j\omega_{foi}}) = \sum_{k=0}^{M} h_{foi}(k)\, e^{-jk\omega_{foi}} = 1,   (1)

where the impulse response h_foi(k) of the distortionless finite impulse response filter of order M is specifically designed to minimize the output power. Defining the fixed frequency vector

v(e^{j\omega}) = [1, e^{+j\omega}, \ldots, e^{+jM\omega}]^T   (2)

allows the constraint to be rewritten in vector form as

v^H(e^{j\omega_{foi}}) \cdot h_{foi} = 1,   (3)

where (\cdot)^H represents the Hermitian transpose operator and

h_{foi} = [h_{foi}(0), h_{foi}(1), \ldots, h_{foi}(M)]^T   (4)

is the distortionless filter.

Upon defining the autocorrelation sequence

R[n] = \sum_{m=0}^{L-n} x[m]\, x[m-n]   (5)

of the input signal x of length L, as well as the (M+1) × (M+1) Toeplitz autocorrelation matrix R whose (l,k)th element is given by

R_{l,k} = R[l-k],   (6)

it is readily shown that h_foi can be obtained by solving the constrained minimization problem

\min_{h_{foi}}\; h_{foi}^T R\, h_{foi} \quad \text{subject to} \quad v^H(e^{j\omega_{foi}})\, h_{foi} = 1.   (7)

The solution to this problem is given by Haykin (1991):

h_{foi} = \frac{R^{-1} v(e^{j\omega_{foi}})}{v^H(e^{j\omega_{foi}})\, R^{-1} v(e^{j\omega_{foi}})}.   (8)

This implies that h_foi is the impulse response of the distortionless filter for the frequency ω_foi. The MVDR envelope of the power spectrum of the signal P(e^{jω}) at frequency ω_foi is then obtained as the output of the optimized constrained filter:

S_{MVDR}(e^{j\omega_{foi}}) = \frac{1}{2\pi} \int_{-\pi}^{\pi} |H_{foi}(e^{j\omega})|^2\, P(e^{j\omega})\, d\omega.   (9)

Although MVDR spectral estimation was posed as a distortionless filter design for a given frequency ω_foi, the MVDR spectrum can be represented in parametric form for all frequencies (Haykin, 1991):

S_{MVDR}(e^{j\omega}) = \frac{1}{v^H(e^{j\omega})\, R^{-1} v(e^{j\omega})}.   (10)
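For a concrete feel of (10), the quadratic form can be evaluated in closed form for model order M = 1, where the 2×2 Toeplitz matrix inverts analytically. The sketch below is an illustration under our own naming, not code from the paper, and assumes a real-valued autocorrelation; for unit-power white noise (R = I) the MVDR spectrum is flat with value 1/(M+1).

```python
import cmath

def mvdr_spectrum_m1(r0, r1, omega):
    """Evaluate S_MVDR(e^{j omega}) = 1 / (v^H R^{-1} v), Eq. (10), for M = 1.

    R = [[r0, r1], [r1, r0]] is the 2x2 Toeplitz autocorrelation matrix and
    v = [1, e^{+j omega}]^T is the fixed frequency vector of Eq. (2).
    """
    det = r0 * r0 - r1 * r1
    # Closed-form inverse of the 2x2 symmetric Toeplitz matrix.
    inv = [[r0 / det, -r1 / det], [-r1 / det, r0 / det]]
    v = [1.0 + 0.0j, cmath.exp(1j * omega)]
    # Quadratic form v^H R^{-1} v (real-valued by symmetry).
    q = sum(v[l].conjugate() * inv[l][k] * v[k] for l in range(2) for k in range(2))
    return 1.0 / q.real
```

For r0 = 1, r1 = 0 this returns 1/(M+1) = 0.5 at every frequency, while a positive lag-one correlation tilts the estimate towards low frequencies.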

2.2. Fast computation of the MVDR envelope

Assuming that the (M+1) × (M+1) Hermitian Toeplitz correlation matrix R is positive definite and thus invertible, Musicus (1985) derived a fast algorithm to calculate the MVDR spectrum from a set of linear prediction coefficients (LPCs). The steps (i)-(iii) of Musicus' algorithm (Musicus, 1985), together with a subsequent scaling step (iv), are:

(i) Computation of the LPCs a^{(M)}_{0...M} of order M, including the prediction error variance ε_M.

(ii) Correlation of the LPCs:

\mu_k = \begin{cases} \dfrac{1}{\epsilon_M} \sum_{m=0}^{M-k} (M+1-k-2m)\, a^{(M)}_m\, a^{*(M)}_{m+k}, & k \ge 0, \\ \mu^*_{-k}, & k < 0. \end{cases}   (11)

(iii) Computation of the MVDR envelope:

S_{MVDR}(e^{j\omega}) = \frac{1}{\sum_{m=-M}^{M} \mu_m\, e^{-j\omega m}}.   (12)

(iv) Scaling of the MVDR envelope: in order to improve robustness to additive noise, it has been argued in (Wölfel and McDonough, 2005) that the highest spectral peak of the MVDR envelope should be adjusted to match the highest spectral peak of the power spectrum, yielding the so-called scaled envelope.
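Steps (i)-(iii) can be sketched in code; the following is an illustrative pure-Python implementation (function names are ours, not from the paper). A convenient sanity check: for unit-power white noise R is the identity, and the MVDR spectrum of model order M is flat with value 1/(M+1).

```python
import cmath

def levinson_durbin(r, M):
    """Step (i): Levinson-Durbin recursion mapping autocorrelations r[0..M]
    to LPCs a[0..M] (with a[0] = 1) and the prediction error variance e_M."""
    a = [1.0] + [0.0] * M
    e = r[0]
    for i in range(1, M + 1):
        acc = sum(a[j] * r[i - j] for j in range(i))
        k = -acc / e                      # reflection coefficient
        a = [a[j] + k * a[i - j] for j in range(i + 1)] + a[i + 1:]
        e *= 1.0 - k * k
    return a, e

def mvdr_envelope(r, M, omega):
    """Steps (ii)-(iii): correlate the LPCs as in Eq. (11), evaluate Eq. (12)."""
    a, e = levinson_durbin(r, M)
    mu = {}
    for k in range(M + 1):
        s = sum((M + 1 - k - 2 * m) * a[m] * a[m + k] for m in range(M - k + 1)) / e
        mu[k] = s
        mu[-k] = s  # real-valued signal: mu_{-k} = conj(mu_k) = mu_k
    denom = sum(mu[m] * cmath.exp(-1j * omega * m) for m in range(-M, M + 1))
    return 1.0 / denom.real
```

For white noise of unit power (r = [1, 0, 0, 0, 0], M = 4) the envelope evaluates to 1/5 at every frequency.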

3. Warping – time vs. frequency domain

In the speech recognition community it is well known that features based on a non-linear frequency mapping improve the recognition accuracy over features on a linear frequency scale (Davis and Mermelstein, 1980). Transforming the linear frequency axis ω to a non-linear frequency axis ω̃ is called frequency warping. One way to achieve frequency warping is to apply a non-linearly scaled filterbank, such as a mel-filterbank, to the linear frequency representation. An alternative possibility is to use a conformal mapping such as a first order all-pass filter, also known as a bilinear transformation (Oppenheim et al., 1971; Braccini and Oppenheim, 1974), which preserves the unit circle. The bilinear transformation is defined in the z-domain as

\tilde{z}^{-1} = \frac{z^{-1} - \alpha}{1 - \alpha\, z^{-1}}, \quad \forall\; -1 < \alpha < +1,   (13)

where α is the warp factor. The relationship between ω̃ and ω is non-linear, as indicated by the phase function of the all-pass filter (Matsumoto and Moroto, 2001)


Fig. 1. Mel-frequency (scale shown along left edge) can be approximated by a bilinear transformation (scale shown along right edge), demonstrated for a sampling rate of 16 kHz, α_mel = 0.4595.


\arg\left(e^{-j\tilde{\omega}}\right) = \tilde{\omega} = \omega + 2 \arctan\left( \frac{\alpha \sin\omega}{1 - \alpha \cos\omega} \right).   (14)

The mel-scale, which along with the Bark scale is one of the most popular non-linear frequency mappings, was proposed by Stevens et al. (1937). It models the non-linear pitch perception characteristics of the human ear and is widely applied in audio feature extraction. A good approximation of the mel-scale by the bilinear transformation is possible if the warp factor is set accordingly. The optimal warp factor depends on the sampling frequency and can be found by different optimization methods (Smith and Abel, 1999). Fig. 1 compares the mel-scale with the bilinear transformation for a sampling frequency of 16 kHz.
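The warping of (13) and (14) can be checked numerically. The sketch below (our naming, illustrative only) evaluates the phase function and confirms the composition rule for cascaded bilinear transformations that reappears later as Eq. (24): warping by a and then by b equals a single warp with factor (a + b)/(1 + a·b).

```python
import math

def bilinear_warp(omega, alpha):
    """Warped frequency via the all-pass phase function, Eq. (14)."""
    return omega + 2.0 * math.atan(
        alpha * math.sin(omega) / (1.0 - alpha * math.cos(omega)))
```

With alpha = 0 the axis is unchanged; with alpha = 0.4595 (the mel approximation at 16 kHz) the warped frequency exceeds the linear one for 0 < omega < pi, i.e. low frequencies are stretched.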

Frequency warping by bilinear transformation can be applied either in the time domain or in the frequency domain. In both cases, the frequency axis is non-linearly scaled; however, the effect on the spectral resolution differs between the two domains. This effect can be explained as follows:

– Warping in the time domain modifies the values in the autocorrelation matrix and therefore, in the case of linear prediction, more linear prediction coefficients are used, for α > 0, to describe lower frequencies and fewer coefficients to describe higher frequencies.

Fig. 2. Warping in (a) time domain, (b) no warping and (c) warping in frequency domain. While warping in the time domain is changing the spectral resolution and frequency axis, warping in the frequency domain does not alter the spectral resolution but still changes the frequency axis.

– Warping in the frequency domain does not change the spectral resolution, as the transformation is applied after spectral analysis. As indicated by Nocerino et al. (1985), a general warping transformation in the same domain, such as the bilinear transformation, is equivalent to a matrix multiplication

f_{warp}[n] = L(\alpha)\, f[n],

where the matrix L(α) depends on the warp factor. It follows that the values f_warp[n] on the warped scale are a linear interpolation of the values f[n] on the linear scale. In the case of linear prediction or MVDR, the prediction coefficients are not altered, as they are calculated before the bilinear transformation is applied.

Fig. 2 demonstrates the effect of warping applied either in the time or in the frequency domain on the spectral envelope and compares the warped spectral envelopes with the unwarped spectral envelope.

For clarity, we briefly investigate the change of spectral resolution for the most interesting case, where the bilinear transformation is applied in the time domain with warp factor α > 0. In this case, we observe that spectral resolution decreases as frequency increases. In comparison to the resolution provided by the linear frequency scale, α = 0, the warped frequency resolution increases for low frequencies up to the turning point frequency (Härmä and Laine, 2001)

f_{tp}(\alpha) = \frac{f_s}{2\pi} \arccos(\alpha),   (15)

where f_s represents the sampling frequency. At the turning point frequency, the spectral resolution is not affected. Above the turning point frequency, the frequency resolution decreases in comparison to the resolution provided by the linear frequency scale. For α < 0, spectral resolution increases as frequency increases.
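As a quick numeric check of (15) (illustrative code; the parameter values are those used elsewhere in the paper): with f_s = 16 kHz and α = α_mel = 0.4595 the turning point lies near 2.8 kHz, and α = 0 gives f_s/4, the midpoint of the linear axis.

```python
import math

def turning_point_frequency(alpha, fs):
    """Turning point frequency of Eq. (15); below it (for alpha > 0) the
    warped estimate has finer resolution than the linear frequency scale."""
    return fs / (2.0 * math.pi) * math.acos(alpha)
```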

As observed by Strube (1980), prediction error minimization of the predictors ã_m in the warped domain is equivalent to the minimization of the output power of the warped inverse filter



Fig. 3. The plot of two warped-twice MVDR spectral envelopes demonstrates the effect of spectral tilt. While the spectral tilt is not compensated for the dashed line, it is compensated for the solid line. It is clearly visible that high frequencies are emphasized if no compensation is applied.


\tilde{A}(z) = 1 + \sum_{m=1}^{M} \tilde{a}_m\, \tilde{z}^{-m}(z)   (16)

in the linear domain, where each unit delay element z^{-1} is replaced by a bilinear transformation z̃^{-1}. The prediction error is therefore given by

E(e^{j\omega}) = |\tilde{A}(e^{j\omega})|^2\, P(e^{j\omega}),   (17)

where P(e^{jω}) is the power spectrum of the signal. The total prediction error power can be expressed as

\sigma^2 = \int_{-\pi}^{\pi} E(e^{j\tilde{\omega}})\, d\tilde{\omega} = \int_{-\pi}^{\pi} E(e^{j\omega})\, W^2(e^{j\omega})\, d\omega   (18)

Fig. 4. The solid lines show warped-twice MVDR spectral envelopes with model order 60, α = 0.4595 and α_mel = 0.4595 which, except for the spectral tilt, are identical to a warped MVDR spectral envelope. Its counterparts with lower and higher (a) model order and (b) warp factor α are given by dashed lines. The arrows point in the direction of higher resolution. While the model order changes the overall spectral resolution at all frequencies, the warp factor moves spectral resolution to lower or higher frequencies. At the turning point frequency, the resolution is not affected and the direction of the arrows changes.

with

W(z) = \frac{\sqrt{1 - \alpha^2}}{1 - \alpha\, z^{-1}}.   (19)

The minimization of the prediction error σ², however, does not lead to minimization of the power, but to minimization of the power of the error signal filtered by the weighting filter W(z), which is apparent from the presence of this factor in (18). Thus, the bilinear transformation introduces an unwanted spectral tilt. To compensate for this negative effect, we apply the inverted weighting function

\left| \tilde{W}(\tilde{z}) \cdot \tilde{W}(\tilde{z}^{-1}) \right|^{-1} = \frac{|1 + \alpha \cdot \tilde{z}^{-1}|^2}{1 - \alpha^2}.   (20)

The effect of the spectral tilt of the bilinear transformation and the remedy by (20) are depicted in Fig. 3.
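The tilt introduced by W(z) in (19) can be made concrete numerically (an illustrative sketch, names ours): for α > 0, |W|² exceeds unity at DC and drops below unity at the Nyquist frequency, so the error criterion is weighted unevenly across the band; (20) removes exactly this factor.

```python
import cmath

def weight_squared(alpha, omega):
    """|W(e^{j omega})|^2 for the weighting filter of Eq. (19)."""
    num = 1.0 - alpha * alpha
    den = abs(1.0 - alpha * cmath.exp(-1j * omega)) ** 2
    return num / den
```

At omega = 0 this reduces to (1 + alpha)/(1 - alpha), and at omega = pi to (1 - alpha)/(1 + alpha), i.e. the weighting is flat only for alpha = 0.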

4. Warped-twice MVDR spectral envelope

The use of two bilinear transformations, one in the time domain and the other in the frequency domain, introduces two additional free parameters into the MVDR approach (Wölfel, 2006). The first free parameter, the model order, is already determined by the underlying linear prediction model. Because the two bilinear transformations introduce two warping stages into MVDR spectral estimation, the proposed approach is dubbed warped-twice MVDR. While the model order varies the overall spectral resolution of the estimate, which becomes apparent by comparing the different envelopes for model orders 30, 60 and 90 in Fig. 4a, the warp factors bend the frequency axis as already seen in Section 3. Bending the frequency axis can be used to apply the mel-scale or, when done on a speaker-dependent basis, to implement vocal tract length normalization (VTLN), although the latter is not used in the experiments described in Section 6, as piece-wise linear warping leads to better results (Wölfel, 2003).

Fig. 5. Overview of warped-twice minimum variance distortionless response. Symbols are defined as in the text.

As already mentioned in Section 1, our aim is to change the spectral resolution while keeping the frequency axis fixed. This becomes possible by compensating for the unwanted bending of the frequency axis, introduced by the first warping stage in the time domain, with a second warping stage in the frequency domain. An example is given in Fig. 4b.

4.1. Fast computation of the warped-twice MVDR envelope

A fast computation of the warped-twice MVDR envelope of model order M is possible by extending Musicus' algorithm. A flowchart diagram of the individual processing steps is given in Fig. 5.

(i) Computation of the warped autocorrelation coefficients R̃[0] ⋯ R̃[M+1]: to compute warped autocorrelation coefficients, the linear frequency axis ω has to be transformed to a warped frequency axis ω̃ by replacing the unit delay element z^{-1} with a bilinear transformation (13). This leads to the warped autocorrelation coefficients (Matsumoto et al., 1998; Matsumoto and Moroto, 2001)

\tilde{R}[n] = \sum_{m=0}^{L-n-1} x[m]\, y_n[m],   (21)

where y_n[m] is the sequence of length L given by

y_n[m] = \alpha \cdot (y_n[m-1] - y_{n-1}[m]) + y_{n-1}[m-1]   (22)

and initialized with y_0[m] = x[m]. Note that the warped autocorrelation coefficients must be calculated up to R̃[M+1]; the additional coefficient is used in the compensation step.

(ii) Calculation of the compensation warp factor: to fit the final frequency axis to the mel-scale, we need to compensate for the first warping stage with value α in a second warping stage with the warp factor

\beta = \frac{\alpha - \alpha_{mel}}{1 - \alpha \cdot \alpha_{mel}}.   (23)

(iii) Compensation for the spectral tilt: to compensate for the distortion introduced by the concatenated bilinear transformations with warp factors α and β, we first combine the cascade of warping stages into a single warping stage with the warp factor

\nu = \frac{\alpha + \beta}{1 + \alpha \cdot \beta}.   (24)

A derivation of (24) is provided in (Acero et al., 1990). To get a flat transfer function, we now apply the inverted weighting function

\left| \tilde{W}(\tilde{z}) \cdot \tilde{W}(\tilde{z}^{-1}) \right|^{-1}   (25)

to the warped autocorrelation coefficients, which can be realized as a second order finite impulse response filter:

\hat{R}[m] = \frac{(1 + \nu^2)\, \tilde{R}[m] + \nu \cdot \tilde{R}[m-1] + \nu \cdot \tilde{R}[m+1]}{1 - \nu^2}.   (26)

(iv) Computation of the warped LPCs â^{(M)}_{0...M}, including the warped prediction error variance ε̂_M: the warped LPCs can now be estimated using the Levinson-Durbin recursion (Oppenheim and Schafer, 1989), by replacing the linear autocorrelation coefficients R with their warped and spectral-tilt-compensated counterparts R̂.

(v) Correlation of the warped LPCs: the MVDR parameters μ̂_k can be related to the LPCs by

\hat{\mu}_k = \begin{cases} \dfrac{1}{\hat{\epsilon}_M} \sum_{m=0}^{M-k} (M+1-k-2m)\, \hat{a}^{(M)}_m\, \hat{a}^{*(M)}_{m+k}, & k \ge 0, \\ \hat{\mu}^*_{-k}, & k < 0. \end{cases}   (27)


(vi) Computation of the warped-twice MVDR envelope: the spectral estimate can now be obtained by

S_{W2MVDR}(e^{j\omega}) = \frac{1}{\sum_{m=-M}^{M} \hat{\mu}_m \left( \dfrac{e^{j\omega} - \beta}{1 - \beta \cdot e^{j\omega}} \right)^{m}}.   (28)

Note that the spectrum (28), if β is set appropriately, already follows the non-linear frequency axis discussed in Section 3. In those cases it is necessary to either:

(a) eliminate the non-linearly spaced triangular filterbank as used, for example, in the extraction of mel-frequency cepstral coefficients or perceptual linear prediction coefficients, or

(b) replace the non-linearly spaced triangular filterbank by a filterbank of uniform half-overlapping triangular filters in order to provide feature reduction and additional spectral smoothing.

(vii) Scaling of the warped-twice MVDR envelope: to provide more robustness, we match the warped-twice MVDR envelope to the highest spectral peak of the power spectrum.
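Step (i) above can be sketched as follows; this is illustrative pure Python under our own naming, with the assumption that the sum in (21) runs over the full frame. Setting α = 0 collapses the recursion (22) to y_n[m] = x[m - n], recovering the ordinary autocorrelation.

```python
def warped_autocorrelation(x, alpha, num_lags):
    """Warped autocorrelation via Eqs. (21)-(22); returns R~[0..num_lags]."""
    L = len(x)
    y_prev = list(x)                       # y_0[m] = x[m]
    r = [sum(v * v for v in x)]            # n = 0: plain energy term
    for _ in range(num_lags):
        y = [0.0] * L
        for m in range(L):
            y_n_m1 = y[m - 1] if m > 0 else 0.0        # y_n[m-1]
            y_p_m1 = y_prev[m - 1] if m > 0 else 0.0   # y_{n-1}[m-1]
            y[m] = alpha * (y_n_m1 - y_prev[m]) + y_p_m1
        r.append(sum(x[m] * y[m] for m in range(L)))
        y_prev = y
    return r
```

With alpha = 0 the result equals the ordinary autocorrelation of Eq. (5); with a positive warp factor the lag terms differ, reflecting the non-uniform resolution of the warped axis.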

4.2. Implementation issues

Frequency warping, including linear or non-linear VTLN, can be realized using filterbanks. Carefully adjusted, those filterbanks can simulate the bilinear transformation in the frequency domain. In the case of warped-twice MVDR spectral estimation, those filterbanks can be adjusted for each individual frame according to the compensation warp factor β and the VTLN parameter. In practice it is sufficient to use a limited number of pre-calculated filterbanks; in this way, warped-twice MVDR spectral estimation can be implemented with only a very small overhead when compared to warped MVDR spectral estimation.
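The uniform half-overlapping triangular filterbank mentioned in step (vi)(b) can be sketched as follows (an illustrative construction, names ours): filter centers are spaced uniformly, each triangle extends to the centers of its neighbors, and the responses sum to one over the interior bins.

```python
def triangular_filterbank(num_filters, num_bins):
    """Uniform, half-overlapping triangular filters over num_bins spectral bins."""
    edges = [j * (num_bins - 1) / (num_filters + 1) for j in range(num_filters + 2)]
    bank = []
    for i in range(num_filters):
        left, center, right = edges[i], edges[i + 1], edges[i + 2]
        w = [0.0] * num_bins
        for b in range(num_bins):
            if left < b <= center:
                w[b] = (b - left) / (center - left)    # rising edge
            elif center < b < right:
                w[b] = (right - b) / (right - center)  # falling edge
        bank.append(w)
    return bank
```

Applying the bank to a spectrum reduces it to num_filters smoothed coefficients; because adjacent triangles overlap by half their support, no interior bin is over- or under-weighted.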

5. Steering function

Fig. 6. Values of the normalized first autocorrelation coefficient by phonemes. Different phone classes group either for small values, e.g. sibilants, unvoiced (italic) and fricatives (bold), or for high values, e.g. nasals.

To support automatic speech recognition, the free parameters of the warped-twice MVDR envelope have to be adapted in such a way that classification relevant characteristics are emphasized while less relevant information is suppressed. Nakatoh et al. (2004) proposed a method for steering the spectral resolution to lower or higher frequencies whereby, for every frame i, the first two autocorrelation coefficients are used to define the steering function

u_i = \frac{R_i[1]}{R_i[0]}.   (29)

The zeroth autocorrelation coefficient R[0] represents the average power, while the first autocorrelation coefficient R[1] represents the correlation of the signal. Thus, u has a high value for voiced signals and a low value for unvoiced signals. Fig. 6 gives the different values of the normalized first autocorrelation coefficient u, averaged over all samples, for each individual phoneme. A clear separation between the fricatives and non-fricatives can be observed. Fricatives are consonants produced by forcing air through a narrow channel made by placing two articulators close together. The sibilants are a particular subset of fricatives made by directing a jet of air through a narrow channel in the vocal tract towards the sharp edge of the teeth. Sibilants are louder than their non-sibilant counterparts, and most of their acoustic energy occurs at higher frequencies than for non-sibilant fricatives. A detailed discussion of the properties of different phoneme classes can be found in (Olive, 1993).

To adjust for the sensitivity to the steering function, the factor c is introduced, and the subtraction of the bias \bar{u} = \frac{1}{I} \sum_i u_i (i.e. the mean over all I values in the training set) keeps the average of α close to α_mel. This leads to

\alpha_i = c \cdot (u_i - \bar{u}) + \alpha_{mel}.   (30)

The last equation is a slight modification of the original formulation proposed by Nakatoh et al. As preliminary experiments revealed that the word accuracy is not very sensitive to c, we kept c fixed at 0.1; values around 0.1 might lead to slightly, but not significantly, different results. The influence of c has been investigated in more detail in (Nakatoh et al., 2004).
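The steering rule of (29) and (30) can be sketched per frame (illustrative code, names ours; in practice the bias ū is the mean over the training set):

```python
def steering_warp(x, u_bar, c=0.1, alpha_mel=0.4595):
    """Frame-wise warp factor from the normalized first autocorrelation
    coefficient, Eqs. (29) and (30)."""
    r0 = sum(v * v for v in x)
    r1 = sum(x[m] * x[m - 1] for m in range(1, len(x)))
    u = r1 / r0                  # normalized first autocorrelation, Eq. (29)
    return c * (u - u_bar) + alpha_mel, u
```

A smooth, voiced-like frame yields a high u and a warp factor above alpha_mel (more low-frequency resolution); a rapidly alternating, fricative-like frame yields a low u and a warp factor below alpha_mel.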

6. Evaluation

To evaluate the proposed warped-twice MVDR spectral estimation and steering function against traditional front-ends such as perceptual linear prediction (PLP) (Hermansky, 1990), mel-frequency cepstral coefficients (MFCC) (Davis and Mermelstein, 1980) and more recently proposed front-ends based on warped-twice LP or warped MVDR spectral envelopes, we used NIST's development and evaluation data of the Rich Transcription 2005 Spring Meeting Recognition Evaluation (NIST, 2005). The data has been chosen as a test environment as it contains challenging acoustic environments on both close and distant speech recordings. The development data, sampled at 16 kHz, consists of five seminars with approximately 130 min of speech. The evaluation data, also sampled at 16 kHz, consists of 16 seminars with approximately 180 min of speech. The data was collected under the Computers in the Human Interaction Loop (CHIL) project (CHIL, xxxx) and contains spontaneous, native and non-native speech.

Table 2. Average class separability and average word error rates for different front-end types and sanity checks.

Spectrum | MO | CC | CS | WER pass 1 (%) | pass 2 | pass 3
Power spectrum | – | 13 | 15.204 | 48.5 | 41.6 | 39.5
Power spectrum | – | 20 | 15.995 | 48.5 | 41.6 | 39.2
PLP | 13 | 13 | 15.075 | 47.4 | 41.0 | 39.2
PLP | 20 | 20 | 15.625 | 47.3 | 41.6 | 39.6
Warped MVDR | 60 | 13 | 15.199 | 48.5 | 41.6 | 39.6
Warped MVDR | 60 | 20 | 15.821 | 47.6 | 40.4 | 38.5
Warped-twice LP | 20 | 13 | 15.302 | 48.9 | 42.1 | 39.6
Warped-twice LP | 20 | 20 | 15.806 | 47.6 | 40.7 | 38.4
Warped-twice MVDR | 60 | 13 | 15.731 | 48.1 | 41.3 | 38.9
Warped-twice MVDR | 60 | 20 | 16.206 | 47.4 | 39.8 | 37.7

MO is the model order; CC is the number of cepstral coefficients; CS is the class separability.

We have used the Janus Recognition Toolkit (JRTk). Only relatively little supervised in-domain speech data is available to train acoustic models. Therefore, we decided to train the acoustic models on close talking channels of meeting corpora and the Translanguage English Database (TED) corpus (LDC, xxxx), summing up to a total of approximately 100 h of acoustic training material. After split and merge training, the acoustic model consisted of approximately 3500 context-dependent codebooks with up to 64 diagonal covariance Gaussians each, summing up to a total of 180,000 Gaussians.

Each front-end provided features every 10 ms (first and second pass) or 8 ms (third pass). Spectral estimates have been obtained by the Fourier transformation (MFCC), PLP, warped MVDR, warped-twice LP and warped-twice MVDR spectral estimation. While the Fourier transformation is followed by a mel-filterbank, warped MVDR, warped-twice LP and warped-twice MVDR are followed by a linear filterbank. The 30 (13 or 20 in the case of PLP) spectral features have been truncated to 13 or 20 cepstral coefficients after the cosine transformation. After mean and variance normalization, the cepstral features were stacked (seven adjacent left and right frames providing either 195 or 300 dimensions) and truncated to the final feature vector dimension of 42 by multiplication with the optimal feature space matrix (the linear discriminant analysis matrix multiplied with the global semi-tied covariance transformation matrix (Gales, 1999)).
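The stacking and projection step described above can be sketched as follows for the 13-cepstra case (15 × 13 = 195 stacked dimensions, projected to 42). The random projection matrix is a stand-in for the trained LDA × semi-tied covariance matrix; it and the function name are illustrative, not part of the original system.

```python
import numpy as np

def stack_and_project(cepstra, context=7, out_dim=42, proj=None):
    """Stack each frame with `context` left and right neighbours and
    project the stacked vector down to `out_dim` dimensions."""
    n, d = cepstra.shape
    # Replicate edge frames so every frame has a full context window.
    padded = np.pad(cepstra, ((context, context), (0, 0)), mode='edge')
    stacked = np.stack([padded[i:i + 2 * context + 1].reshape(-1)
                        for i in range(n)])
    if proj is None:
        # Placeholder for the trained LDA/semi-tied covariance matrix.
        proj = np.random.default_rng(3).standard_normal(
            (stacked.shape[1], out_dim)) / stacked.shape[1] ** 0.5
    return stacked @ proj

# 100 frames of 13 mean/variance-normalized cepstra -> 100 x 42 features.
cepstra = np.random.default_rng(4).standard_normal((100, 13))
feats = stack_and_project(cepstra)
```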

To train a four-gram language model, we used corpora consisting of broadcast news, proceedings of conferences such as ICSLP, Eurospeech, ICASSP, ACL and ASRU, and talks in TED. The vocabulary contains approximately 23,000 words; the perplexity is 120 with an out-of-vocabulary rate of 0.25%.

We compare the different front-ends on class separability and word error rate (WER).

6.1. Class separability

Class separability is a classical concept in pattern recognition, usually expressed using scatter matrices. We can define

– the within-class scatter matrix (S_w)

\[ S_w = \sum_{c=1}^{C} \left[ \sum_{n=1}^{N_c} (x_{cn} - \mu_c)(x_{cn} - \mu_c)^T \right], \qquad (31) \]

– the between-class scatter matrix (S_b)

\[ S_b = \sum_{c=1}^{C} N_c \, (\mu_c - \mu)(\mu_c - \mu)^T, \qquad (32) \]

– and the total scatter matrix (S_t)

\[ S_t = S_w + S_b = \sum_{c=1}^{C} \left[ \sum_{n=1}^{N_c} (x_{cn} - \mu)(x_{cn} - \mu)^T \right], \qquad (33) \]

where N_c denotes the number of samples in class c, \(\mu_c\) is the mean vector of the cth class, and \(\mu\) is the global mean vector over all C classes.

We would like to derive feature vectors such that all vectors belonging to the same class (e.g. phoneme) are close together in feature space and well separated from the feature vectors of other classes (e.g. all other phonemes). This property can be expressed using the scatter matrices; a small within-class scatter and a large between-class scatter stand for large class separability. Therefore, an approximate measure of class separability can be expressed by (Haeb-Umbach, 1999)

\[ D_d = \mathrm{trace}_d\!\left( S_w^{-1} S_b \right), \qquad (34) \]

where \(\mathrm{trace}_d\) is defined as the sum of the first d eigenvalues \(\lambda_i\) of \(S_w^{-1} S_b\) (a d-dimensional subspace) and hence the sum of the variances in the principal directions.
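Under the definitions above, the separability measure can be computed in a few lines of numpy. This is an illustrative sketch on toy data; the function name and the two-cluster example are not from the original evaluation.

```python
import numpy as np

def class_separability(X, y, d):
    """trace_d(Sw^{-1} Sb): sum of the d largest eigenvalues of
    Sw^{-1} Sb, with Sw and Sb as in Eqs. (31) and (32)."""
    mu = X.mean(axis=0)
    dim = X.shape[1]
    Sw = np.zeros((dim, dim))
    Sb = np.zeros((dim, dim))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)               # within-class
        Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu)  # between-class
    eigvals = np.linalg.eigvals(np.linalg.solve(Sw, Sb)).real
    return np.sort(eigvals)[::-1][:d].sum()

# Two well separated Gaussian clusters yield a much larger measure
# than two strongly overlapping ones.
rng = np.random.default_rng(1)
a = rng.standard_normal((200, 2))
labels = np.repeat([0, 1], 200)
far = class_separability(np.vstack([a, a + 8.0]), labels, d=1)
near = class_separability(np.vstack([a, a + 0.5]), labels, d=1)
```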

Comparing the class separability of different spectral estimation methods in Table 2, we first note that a higher number of cepstral coefficients always results in a higher class separability. Comparing the class separability for 20 cepstral coefficients on the different front-ends, we observe that class separability increases from PLP, warped-twice LP, warped MVDR and power spectrum to warped-twice MVDR. The class separability is significantly lower for PLP and significantly higher for warped-twice MVDR, while warped-twice LP, warped MVDR and the power spectrum have nearly the same value.

Table 3
Class separability and word error rates for different front-end types and settings on close microphone recordings.

Spectrum            Model  Cepstra  Class separability       Word error rate %
                                    Train   Develop  Eval.   Develop          Eval.
                                                             Pass 1  2  3     Pass 1  2  3
Power spectrum      –      13       11.007  16.470  16.088   36.1 30.3 28.0   35.3 29.7 27.7
Power spectrum      –      20       11.620  17.929  16.299   36.0 29.7 27.7   37.2 31.3 28.4
PLP                 13     13       10.699  17.110  15.152   34.7 29.3 27.2   34.2 29.6 27.1
PLP                 20     20       11.029  18.059  16.068   34.7 29.5 27.7   34.9 30.3 27.9
Warped MVDR         60     13       10.768  16.813  16.261   35.0 30.0 28.2   35.5 29.9 27.6
Warped MVDR         60     20       11.337  18.022  16.614   34.5 29.1 27.3   35.3 29.6 27.3
Warped-twice LP     20     13       10.772  17.038  16.254   35.3 30.5 28.5   36.2 29.8 27.1
Warped-twice LP     20     20       11.333  17.864  16.436   34.4 29.5 27.4   37.1 29.4 26.8
Warped-twice MVDR   60     13       10.893  17.673  16.456   34.5 29.5 27.5   34.1 29.2 27.0
Warped-twice MVDR   60     20       11.473  18.510  16.818   34.1 28.8 26.8   35.4 29.0 26.3

Table 4
Class separability and word error rates for different front-end types and settings on distant microphone recordings.

Spectrum            Model  Cepstra  Class separability       Word error rate %
                                    Train   Develop  Eval.   Develop          Eval.
                                                             Pass 1  2  3     Pass 1  2  3
Power spectrum      –      13       11.007  14.786  13.470   61.9 52.0 51.1   60.8 54.2 51.1
Power spectrum      –      20       11.620  15.806  13.944   59.8 50.4 48.9   61.0 55.0 51.7
PLP                 13     13       10.699  15.121  12.917   60.7 51.8 50.5   59.9 53.4 51.8
PLP                 20     20       11.029  15.399  12.975   59.8 52.1 50.2   59.6 54.4 52.7
Warped MVDR         60     13       10.768  13.836  13.885   62.9 53.7 52.0   60.7 52.8 50.7
Warped MVDR         60     20       11.337  14.487  14.161   60.9 51.2 49.7   59.6 51.7 49.5
Warped-twice LP     20     13       10.772  14.524  13.393   62.8 53.8 52.1   61.1 54.5 50.9
Warped-twice LP     20     20       11.333  15.119  13.803   58.9 50.8 49.3   59.9 53.0 50.2
Warped-twice MVDR   60     13       10.893  14.895  13.901   63.1 53.6 51.6   60.7 52.7 49.3
Warped-twice MVDR   60     20       11.473  15.380  14.116   60.3 51.1 49.8   59.9 50.4 47.9

On close talking microphone recordings in Table 3, we observe that warped-twice MVDR provides the features with the highest separability on the development as well as the evaluation set. Averaging development and evaluation set, the warped-twice MVDR is followed by warped MVDR, warped-twice LP, power spectrum and PLP. On distant microphone recordings, where the distance between speakers and microphones varies between approximately one and three meters, the power spectrum has the highest class separability on the development set. On the evaluation set, warped-twice MVDR performs equally well as warped MVDR, see Table 4. Averaging development and evaluation set on the distant data, the power spectrum provides the highest class separability, followed by warped-twice MVDR, warped-twice LP, warped MVDR and PLP.

6.2. Word error rates

The WERs of our speech recognition experiments for different spectral estimation techniques and recognition passes are shown for close talking microphone recordings in Table 3 and for distant microphone recordings in Table 4. The first pass is unadapted, while the second and third pass are adapted on the hypothesis of the previous pass using maximum likelihood linear regression (MLLR) (Leggetter and Woodland, 1995), constrained MLLR (CMLLR) (Gales, 1998) and VTLN (Welling et al., 2002).

Comparing the WERs of the different spectral estimation methods in Table 2, we observe that a higher number of cepstral coefficients does not always result in a lower WER. Power spectra, warped and warped-twice MVDR envelopes tend to perform better with 20 cepstral coefficients, while PLP performs better with 13 cepstral coefficients.

The following discussion always refers to the lower WER. On average, warped-twice MVDR provides the lowest WER, followed by warped-twice LP and warped MVDR, which perform equally well. Compared to the power spectrum, PLP has a lower WER on the first and second pass and an equal WER on the third. PLP provides the lowest feature resolution, which seems to be an advantage on the first pass; after model adaptation, however, the lower feature resolution seems to be a disadvantage.

Investigating the WER on close microphone recordings, Table 3, we observe that the warped-twice MVDR front-end provides the best recognition performance, followed by PLP and warped-twice LP, which perform equally well. Warped MVDR ranks before the power spectrum, which had the lowest recognition performance.



On distant microphone recordings, Table 4, the warped-twice MVDR front-end shows robust performance and has, on average, the lowest WER. On the development set, however, the power spectrum has the lowest WER. On average, the warped-twice MVDR is followed by warped MVDR, then warped-twice LP, thereafter the power spectrum, due to a weak performance on the evaluation set, and PLP in last place.

The reduced improvement of the warped-twice MVDR in comparison to the warped MVDR on distant recordings can be explained by the fact that, in comparison to close talking microphone recordings, the range of the values u_i over all i is reduced. Therefore, the effect of spectral resolution steering is attenuated, and consequently warped-twice MVDR envelopes behave more similarly to warped MVDR envelopes.

Table 5
Nearest phoneme distance for different phonemes (ordered by u) and spectral estimation methods.

Phoneme                       S     SH    CH    Z     JH    ZH    F     TH    T     K     …  OW    OY    W     …
u                             0.51  0.55  0.60  0.62  0.73  0.78  0.80  0.81  0.85  0.89  …  0.97  0.97  0.97  …

Power spectrum     Nearest    Z     CH    JH    S     CH    JH    T     T     TH    P     …  XL    OW    B     …
                   Distance   2.41  1.56  0.81  2.27  1.36  1.55  2.36  2.04  1.75  2.33  …  3.19  3.55  3.04  …

Warped MVDR        Nearest    Z     CH    JH    S     CH    JH    T     T     TH    P     …  XL    AY    B     …
                   Distance   2.32  1.56  0.86  2.21  1.65  1.49  2.26  2.03  1.74  2.36  …  3.49  3.8   3.29  …

Warped-twice LP    Nearest    Z     CH    JH    S     CH    JH    K     T     TH    P     …  XL    OW    B     …
                   Distance   2.46  1.58  0.87  2.26  1.78  1.5   2.38  2.09  1.72  2.37  …  3.22  3.47  3.06  …

Warped-twice MVDR  Nearest    Z     CH    JH    S     CH    JH    T     T     TH    P     …  XL    OW    B     …
                   Distance   2.43  1.6   0.85  2.24  1.75  1.58  2.35  2.08  1.74  2.35  …  3.26  3.59  3.1   …

6.3. Phoneme confusability

We investigate the confusability between phonemes by calculating the minimum distances, on the final features, between different phoneme pairs. In order to account for the range of variability of the sample points in both phoneme classes \(\Omega_p\) and \(\Omega_q\), expressed by the covariance matrices \(\Sigma_p\) and \(\Sigma_q\), we extend the well known Mahalanobis distance by a second covariance matrix

\[ D_{p,q} = \sqrt{ (\mu_p - \mu_q)^T \left( \Sigma_p + \Sigma_q \right)^{-1} (\mu_p - \mu_q) }. \]

Here \(\mu_p\) denotes the sample mean of phoneme class \(\Omega_p\) and \(\mu_q\) denotes the sample mean of phoneme class \(\Omega_q\), respectively.
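The two-covariance distance can be sketched in numpy as follows; the sample data and the function name are illustrative, not from the original experiments.

```python
import numpy as np

def phoneme_distance(Xp, Xq):
    """Mahalanobis-like distance between two phoneme classes, using the
    sum of both class covariances as described above."""
    mu_p, mu_q = Xp.mean(axis=0), Xq.mean(axis=0)
    sigma = np.cov(Xp, rowvar=False) + np.cov(Xq, rowvar=False)
    diff = mu_p - mu_q
    # Solve instead of explicitly inverting the pooled covariance.
    return float(np.sqrt(diff @ np.linalg.solve(sigma, diff)))

# Two classes with well separated means are far apart; two classes
# drawn from the same distribution are close.
rng = np.random.default_rng(2)
p = rng.standard_normal((500, 3))
q = rng.standard_normal((500, 3)) + 4.0
d_far = phoneme_distance(p, q)
d_near = phoneme_distance(p, rng.standard_normal((500, 3)))
```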

As a comparison of the confusion matrix itself would be impractical, we limit our investigation to the comparison of the distance between the nearest phoneme to a given phoneme for different spectral estimation techniques, as listed in Table 5. Note that the PLP front-end is excluded from this analysis as it, due to a different scale, cannot be directly compared. By comparing the nearest phoneme pairs over different phonemes and spectral estimation methods, we observe that different spectral representations result in slightly different phoneme pairs. In addition we observe that, on average, phonemes with a small value of u are more easily confused (smaller distance) with other phonemes than phonemes with a high u value. This can be explained by the energy of the different phoneme classes: the phoneme classes belonging to small u values contain less energy and are thus more strongly distorted by background noise.

Comparing the power spectrum with the warped MVDR envelope, we observe that the power spectrum tends to provide lower confusability for lower u values and higher confusability for higher u values. The warped-twice LP and warped-twice MVDR envelopes have a similar distance structure over u, with, on average, larger distances for the warped-twice MVDR envelopes. While the warped-twice MVDR envelope, compared to the warped MVDR envelope, provides a lower confusability for small values of u, the confusability is higher for larger values of u. While the warped MVDR envelope is not able to provide a lower confusability over the whole range of u in comparison to the power spectrum, the warped-twice MVDR envelope provides, on average, a lower confusability over the whole range of u in comparison to the power spectrum.

7. Conclusion

We have introduced warped-twice MVDR spectral estimation by extending warped MVDR estimation with a second bilinear transformation. With this extension, it is possible to steer the spectral resolution to lower or higher frequencies while keeping the overall resolution of the estimate and the frequency axis fixed. We have demonstrated one possible application in the front-end of a speech-to-text system by steering the resolution of the spectral envelope to classification relevant spectral regions. The proposed framework showed consistent improvements in terms of class separability and WER on a large vocabulary speech recognition task on close talking as well as on distant speech recordings. Further improvements might be expected from a more suitable steering function.

References

Acero, A., 1990. Acoustical and Environmental Robustness in Automatic Speech Recognition. Ph.D. Thesis, Carnegie Mellon University.

Braccini, C., Oppenheim, A.V., 1974. Unequal bandwidth spectral analysis using digital frequency warping. IEEE Trans. Acoust. Speech Signal Process. 22, 236–244.

Capon, J., 1969. High-resolution frequency-wavenumber spectrum analysis. Proc. IEEE 57, 1408–1418.

Computers in the Human Interaction Loop (CHIL). <http://chil.server.de>.

Davis, S., Mermelstein, P., 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 28 (4), 357–366.

Dharanipragada, S., Rao, B., 2001. MVDR based feature extraction for robust speech recognition. Proc. ICASSP, 309–312.

Dharanipragada, S., Yapanel, U., Rao, B., 2007. Robust feature extraction for continuous speech recognition using the MVDR spectrum estimation method. IEEE Trans. Speech Audio Process. 15 (1), 224–234.

Driaunys, K., Rudzionis, V., Zvinys, P., 2005. Analysis of vocal phonemes and fricative consonant discrimination based on phonetic acoustics features. Inform. Technol. Contr. 34 (3), 257–262.

Gales, M.J.F., 1998. Maximum likelihood linear transformations for HMM-based speech recognition. Comput. Speech Language 12, 75–98.

Gales, M., 1999. Semi-tied covariance matrices for hidden Markov models. IEEE Trans. Speech Audio Process. 7, 272–281.

Haeb-Umbach, R., 1999. Investigations on inter-speaker variability in the feature space. Proc. ICASSP, 397–400.

Härmä, A., Laine, U., 2001. A comparison of warped and conventional linear predictive coding. IEEE Trans. Speech Audio Process. 9 (5), 579–588.

Haykin, S., 1991. Adaptive Filter Theory, third ed. Prentice Hall.

Hermansky, H., 1990. Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Amer. 87 (4), 1738–1752.

Linguistic Data Consortium (LDC), Translanguage English Database. <www.ldc.upenn.edu/Catalog/LDC2002S04.html>.

Leggetter, C., Woodland, P., 1995. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Comput. Speech Language 9 (2), 171–185.

Makhoul, J., 1975. Linear prediction: a tutorial review. Proc. IEEE 63 (4), 561–580.

Matsumoto, H., Moroto, M., 2001. Evaluation of mel-LPC cepstrum in a large vocabulary continuous speech recognition. Proc. ICASSP, 117–120.

Matsumoto, M., Nakatoh, Y., Furuhata, Y., 1998. An efficient mel-LPC analysis method for speech recognition. Proc. ICSLP, 1051–1054.

Mesgarani, N., David, S., Shamma, S., 2007. Representation of phonemes in primary auditory cortex: how the brain analyzes speech. Proc. ICASSP, 765–768.

Murthi, M., Rao, B., 1997. Minimum variance distortionless response (MVDR) modeling of voiced speech. Proc. ICASSP, 1687–1690.

Murthi, M., Rao, B., 2000. All-pole modeling of speech based on the minimum variance distortionless response spectrum. IEEE Trans. Speech Audio Process. 8 (3), 221–239.

Musicus, B., 1985. Fast MLM power spectrum estimation from uniformly spaced correlations. IEEE Trans. Acoust. Speech Signal Process. 33, 1333–1335.

Nakatoh, Y., Nishizaki, M., Yoshizawa, S., Yamada, M., 2004. An adaptive mel-LP analysis for speech recognition. Proc. ICSLP.

NIST, 2005. Rich Transcription 2005 Spring Meeting Recognition Evaluation. <www.nist.gov/speech/tests/rt/rt2005/spring>.

Nocerino, N., Soong, F., Rabiner, L., Klatt, D., 1985. Comparative study of several distortion measures for speech recognition. Proc. ICASSP, 25–28.

Olive, J., 1993. Acoustics of American English Speech: A Dynamic Approach. Springer.

Oppenheim, A., Schafer, R., 1989. Discrete-Time Signal Processing. Prentice-Hall Inc.

Oppenheim, A., Johnson, D., Steiglitz, K., 1971. Computation of spectra with unequal resolution using the fast Fourier transform. Proc. IEEE 59 (2), 299–301.

Smith III, J.O., Abel, J.S., 1999. Bark and ERB bilinear transforms. IEEE Trans. Speech Audio Process. 7 (6), 697–708.

Stevens, S., Volkman, J., Newman, E., 1937. The mel scale equates the magnitude of perceived differences in pitch at different frequencies. J. Acoust. Soc. Amer. 8 (3), 185–190.

Strube, H., 1980. Linear prediction on a warped frequency scale. J. Acoust. Soc. Amer. 68 (8), 1071–1076.

Welling, L., Ney, H., Kanthak, S., 2002. Speaker adaptive modeling by vocal tract normalization. IEEE Trans. Speech Audio Process. 10 (6), 415–426.

Wölfel, M., 2003. Mel-Frequenzanpassung der Minimum Varianz Distortionless Response Einhüllenden. Proc. ESSV, 22–29.

Wölfel, M., 2006. Warped-twice minimum variance distortionless response spectral estimation. Proc. EUSIPCO.

Wölfel, M., McDonough, J., 2005. Minimum variance distortionless response spectral estimation, review and refinements. IEEE Signal Process. Mag. 22 (5), 117–126.

Wölfel, M., McDonough, J., Waibel, A., 2003. Minimum variance distortionless response on a warped frequency scale. Proc. Eurospeech, 1021–1024.

Yule, G.U., 1927. On a method of investigating periodicities in disturbed series, with special reference to Wolfer's sunspot numbers. Philos. Trans. Royal Soc. 226A, 267–298.