Data-driven environmental compensation for speech recognition: A unified approach



Pedro J. Moreno *, Bhiksha Raj, Richard M. Stern

Department of Electrical and Computer Engineering & School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213,

USA

Received 26 July 1996; received in revised form 3 November 1997; accepted 8 May 1998

Abstract

Environmental robustness for automatic speech recognition systems based on parameter modification can be accomplished in two complementary ways. One approach is to modify the incoming features of environmentally-degraded speech to more closely resemble the features of the (normally undegraded) speech used to train the classifier. The other approach is to modify the internal statistical representations of speech features used by the classifier to more closely resemble the features representing degraded speech in a particular target environment. This paper attempts to unify these two approaches to robust speech recognition by presenting several techniques that share the same basic assumptions and internal structure while differing in whether they modify the features of incoming speech or whether they modify the statistics of the classifier itself. We present the multivaRiate gAussian-based cepsTral normaliZation (RATZ) family of algorithms, which modify incoming cepstral features, along with the STAR (STAtistical Reestimation) family of algorithms, which modify the internal statistics of the classifier. Both types of algorithms are data driven, in that they make use of a certain amount of adaptation data for learning compensation parameters. The algorithms were evaluated using the SPHINX-II speech recognition system on subsets of the Wall Street Journal database. While all algorithms demonstrated improved recognition accuracy compared to previous algorithms, the STAR family of algorithms tended to provide lower error rates than the RATZ family of algorithms as the SNR was decreased. © 1998 Elsevier Science B.V. All rights reserved.

Résumé

In speech recognition systems, robustness to the environment through parameter adaptation can be obtained in two complementary ways. A first approach consists of modifying the acoustic parameters of the speech degraded by the environment so that they resemble the parameters of the (usually undegraded) speech used during training. The second solution is to modify the statistical parameters internal to the recognizer so that they better represent the characteristics of degraded speech in a particular target environment. The present paper attempts to unify these two approaches to robust speech recognition by presenting several techniques that share the same basic assumptions and the same structure, while differing in whether they modify the input parameters or the statistical parameters of the recognizer. We present here the family of algorithms based on multivariate Gaussian cepstral normalization (RATZ), which modify the incoming cepstral features, as well as the STAR (statistical reestimation) algorithms, which modify the internal parameters of the recognizer. Both types of algorithms are data driven and use a certain amount of adaptation data to estimate the compensation parameters. The algorithms were evaluated using the SPHINX-II recognition system on subsets of the Wall Street Journal database. While all the algorithms yielded improved performance relative to previous algorithms, the STAR family of algorithms generally gave lower error rates than the RATZ family of algorithms as the signal-to-noise ratio decreased. © 1998 Elsevier Science B.V. All rights reserved.

Speech Communication 24 (1998) 267–285

* Corresponding author. Current address: Speech Interaction Technology Group, Cambridge Research Laboratory, Digital Equipment Corporation, One Kendall Square, Bldg. 700, Cambridge, MA 02139-1562, USA. Tel.: +1 617 692 7692; fax: +1 617 692 7650; e-mail: [email protected].

0167-6393/98/$19.00 © 1998 Elsevier Science B.V. All rights reserved.
PII: S0167-6393(98)00025-9

Keywords: Robust speech communication; Environmental compensation; Data-driven algorithms

1. Introduction

As speech recognition technology transitions from the laboratory to practical applications, the importance of environmental robustness in speech recognition is becoming increasingly appreciated. Many research groups have described algorithms that achieve a greater degree of environmental robustness using a variety of techniques. For example, Acero and Stern (1990), Varga and Moore (1990), Ephraim (1992), Gales and Young (1993), Leggeter and Woodland (1995), and Sankar and Lee (1995) have all described compensation strategies that involve the use of a parametric model of degradation, combined with optimal estimation of the parameters of the model. Other approaches to robustness include the use of peripheral features based on models of human perception of speech signals (e.g. Ghitza, 1986; Seneff, 1988; Hermansky, 1990; Hermansky and Morgan, 1994), as well as the use of arrays of microphones to separate noise sources from speech signals arriving from different spatial locations (e.g. Flanagan et al., 1985; Sullivan and Stern, 1993). The reviews of Juang (1991), Stern et al. (1996), and Junqua and Haton (1996) are among a number of recent overviews of the robustness field.

In this paper we consider compensation methods that achieve environmental robustness by directly observing the effects of environmental degradation, without the use of structural assumptions about the nature of the degradation. These modifications are generally based on empirical comparisons of parameters derived from the high-quality speech used to train the system to parameters derived from degraded speech sampled in the target environment. Some of these techniques have attempted to address the problem by applying compensation factors to the incoming cepstrum vectors (e.g. Acero and Stern, 1990; Liu et al., 1994; Neumeyer and Weintraub, 1994; Moreno et al., 1996), while other techniques have modified parameters characterizing the statistical model of speech sounds, such as the means and covariances of the distributions of HMMs (e.g. Gales, 1995; Sankar and Lee, 1995). However, both kinds of approaches have usually been presented as separate and distinct techniques.

In this paper we present a series of data-driven empirical compensation methods that use the same unified mathematical scheme to compensate both the incoming data and the internal statistical representations in the classifier. It thus becomes possible either to modify the incoming cepstral vectors or to modify the parameters (means and variances) of the distributions of the HMMs using the same basic mathematical formulation of the problem.

The compensation techniques described in this paper are not constrained to operate on any particular feature set and could be applied to any feature representation, such as cepstra, LPC-cepstra or even auditory-based representations. However, all the experiments described in this paper were performed on cepstral vectors.

In Section 2 of this paper we make some observations about the effects of the environment on the distributions of log spectra of clean speech by considering simulations using artificially-generated data. We also discuss the reasons for the degradation in recognition accuracy introduced by the environment. In Section 3 we develop the unified mathematical scheme for our data-driven environmental compensation methods, which can modify either the incoming feature vectors of degraded speech or the internal distributions characterizing clean speech. In Section 4 we present the multivaRiate gAussian-based cepsTral normaliZation (RATZ) family of algorithms, which provides compensation of the incoming features. We describe in detail the mathematical structure of the algorithms, and we present experimental results that explore some of the dimensions of the algorithm. In Section 5 we present the STAR (STAtistical Reestimation) family of algorithms, which modify the internal statistics of the classifier. We describe the algorithms and present experimental results. We present comparisons of the STAR and RATZ algorithms and show that STAR compensation results in greater recognition accuracy. Finally, in Section 6 we summarize our findings.

2. Effects of the environment on distributions of clean speech

``Environment'' in the context of speech recognition systems could include such varied effects as low-quality microphones, non-linearities in a telephone network, line noise, reverberation, and competing speakers. It is therefore impossible to enumerate each of these conditions and analyze their effects on the distributions of speech individually. For this reason we generically model the effects of the environment on the distribution of clean speech as additive terms applied to the means and variances of the distributions of clean speech. To justify this approach we present the very simple example of the effect of a linear time-invariant channel and additive uncorrelated noise on the log-spectra of speech. Although our example is presented in terms of log-spectral features, the approaches discussed in this paper are general and work in principle for any types of features. (Cepstral features in particular are related to log-spectral features through a linear transformation, so similar results will apply directly for cepstral features as well.) Finally, using this approach we propose two generic solutions to the problem of robustness.

2.1. The effect of linear filtering and additive noise

A reasonable model for the environment for the case of linear filtering and additive noise is given in Fig. 1. This representation was proposed by Acero (1991) and later used by Gales (1995), among many other researchers. The effect of the noise and filtering on clean speech in the power spectral domain can be represented as:

P_Y(\omega_k) = |H(\omega_k)|^2 P_X(\omega_k) + P_N(\omega_k),  (1)

where P_Y(\omega_k) represents the power spectrum of the degraded speech y[m], P_N(\omega_k) represents the power spectrum of the noise n[m], P_X(\omega_k) represents the power spectrum of the clean speech x[m], |H(\omega_k)|^2 represents the squared magnitude of the transfer function of the filter with impulse response h[m] representing channel effects, and \omega_k represents a particular mel-spectral band. Our assumptions are that the noise is stationary and uncorrelated with the speech signal, and that the channel is time invariant and independent of the signal level.

Transforming to the log-spectral domain, we obtain:

\log(P_Y(\omega_k)) = \log(|H(\omega_k)|^2 P_X(\omega_k) + P_N(\omega_k)).  (2)

Defining the log spectra of degraded speech, noise, and clean speech, and the log of the squared magnitude of the transfer function of the filter as

Fig. 1. A model of the environment for additive noise and filtering by a linear channel. x[m] represents the clean speech signal, n[m] represents the additive noise and y[m] represents the resulting degraded speech signal. The function h[m] represents the effects of a linear channel.


\tilde{Y}(k) = \log(P_Y(\omega_k)), \quad \tilde{N}(k) = \log(P_N(\omega_k)), \quad \tilde{X}(k) = \log(P_X(\omega_k)), \quad \tilde{H}(k) = \log(|H(\omega_k)|^2),  (3)

we obtain the following equation:

\tilde{Y}(k) = \tilde{X}(k) + \tilde{H}(k) + \log(1 + \exp(\tilde{N}(k) - \tilde{X}(k) - \tilde{H}(k))).  (4)

This expression can be shortened to

\tilde{Y}(k) = \tilde{X}(k) + f(\tilde{X}(k), \tilde{H}(k), \tilde{N}(k)),  (5)

or in vector form

y = x + f(x, h, n),  (6)

where

f(\tilde{X}(k), \tilde{H}(k), \tilde{N}(k)) = \tilde{H}(k) + \log(1 + \exp(\tilde{N}(k) - \tilde{X}(k) - \tilde{H}(k))).  (7)

Let us assume that the log-spectral vectors that characterize clean speech follow a Gaussian distribution N_x(\mu_x, \Sigma_x), and that the noise and channel are perfectly known. Eq. (7) provides a transformation of random variables leading to a new distribution for the log spectra of degraded speech equal to

p(y \,|\, \mu_x, \Sigma_x, n, h) = \{((2\pi)^D |\Sigma_x|)^{1/2} \, |I - \mathrm{diag}(e^{n-y})|\}^{-1} \exp\!\left(-\tfrac{1}{2}\,(y - h - \mu_x + \log(i - e^{n-y}))^T \, \Sigma_x^{-1} \, (y - h - \mu_x + \log(i - e^{n-y}))\right),  (8)

where D is the dimensionality of the random variable that characterizes the log spectrum, i is the unitary vector (a vector of ones), I is the identity matrix, and \mathrm{diag}(e^{n-y}) is the diagonal matrix whose ith diagonal element is the ith element of e^{n-y}.

The resulting distribution p(y) is non-Gaussian. However, since a Gaussian distribution assigned to p(y) can still capture part of the effect of the channel and noise, we will continue to assume Gaussian distributions. To characterize the Gaussian distributions we need only compute the mean vectors and covariance matrices of these new distributions. The new mean vector can be computed as:
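To make the mismatch function concrete, Eq. (4) can be evaluated numerically. The sketch below is our own illustration, not material from the paper; the vectors x, h and n are invented log-spectral values chosen to show the two limiting regimes.

```python
import numpy as np

def mismatch(x, h, n):
    """f(x, h, n) = h + log(1 + exp(n - x - h)), Eq. (7), elementwise."""
    return h + np.log1p(np.exp(n - x - h))

# Invented log-spectral values, one per band (illustration only)
x = np.array([10.0, 8.0, 2.0])    # clean speech log spectrum
h = np.array([-1.0, -1.0, -1.0])  # channel term, log |H|^2
n = np.array([2.0, 2.0, 2.0])     # noise log spectrum

y = x + mismatch(x, h, n)         # degraded log spectrum, Eqs. (4)-(6)
```

In the high-SNR band (x = 10) the output y is essentially x + h, while in the low-SNR band (x = 2) y is pinned near the noise level n; this is the behavior that produces the mean shift and variance compression discussed in Section 2.2.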

\mu_y = \mu_x + \int_X f(x, h, n) \, N_x(\mu_x, \Sigma_x) \, dx  (9)

and the covariance matrix can be computed as:

\Sigma_y = E[xx^T] + E[f(x, h, n) f(x, h, n)^T] + 2E[x f(x, h, n)^T] - \mu_y \mu_y^T
         = E[xx^T] + \int_X (2x + f(x, h, n)) \, f(x, h, n)^T \, N_x(\mu_x, \Sigma_x) \, dx - \mu_y \mu_y^T.  (10)

The integrals in Eqs. (9) and (10) do not have closed-form solutions, so numerical methods must be used to estimate the mean vector and covariance matrix of the distribution.

Unfortunately, the previous assumptions about the noise are too simple. In most cases the noise must be estimated and is not known a priori. A more realistic model would be to assign a Gaussian distribution N_n(\mu_n, \Sigma_n) to the noise. To simplify the resulting equations we can also assume that the noise and the speech are statistically independent. Under these assumptions the mean vector and the covariance matrix of the log spectrum of the degraded speech will have the form

\mu_y = \mu_x + \int_X N_x(\mu_x, \Sigma_x) \int_N f(x, h, n) \, N_n(\mu_n, \Sigma_n) \, dn \, dx,  (11)

\Sigma_y = \Sigma_x + \mu_x \mu_x^T + \int_X N_x(\mu_x, \Sigma_x) \int_N f(x, h, n) f(x, h, n)^T \, N_n(\mu_n, \Sigma_n) \, dn \, dx + \int_X N_x(\mu_x, \Sigma_x) \int_N (x f(x, h, n)^T + f(x, h, n) x^T) \, N_n(\mu_n, \Sigma_n) \, dn \, dx - \mu_y \mu_y^T.  (12)

These equations also have no closed-form solution, and we must estimate the resulting mean vector and covariance matrix through numerical methods. Gales (1995) presents several approximations to solve these equations.


2.2. One-dimensional simulations of noise-corrupted Gaussian data

To help visualize the resulting distributions of degraded data we plot artificially-produced one-dimensional simulations of Gaussian random variables corrupted by additive noise. These plots simulate the simplified case of a one-dimensional log spectral feature vector of speech. Artificial clean data were produced according to the Gaussian distribution N_x(\mu_x, \sigma_x^2) and contaminated with artificially-produced Gaussian noise with the distribution N_n(\mu_n, \sigma_n^2) and a channel h. The artificially-produced clean data, noise, and channel were combined according to Eq. (4), producing the degraded data set \{y_0, y_1, \ldots, y_{N-1}\}. From this degraded data set we directly estimated the mean and variance of a maximum likelihood (ML) fit as:

\mu_{y,ML} = \frac{1}{N} \sum_{i=0}^{N-1} y_i, \qquad \sigma_{y,ML}^2 = \frac{1}{N} \sum_{i=0}^{N-1} (y_i - \mu_{y,ML})^2.  (13)

We also computed a histogram of the degraded data to estimate its real distribution. The contamination was performed at different signal-to-noise ratios (SNRs), defined by \mu_x - \mu_n. Fig. 2 shows an example of the original distribution of the original clean parameter x (solid curve), the corrupted value of that parameter y after going through the transformation of Eq. (4) that produces the distribution of Eq. (8) (dashed curve), and the best Gaussian fit to that distribution (dotted curve). The SNR of the noisy signal is 3 dB.
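The simulation just described can be sketched in a few lines. The parameter values follow the Fig. 2 caption, but the code itself is our own reconstruction, not the authors' original.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Clean log-spectral data, noise, and channel (values from the Fig. 2 caption)
x = rng.normal(5.0, np.sqrt(3.0), N)   # N_x: mean 5.0, variance 3.0
n = rng.normal(7.0, np.sqrt(0.5), N)   # N_n: mean 7.0, variance 0.5
h = 5.0                                # channel term

# Degrade via Eq. (4): y = x + h + log(1 + exp(n - x - h))
y = x + h + np.log1p(np.exp(n - x - h))

# ML Gaussian fit to the degraded data, Eq. (13)
mu_y, var_y = y.mean(), y.var()
```

A histogram of y reproduces the non-Gaussian shape of Fig. 2; the ML fit shows the mean shifted above x.mean() + h and the variance compressed below the clean-signal variance, as described next.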

We observe that while the resulting probability density function (PDF) is non-Gaussian and sometimes bimodal (as in Fig. 2), a Gaussian fit to the PDF captures some of the effects of the environment on the clean signal. The effect on the parameters of the PDF can be reasonably accurately modeled as a shift in the mean of the PDF and a decrease in its variance. Intuitively, the shift in the mean occurs mainly because of the effect of the channel on the clean speech signal. The compression in variance occurs because of the logarithmic transformation applied to the spectrum: the lower region of the distribution of spectral power is shifted more than the upper region, thereby causing the overall density to shrink in width. Note, however, that this compression of the variance will occur only if the variance of the distribution of the noise is smaller than the variance of the distribution of the clean signal. This change in variance can be represented by an additive factor in the covariance matrix.

2.3. Modeling the effects of the environment as correction factors

In summary, the one-dimensional simulations of the effects of noise on the data imply the following effects:
· The PDF of the degraded signal is non-Gaussian, and may exhibit a bimodal shape.
· The PDF of the signal shifts according to the SNR.
· The PDF of the signal is compressed if \Sigma_x > \Sigma_n or expanded if \Sigma_x < \Sigma_n. The amount of compression or expansion depends on the SNR.
Since most speech recognition systems model the statistics of speech as mixtures of Gaussian distributions, it is still convenient to continue modeling the resulting PDFs of degraded speech as Gaussian. A simple way to achieve this is by (1) modeling the mean of the distribution of degraded speech as the mean of the clean signal plus a correction vector

Fig. 2. Estimates of the distribution of degraded Gaussian data via Monte Carlo simulation. The solid line represents the PDF of the clean signal. The dashed line represents the actual PDF of the noise-contaminated signal. The dotted line represents the best Gaussian approximation to the PDF of the degraded signal. The original clean signal had a mean of 5.0 and a variance of 3.0, and the parameter representing the channel effect was set equal to 5.0. The mean of the noise was 7.0 and its variance was 0.5.


\mu_y = \mu_x + r  (14)

and (2) by modeling the covariance matrix of the distribution of degraded speech as the covariance matrix of the clean speech plus a correction matrix

\Sigma_y = \Sigma_x + R.  (15)

The R matrix will be symmetric and will have positive or negative elements according to the value of the covariance matrix of the noise compared with that of the clean signal. It must also be of a form that guarantees that the resulting \Sigma_y matrix remains a valid covariance matrix. This approach will be used for the algorithms presented throughout this paper.

3. A unified view of data-driven environment compensation

In Section 2 we suggested that the effect of the environment on the distributions of the log spectra or cepstra of clean speech can be modeled by additive correction factors applied to the mean vectors and covariance matrices. In this section we present techniques that attempt to learn the parameters that characterize these effects directly from sample data, by comparing sets of degraded and clean vectors. This approach does not explicitly assume any model of the environment, but uses empirical observations to infer environmental characteristics.

We first introduce a unified view of environmental compensation and then provide solutions for the compensation factors. We then particularize these solutions for the case of adaptation datasets that were simultaneously recorded using clean speech and degraded speech from the target environment (``stereo'' data). Subsequently we particularize the generic solutions for two families of techniques: the multivaRiate gAussian-based cepsTral normaliZation (RATZ) techniques (Moreno et al., 1995a, b) and the STAtistical Reestimation (STAR) techniques (Moreno et al., 1995b; Moreno, 1996). The performance of these techniques for several databases and experimental conditions is then explored.

3.1. Basic assumptions

We model the distribution of the tth vector x_t of a cepstral vector sequence of length T, X = \{x_1, x_2, \ldots, x_T\}, generically as:

p(x_t) = \sum_{k=1}^{K} a_k(t) \, N_x(\mu_{x,k}, \Sigma_{x,k}).  (16)

In other words, p(x_t) is modeled as a summation of K Gaussian components with a priori probabilities that are time dependent. Assuming that each vector x_t is independent and identically distributed (i.i.d.), the overall likelihood for the full observation sequence X becomes

l(X) = \prod_{t=1}^{T} p(x_t) = \prod_{t=1}^{T} \sum_{k} a_k(t) \, N_x(\mu_{x,k}, \Sigma_{x,k}).  (17)

The above likelihood equation offers a double interpretation, depending on whether the compensation is to be performed by modifying the incoming features or by modifying the internal statistical representation used by the recognition system. For convenience, we refer to compensation methods that modify incoming features as ``data compensation methods'' and compensation methods that modify internal statistical representations as ``model compensation methods''. In the sections below we describe two separate but closely-related families of algorithms for environmental compensation: the multivaRiate gAussian-based cepsTral normaliZation (RATZ) family of algorithms, which provides data compensation, and the STAR (STAtistical Reestimation) family of algorithms, which provides model compensation.

For data compensation methods, we set the a priori probabilities a_k(t) of the equations above to be independent of t. This defines a conventional mixture of Gaussian distributions for the entire training set of cepstral vectors.

The interpretation is slightly different for model compensation methods. We assume that the cepstral speech vectors are emitted by an HMM with K states in which each state emission PDF is composed of a single Gaussian. In this case the a_k(t) terms define the probability of being in state k at time t and follow the state probabilities of a Markov chain defined by the state initial probabilities and the transition matrix of the HMM. Under these assumptions the expression of the likelihood for the full observation sequence is exactly as expressed in Eq. (17).

The assumption of a single Gaussian per state is not limiting at all. Specifically, any state with a mixture of Gaussians for its emission probabilities can also be represented by multiple states whose output distributions are single Gaussians, where the incoming transition probabilities of each state are the same as the a priori probabilities a_k(t) of the Gaussians and the exiting transition probability is unity. Fig. 3 illustrates this idea for a specific arbitrary state.

The probabilities a_k(t) depend only on the Markov chain topology and are represented by

a(t) = [a_1(t) \; a_2(t) \; \ldots \; a_K(t)]^T = A^t \pi,  (18)

where A^t represents the transition matrix after t transitions and \pi represents the initial state probability vector of the HMM. The terms N_x(\mu_{x,k}, \Sigma_{x,k}) of Eq. (16) refer to the Gaussian densities associated with each of the K states of the HMM.
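As a toy illustration of Eq. (18), and not an example from the paper, the occupancy probabilities a(t) can be obtained by repeatedly applying the transition matrix to the initial state distribution. The two-state chain below is invented for this sketch.

```python
import numpy as np

# Invented 2-state chain; A[i, j] = P(next = i | current = j),
# so that the occupancies follow a(t) = A^t pi, as in Eq. (18).
A = np.array([[0.9, 0.2],
              [0.1, 0.8]])
pi = np.array([1.0, 0.0])   # start in state 1 with probability one

a = pi.copy()
for _ in range(3):          # a(3) = A^3 pi
    a = A @ a
```

After three transitions the occupancy vector is still a proper distribution (its entries sum to one), and as t grows it relaxes toward the stationary distribution of the chain.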

As we have mentioned before, the effects of noise and linear filtering on the mean vectors and covariance matrices can be expressed as

\mu_{y,k} = r_k + \mu_{x,k},  (19)

\Sigma_{y,k} = R_k + \Sigma_{x,k},  (20)

where r_k and R_k represent the corrections applied to the mean vector and covariance matrix, respectively, of the kth Gaussian. These two correction factors account for the effects of the environment on the distributions of the cepstra of clean speech. The first step of the RATZ and STAR algorithms is to estimate these correction factors.

3.2. Solutions for the correction factors r_k and R_k

Historically, the development of empirical data-driven environmental compensation algorithms has been motivated and accelerated by the collection and dissemination of databases containing speech that is simultaneously recorded in high-quality ``clean'' environments and in degraded target environments (Liu et al., 1994). In the RATZ and STAR families of algorithms, approaches to solutions for the correction factors r_k and R_k will depend on whether or not such ``stereo'' data are available in a specific task. We first describe the generic solution for the case in which only samples of degraded speech are available, which we refer to as the ``blind'' case. We then describe how to particularize these solutions for the case when stereo data are in fact available.

We make extensive use of the EM algorithm (Dempster et al., 1977; Huang et al., 1993) for estimating the correction factors r_k and R_k. In this paper we describe how the EM algorithm can be applied to solve for the correction parameters r_k and R_k.

3.2.1. Non-stereo-based solutions
We begin with an observed set of T degraded vectors Y = \{y_1, y_2, \ldots, y_T\}, assuming that these vectors have been produced by a probability density function

Fig. 3. A state with a mixture of three Gaussians is equivalent to a set of three states, each with a single Gaussian, where the transition probabilities are the a priori probabilities of each of the Gaussian mixture components. The constant A represents the transition probability into the state.


p(y_t) = \sum_{k=1}^{K} a_k(t) \, N_y(\mu_{y,k}, \Sigma_{y,k}),  (21)

which is a summation of K Gaussians where each component relates to the corresponding kth Gaussian of clean speech according to Eq. (19). We define a likelihood function l(Y) as

l(Y) = \prod_{t=1}^{T} p(y_t) = \prod_{t=1}^{T} \sum_{k} a_k(t) \, N_y(\mu_{y,k}, \Sigma_{y,k}).  (22)

We can also express l(Y) in terms of the original parameters of clean speech and the correction terms r_k and R_k:

l(Y) = l(Y \,|\, r_1, \ldots, r_K, R_1, \ldots, R_K) = \prod_{t=1}^{T} p(y_t) = \prod_{t=1}^{T} \sum_{k} a_k(t) \, N_y(r_k + \mu_{x,k}, R_k + \Sigma_{x,k}).  (23)

For convenience we express the above equation in the logarithm domain, defining the log likelihood L(Y) as

L(Y) = \log(l(Y)) = \sum_{t=1}^{T} \log(p(y_t)) = \sum_{t=1}^{T} \log\!\left(\sum_{k} a_k(t) \, N_y(r_k + \mu_{x,k}, R_k + \Sigma_{x,k})\right).  (24)

Our goal is to find the complete set of K terms r_k and R_k that maximize the likelihood (or log likelihood). As it turns out, there is no direct solution to this problem and some indirect method is necessary. The expectation-maximization (EM) algorithm is one such method.

In Appendix A we outline the solution for the correction terms r_k and R_k. The final solutions of the maximum likelihood equation are:

\bar{r}_k = \frac{\sum_{t=1}^{T} P(s_t = k \,|\, y_t, \phi) \, y_t}{\sum_{t=1}^{T} P(s_t = k \,|\, y_t, \phi)} - \mu_{x,k},  (25)

\bar{R}_k = \frac{\sum_{t=1}^{T} P(s_t = k \,|\, y_t, \phi) \, (y_t - \mu_{x,k} - \bar{r}_k)(y_t - \mu_{x,k} - \bar{r}_k)^T}{\sum_{t=1}^{T} P(s_t = k \,|\, y_t, \phi)} - \Sigma_{x,k},  (26)

where \phi represents the set of parameters r_k and R_k learned in the previous iteration. Eqs. (25) and (26) form the basis for an iterative algorithm. The EM algorithm guarantees that each iteration increases the likelihood of the observed data.
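One iteration of the updates in Eqs. (25) and (26) can be sketched as follows. This is our own simplification, not the authors' implementation: it assumes time-independent priors a_k and diagonal covariances, and the function and variable names are hypothetical.

```python
import numpy as np

def em_corrections(Y, mu_x, var_x, w, r, R):
    """One EM iteration of Eqs. (25)-(26) for the blind (non-stereo) case.

    Diagonal-covariance simplification with time-independent priors:
    Y      (T, D) degraded cepstra
    mu_x   (K, D) clean means;  var_x (K, D) clean variances
    w      (K,)   a priori probabilities a_k
    r, R   (K, D) current mean / variance corrections (the set phi)
    """
    mu_y = mu_x + r                       # Eq. (19)
    var_y = var_x + R                     # Eq. (20)
    # log a_k + log N_y(y_t; mu_y_k, var_y_k) for every (t, k)
    d = Y[:, None, :] - mu_y[None, :, :]
    log_p = -0.5 * (np.log(2 * np.pi * var_y)[None] + d**2 / var_y[None]).sum(-1)
    log_p += np.log(w)[None]
    post = np.exp(log_p - log_p.max(1, keepdims=True))
    post /= post.sum(1, keepdims=True)    # P(s_t = k | y_t, phi)
    mass = post.sum(0)
    r_new = (post.T @ Y) / mass[:, None] - mu_x                     # Eq. (25)
    d2 = (Y[:, None, :] - mu_x[None] - r_new[None]) ** 2
    R_new = (post[:, :, None] * d2).sum(0) / mass[:, None] - var_x  # Eq. (26)
    return r_new, R_new

# Demo: single-component clean model, degraded data shifted by a known +2
rng = np.random.default_rng(1)
mu_x, var_x, w = np.zeros((1, 2)), np.ones((1, 2)), np.array([1.0])
Y = rng.normal(2.0, 1.0, (5000, 2))
r1, R1 = em_corrections(Y, mu_x, var_x, w, np.zeros((1, 2)), np.zeros((1, 2)))
```

With a single component the posteriors are all one, so the update reduces to the sample mean and variance offsets: r1 lands close to the true shift of 2 and R1 close to 0.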

3.2.2. Stereo-based solutions
When simultaneously-recorded clean and degraded speech data (``stereo'' data) are available, the information about the environment is encoded in the stereo pairs. By observing how each clean speech vector is transformed into a degraded speech vector we can learn the correction factors more directly.

We can readily assume that the a posteriori probabilities P(s_t = k \,|\, y_t, \phi) can be directly estimated by P(s_t = k \,|\, x_t). This is equivalent to assuming that the probabilities of a vector being produced by each of the underlying classes do not change due to the environment. We call this assumption a posteriori invariance. This assumption, although not strictly correct, seems to be a good approximation.

If we expand the P(s_t = k \,|\, y_t, \phi) and P(s_t = k \,|\, x_t) terms we obtain

P(s_t = k \,|\, y_t, \phi) = \frac{P(s_t = k) \, p(y_t \,|\, s_t = k, \phi)}{\sum_{j=1}^{K} p(y_t \,|\, s_t = j, \phi) \, P(s_t = j)}  (27)

and

P(s_t = k \,|\, x_t) = \frac{P(s_t = k) \, p(x_t \,|\, s_t = k)}{\sum_{j=1}^{K} p(x_t \,|\, s_t = j) \, P(s_t = j)}.  (28)

For the above two expressions to be equal, each of the terms in the summation must be equal. This would imply that each Gaussian is shifted by exactly the same amount and not compressed. However, at high SNRs the shift for each Gaussian is quite similar and the compression in the variances is almost zero. Therefore, at high SNRs the assumption of a posteriori invariance is approximately valid, while at lower SNRs it is less valid. In addition, this assumption avoids the need to iterate Eqs. (25) and (26).


This leads to

\bar{r}_k = \frac{\sum_{t=1}^{T} P(s_t = k \,|\, x_t)(y_t - \mu_{x,k})}{\sum_{t=1}^{T} P(s_t = k \,|\, x_t)} = \frac{\sum_{t=1}^{T} P(s_t = k \,|\, x_t)(y_t - x_t + x_t - \mu_{x,k})}{\sum_{t=1}^{T} P(s_t = k \,|\, x_t)},  (29)

\bar{r}_k = \frac{\sum_{t=1}^{T} P(s_t = k \,|\, x_t)(x_t - \mu_{x,k})}{\sum_{t=1}^{T} P(s_t = k \,|\, x_t)} + \frac{\sum_{t=1}^{T} P(s_t = k \,|\, x_t)(y_t - x_t)}{\sum_{t=1}^{T} P(s_t = k \,|\, x_t)}.  (30)

The expected value of the first term on the right-hand side of Eq. (30) is zero. We therefore approximate this term by zero. This results in

\bar{r}_k = \frac{\sum_{t=1}^{T} P(s_t = k \,|\, x_t)(y_t - x_t)}{\sum_{t=1}^{T} P(s_t = k \,|\, x_t)}.  (31)

Similarly, we obtain

\bar{R}_k = \frac{\sum_{t=1}^{T} P(s_t = k \,|\, x_t)(y_t - x_t - \bar{r}_k)(y_t - x_t - \bar{r}_k)^T}{\sum_{t=1}^{T} P(s_t = k \,|\, x_t)} - \Sigma_{x,k}.  (32)
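With stereo pairs the corrections of Eqs. (31) and (32) become single-pass, posterior-weighted averages of (y_t − x_t). The sketch below is our own diagonal-covariance illustration (the function and variable names are hypothetical); posteriors are computed on the clean channel, following the a posteriori invariance assumption above.

```python
import numpy as np

def stereo_corrections(X, Y, mu_x, var_x, w):
    """Eqs. (31)-(32): corrections from stereo (clean, degraded) pairs.

    X, Y   (T, D) simultaneously-recorded clean / degraded cepstra
    mu_x   (K, D) clean means;  var_x (K, D) clean (diagonal) variances
    w      (K,)   a priori probabilities a_k
    """
    # Posteriors on the clean channel: P(s_t = k | x_t)
    d = X[:, None, :] - mu_x[None, :, :]
    log_p = -0.5 * (np.log(2 * np.pi * var_x)[None] + d**2 / var_x[None]).sum(-1)
    log_p += np.log(w)[None]
    post = np.exp(log_p - log_p.max(1, keepdims=True))
    post /= post.sum(1, keepdims=True)
    mass = post.sum(0)
    diff = Y - X                                   # y_t - x_t
    r = (post.T @ diff) / mass[:, None]            # Eq. (31)
    d2 = (diff[:, None, :] - r[None]) ** 2
    R = (post[:, :, None] * d2).sum(0) / mass[:, None] - var_x  # Eq. (32)
    return r, R

# Demo: one component, constant channel shift of +3, no added noise
rng = np.random.default_rng(2)
mu_x, var_x, w = np.zeros((1, 1)), np.ones((1, 1)), np.array([1.0])
X = rng.normal(0.0, 1.0, (2000, 1))
Y = X + 3.0
r, R = stereo_corrections(X, Y, mu_x, var_x, w)
```

Here every y_t − x_t equals 3, so the mean correction recovers the shift exactly, and the variance correction equals −var_x because the residuals around the shift have zero variance.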

3.3. Comparison with previous algorithms

The general solutions above can be particularized for cepstral compensation (data compensation) or for adaptation of HMMs (model compensation). When particularized for data compensation these solutions can be directly compared to fixed codeword-dependent cepstral normalization (FCDCN) (Acero, 1991). When particularized for HMM adaptation they can be compared to dual-channel codebook adaptation (DCCA) (Liu et al., 1994), maximum likelihood linear regression (MLLR) (Leggeter and Woodland, 1995), parallel model combination (PMC) (Gales, 1995), and stochastic matching (Sankar and Lee, 1995).

FCDCN is a data-driven cepstral compensation algorithm introduced by Acero (1991) and further developed by Liu (Liu et al., 1994). In the FCDCN algorithm the data are represented by a codebook computed using vector quantization (VQ). All data belonging to a particular VQ cell of the clean speech codebook are assumed to have been generated by a corresponding cell in the codebook representing degraded speech. Furthermore, FCDCN assumes that the difference between the data in a cell of the degraded speech and the corresponding data for clean speech is the same as the difference between the codewords representing the degraded speech cell and the clean speech cell. This can be viewed as a degenerate case of a modification of the equations in Section 3.2 in which the posterior probabilities of the Gaussians have been rounded off to 1 or 0 (with the most likely Gaussian receiving an a posteriori probability of 1, and the rest 0). Since the FCDCN method is quantization based, there can be no learning of variances.

DCCA is a data-driven HMM adaptation algorithm similar to FCDCN and was also introduced by Liu (Stern et al., 1994). DCCA adapts the codewords in the HMM codebooks representing the Gaussians using the same principle as FCDCN. It estimates an additive correction term for each of the codewords in the codebook as the difference between the centroid of all observations from clean speech belonging to that codeword and the centroid of their degraded counterparts. This correction is then applied to the codewords before recognition. This can be viewed as a degenerate case of the particularization of the equations to HMM adaptation using Eqs. (31) and (32), in which the posterior probabilities of the Gaussians in the HMM codebooks have been quantized to 1 or 0. As in the case of FCDCN, there is no learning of variances.

MLLR (Leggeter and Woodland, 1995) was initially designed for speaker adaptation, but it has proven quite effective in the field of environmental robustness as well. MLLR models the effect of the environment on the means of the statistics representing clean speech as a linear shift and a rotation matrix. In its latest implementation it also models the effect of the environment on the covariance matrices as a multiplicative matrix. Such a model of the effect of the environment on speech distributions is clearly more flexible than the one proposed in this paper. The STAR algorithm, when implemented without stereo data, can be thought of as a special case of the MLLR algorithm.

PMC (Gales, 1995) attempts to solve Eqs. (11) and (12) using numerical approximations. The main difference between PMC and the algorithms proposed here is that PMC attempts to obtain an analytical solution for the correction factors rather than learning them directly from data as does MLLR. This is at once a virtue and a limitation of model-based algorithms like PMC: if the structural assumptions of the model are not valid in a particular environment, they can produce erroneous estimates for the correction terms. Data-driven algorithms such as MLLR or the RATZ and STAR algorithms proposed in this paper bypass this limitation by learning the correction factors directly from the data.

P.J. Moreno et al. / Speech Communication 24 (1998) 267–285

The stochastic matching algorithm (Sankar and Lee, 1995) is the closest in spirit to the STAR algorithm when no stereo data are provided. Some of our solutions are identical to those provided by the stochastic matching algorithm. The main difference lies in the ability of the algorithms proposed here to take advantage of stereo-recorded data. As we show in our experiments, the use of stereo data provides much better solutions for the correction terms r_k and R_k.

4. The RATZ family of algorithms

In this section we particularize the general solutions described in Section 3 for the multivariate Gaussian-based cepstral normalization (RATZ) family of algorithms, which operate on the incoming features. We present an overview of the algorithms and describe in detail the steps followed in RATZ-based compensation. We describe the general stereo-based and blind versions of the algorithms. Finally, we compare experimental results across several databases and environmental conditions.

4.1. Overview of RATZ and blind RATZ

The RATZ algorithms work in three stages: (1) estimation of the statistics of clean speech, (2) estimation of the statistics of degraded speech, and (3) compensation of the degraded speech. We describe these steps in detail below.

Estimation of the statistics of clean speech. The PDF for the features of clean speech is modeled as a mixture of multivariate Gaussian distributions. Under these assumptions the distribution of the cepstral vectors of clean speech can be written as

p(x_t) = \sum_{k=1}^{K} a_k N_x(\mu_{x,k}, \Sigma_{x,k}),   (33)

which is equivalent to Eq. (21) for the case of a_k(t) being time independent. The parameters a_k, \mu_{x,k} and \Sigma_{x,k} represent, respectively, the a priori probabilities, mean vectors, and covariance matrices of each multivariate Gaussian mixture component k. These parameters are learned through traditional maximum likelihood EM methods (Dempster et al., 1977). The covariance matrices are assumed to be diagonal.
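As a concrete illustration of this step, the following sketch fits a diagonal-covariance Gaussian mixture of the form of Eq. (33) to clean cepstral frames by EM. This is our own minimal NumPy illustration, not the authors' code; the function name is hypothetical, and a production system would add convergence checks and variance flooring beyond the small constant used here.

```python
import numpy as np

def fit_diagonal_gmm(X, K, n_iter=20, seed=0):
    """Fit a K-component diagonal-covariance Gaussian mixture (Eq. 33)
    to clean cepstra X (T x D) by EM. A minimal sketch."""
    rng = np.random.default_rng(seed)
    T, D = X.shape
    a = np.full(K, 1.0 / K)                     # priors a_k
    mu = X[rng.choice(T, K, replace=False)]     # means mu_{x,k}
    var = np.tile(X.var(axis=0), (K, 1))        # diagonal covariances
    for _ in range(n_iter):
        # E-step: log of a_k * N(x_t | mu_k, var_k) for every (t, k) pair
        log_p = (-0.5 * (((X[:, None, :] - mu) ** 2) / var
                         + np.log(2 * np.pi * var)).sum(axis=2)
                 + np.log(a))
        log_p -= log_p.max(axis=1, keepdims=True)   # stabilize the exponent
        post = np.exp(log_p)
        post /= post.sum(axis=1, keepdims=True)     # posteriors P(k | x_t)
        # M-step: reestimate priors, means and diagonal variances
        Nk = post.sum(axis=0)
        a = Nk / T
        mu = (post.T @ X) / Nk[:, None]
        var = (post.T @ X**2) / Nk[:, None] - mu**2 + 1e-6
    return a, mu, var
```

By construction of the M-step, the mixture mean \sum_k a_k \mu_{x,k} matches the sample mean of the training frames, which provides a simple sanity check on the fit.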

Estimation of the statistics of degraded speech. We assume that the effect of the environment on the statistics of speech can be accurately modeled by applying the proper correction factors to the mean vectors and covariance matrices. If stereo data are not available, we obtain as solutions Eqs. (25) and (26) with P(s_t(k) | y_t, \phi) equal to P(k | y_t, \phi), since in this case the PDF for the features of clean speech is modeled by a mixture of multivariate Gaussian distributions.

The term P(k | y_t, \phi) represents the a posteriori probability of an observed degraded vector y_t being produced by Gaussian k. The solutions are iterative, and each iteration guarantees that the likelihood of the data does not decrease.

If stereo data are available we obtain as solutions Eqs. (31) and (32) with P(s_t(k) | x_t) equal to P(k | x_t) since, as before, the PDF for the features of clean speech is modeled by a mixture of multivariate Gaussian distributions. In this case the solutions are non-iterative.
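When stereo data are available, the corrections can thus be computed in closed form, with posteriors evaluated on the clean channel. The sketch below is our own illustration of the non-iterative estimates corresponding to Eqs. (31) and (32), assuming diagonal covariances; `post` is assumed to hold the T x K posteriors P(k | x_t) computed beforehand from the clean-speech mixture.

```python
import numpy as np

def stereo_corrections(Y, post, mu_x, var_x):
    """Closed-form stereo estimates of the correction terms.

    Y     : T x D degraded frames, time-aligned with the clean channel
    post  : T x K posteriors P(k | x_t) from the clean-speech mixture
    mu_x  : K x D clean means;  var_x : K x D diagonal clean variances
    Returns the mean shifts r_k and variance shifts R_k."""
    Nk = post.sum(axis=0)                          # soft count per Gaussian
    r = (post.T @ Y) / Nk[:, None] - mu_x          # mean correction r_k
    dev = Y[:, None, :] - mu_x - r                 # y_t - mu_{x,k} - r_k
    R = (post[:, :, None] * dev**2).sum(axis=0) / Nk[:, None] - var_x
    return r, R                                    # variance correction R_k
```

With a single Gaussian this reduces to the mean and variance of the channel difference, which makes the behavior easy to verify.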

Compensation of degraded speech. The solution for the correction factors {r_1, ..., r_K, R_1, ..., R_K} allows us to learn the new distributions of the cepstral vectors of degraded speech. With this knowledge we can estimate the best correction to apply to each incoming degraded vector y to obtain an estimated clean vector x. To do so we use a minimum mean-squared error (MMSE) estimator

\hat{x}_{MMSE} = E[x | y] = \int_X x\, p(x|y)\, dx.   (34)


Because this equation requires knowledge of the marginal distribution p(x|y), and because closed-form solutions are frequently difficult or impossible to obtain, some simplifications are needed. We model the effect of the environment on clean speech as an additive factor r(x) applied to the clean vector x to give the degraded vector y. Consequently, the clean vector x can be obtained from the degraded vector y as x = y - r(x). In this case Eq. (34) simplifies to

\hat{x}_{MMSE} = y - \int_X r(x)\, p(x|y)\, dx
             = y - \int_X \sum_{k=1}^{K} r(x)\, p(x, k|y)\, dx
             = y - \sum_{k=1}^{K} P(k|y) \int_X r(x)\, p(x|k, y)\, dx
             = y - \sum_{k=1}^{K} r_k P(k|y) \int_X p(x|k, y)\, dx
             = y - \sum_{k=1}^{K} r_k P(k|y),   (35)

where we have further simplified the r(x) expression to r_k. This is equivalent to assuming that the r(x) term can be well approximated by a constant value within the region in which p(x|k, y) has a significant value.
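The final correction of Eq. (35) therefore subtracts a posterior-weighted sum of the r_k from each incoming frame. The following sketch is our own illustration with diagonal covariances; the posteriors P(k | y_t) are evaluated under the degraded-speech mixture, i.e. means mu_x + r and variances var_x + R.

```python
import numpy as np

def ratz_compensate(Y, a, mu_x, var_x, r, R):
    """MMSE compensation of Eq. (35): x_hat = y - sum_k r_k P(k|y).

    Posteriors are computed under the corrected (degraded-speech)
    mixture with means mu_x + r and diagonal variances var_x + R."""
    mu_y, var_y = mu_x + r, var_x + R
    # log of a_k * N(y_t | mu_{y,k}, var_{y,k}) for every (t, k) pair
    log_p = (-0.5 * (((Y[:, None, :] - mu_y) ** 2) / var_y
                     + np.log(2 * np.pi * var_y)).sum(axis=2)
             + np.log(a))
    log_p -= log_p.max(axis=1, keepdims=True)      # stabilize the exponent
    post = np.exp(log_p)
    post /= post.sum(axis=1, keepdims=True)        # P(k | y_t)
    return Y - post @ r                            # Eq. (35)
```

With a single Gaussian the posterior is 1 everywhere, so the estimator simply subtracts r_1 from every frame.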

4.2. Experimental results

In this section we describe several experiments designed to evaluate the performance of the RATZ family of algorithms. We explore several dimensions of the algorithms, including the impact of the number of adaptation sentences on recognition accuracy and the optimal number of Gaussian mixtures.

The experiments described here were performed using the 5000-word Wall Street Journal (WSJ) 1993 database (Paul and Baker, 1992). We use the speaker-independent training corpus, referred to as WSJ0 si_trn, which contains 7240 utterances from 84 speakers reading sentences from the WSJ. These sentences were recorded using a Sennheiser close-talking noise-cancelling headset. These data are used to build a single set of gender-independent HMMs to train the SPHINX-II system.

The testing set contained 215 sentences with a total of 4066 words from 10 different native speakers of American English, again reading sentences from the WSJ. These sentences were recorded by NIST using a Sennheiser HMD-410 or HMD-414 close-talking, headset-mounted noise-cancelling microphone. We contaminated these data with additive white Gaussian noise at several SNRs; the noise was scaled on a sentence-by-sentence basis to the average power needed to produce the required SNR.

For all the experiments using this database, the upper dotted line in the figures represents the performance of the system when fully trained on degraded data, which is a reasonable estimate of the best possible performance to be expected from a particular recognition system at a given SNR. The lower dotted line represents the performance of the system when no compensation other than baseline cepstral mean normalization (CMN) is applied.

The SPHINX-II speech recognition engine was used for our experiments. SPHINX-II is a large-vocabulary, speaker-independent, semicontinuous hidden Markov model (SC-HMM)-based continuous speech recognition system and was one of the first systems to demonstrate the feasibility of accurate, speaker-independent, large-vocabulary continuous speech recognition (Huang et al., 1991). The feature set used in all the experiments described in this paper was composed of four streams. The first stream contains 12 mel-scaled cepstrum components. The second and third streams contain the cepstrum time derivatives and second time derivatives, respectively. The final stream contains the logarithm of the frame energy together with its first and second time derivatives.

4.2.1. Effect of the number of adaptation sentences

Fig. 4 shows the recognition accuracy of the RATZ algorithm as a function of the number of adaptation sentences. The RATZ algorithm is able to compensate for the effect of the environment even with a very small number of adaptation sentences. In fact, performance appears to be quite insensitive to the number of adaptation sentences until the number falls below 10. As in many other learning problems, the number of parameters that can be "adequately" observed depends on the amount of training data available; in this case 10 sentences appear to be adequate. A clustering technique, such as that used in MLLR (Leggeter and Woodland, 1995) where the parameters to be learned are tied, might help RATZ become more effective with fewer than 10 sentences.

4.2.2. Effect of the number of Gaussian mixtures

Fig. 5 shows the effect of the number of Gaussian mixtures on the recognition accuracy obtained using RATZ. Configurations of the algorithm with 256, 64, and 16 Gaussian mixture components were implemented, with correction factors learned from 100 stereo adaptation sentences. Recognition accuracy increases with the number of Gaussian mixtures, particularly at lower SNRs.

4.2.3. Stereo-based RATZ vs. blind RATZ

Fig. 6 compares the effect of having stereo data available to learn the correction factors on recognition accuracy. Both algorithms, RATZ and blind RATZ, were implemented with 256 Gaussian mixtures. The adaptation set consisted of the same 100 sentences used in our previous experiments; in the case of RATZ the stereo (clean and noisy) pairs were used. Not surprisingly, the absence of stereo data is detrimental to recognition accuracy. Nevertheless, even blind RATZ provides considerable benefits compared to performing no compensation at all (the CMN case).

Fig. 4. The effect of the number of adaptation sentences on recognition accuracy obtained using an implementation of the RATZ algorithm with 128 Gaussians.

Fig. 5. The effect of the number of Gaussians on recognition accuracy of the RATZ algorithms. Results were obtained using 100 adaptation sentences.

Fig. 6. Comparison of recognition accuracy obtained using stereo-based implementations of RATZ vs. the blind RATZ implementation. Results were obtained using 256 Gaussians and 100 adaptation sentences.

4.2.4. Comparisons with FCDCN

The impact of the novel aspects of the RATZ algorithms can be evaluated by comparison to the FCDCN family of algorithms introduced by Acero (1991) and studied further by Liu (Liu et al., 1994). FCDCN can be considered a particular case of the RATZ algorithms in which a VQ codebook represents the statistics of clean speech and the effect of the environment on this codebook is modeled as shifts of its centroids. In FCDCN, an additive correction is associated with each centroid in the codebook, and all centroids are assumed to have the same variance. Compensation is performed on each observation vector by subtracting the correction factor that minimizes the VQ distortion.

Fig. 7 compares recognition accuracy using RATZ with the corresponding results obtained using FCDCN, with each system implemented using 256 Gaussians. The same set of sentences was used to learn the statistics or VQ codebook of clean speech, and the same 100 stereo sentences were used to learn the correction factors. The RATZ algorithm outperforms FCDCN at all SNRs, and the improvement in recognition accuracy is greater at lower SNRs. This improvement is a consequence of the more detailed statistical representation used by RATZ to express the distributions of clean speech, and of the more detailed model used by RATZ to represent the effect of the environment on those distributions. While FCDCN models the clean data as a VQ codebook, representing the space with only a set of codewords, RATZ models it as a Gaussian mixture PDF in which the variance of the clean speech is also modeled. In addition, RATZ models the effect of the environment on the variance, something that is not possible with the representation used by FCDCN.

5. The STAR family of algorithms

In this section we particularize the general solutions described in Section 3 for the STAtistical Reestimation (STAR) family of algorithms. While the RATZ algorithms apply correction factors to the incoming cepstral vectors of degraded speech, the STAR algorithms modify some of the parameters of the acoustical distributions in the HMM structure.

We present an overview of the STAR algorithm and describe in detail the steps followed in STAR-based compensation. We then provide experimental results on several databases and environmental conditions.

5.1. Overview of STAR and blind STAR

The concept of data-driven algorithms that adapt HMMs to new environments has been introduced before. For example, the tied-mixture normalization algorithm proposed by Anastasakos et al. (1994) and the Dual-Channel Codebook Adaptation (DCCA) algorithm proposed by Liu (Stern et al., 1994) are similar in spirit to STAR. However, these previous algorithms are based on VQ indices rather than on Gaussian a posteriori probabilities, and they use a weaker model of the effect of the environment on Gaussian distributions in which the covariance matrices are not corrected. Furthermore, they model the effect of the environment only on cepstrum distributions, without modeling the effect on the other feature streams such as delta cepstra, double delta cepstra, and energy.

Fig. 7. Comparison of recognition accuracy obtained using the RATZ algorithm with recognition accuracy obtained using the FCDCN algorithm. Both algorithms used a similar configuration with 256 Gaussians learned with the same stereo data. The identity of the environment was also presumed to be known.

The STAR algorithm works in two stages: (1) estimation of the statistics of clean speech, and (2) estimation of the statistics of degraded speech. A formal "compensation" stage is not necessary because the original cepstral vectors of the degraded speech are used for recognition.

Estimation of the statistics of clean speech. The STAR algorithm makes use of the acoustical distributions modeled by the HMMs to represent clean speech. (Strictly speaking, this step is not specific to the STAR algorithm.) These distributions, as modeled by the HMMs, are mixtures of multivariate Gaussians. Under these assumptions the distribution for clean speech can be written as

p(x_t) = \sum_{k=1}^{K} a_k(t) N_x(\mu_{x,k}, \Sigma_{x,k}),   (36)

where the a_k(t) term represents the a priori probability of each of the Gaussians of each of the possible states. The \mu_{x,k} and \Sigma_{x,k} terms represent the mean vector and covariance matrix of each multivariate Gaussian mixture component k. These parameters are learned through the well-known Baum–Welch algorithm (Baum, 1972).

Estimation of the statistics of degraded speech. As mentioned in Section 3, we assume that the effect of the environment on the distributions of speech cepstra can be modeled adequately by applying the proper correction factors to the mean vectors and covariance matrices. Our goal is therefore to compute these correction factors to estimate the statistics of degraded speech.

If stereo data are not available, we make use of Eqs. (25) and (26) to obtain the compensation parameters. In this case the term P(s_t(k) | y_t, \phi) represents the a posteriori probability of an observed degraded vector y_t being produced by Gaussian k in state s_t(k) given the set of estimated correction parameters \phi. The solutions are iterative, and each iteration guarantees that the likelihood of the adaptation data does not decrease. Notice that in this case the solutions are very similar to the Baum–Welch reestimation formulae commonly used for HMM training (Baum, 1972; Huang et al., 1993).

If stereo data are available we make use of Eqs. (31) and (32). The solutions to these equations are non-iterative. Note that in the stereo case the substitution of P(s_t(k) | y_t) by P(s_t(k) | x_t) implicitly assumes that the a posteriori probabilities are not changed by the environment.

In addition to the cepstra, most HMM-based speech recognition systems use the difference cepstra and double difference cepstra as parameters. The STAR family of algorithms assumes that all of these streams are affected by the environment through additive factors applied to the means and variances. Although we do not present direct evidence for this assumption, our experimental results, as well as results by Gales (1995), support it. We estimate each of the additional correction factors using formulae equivalent to those used to estimate the correction factors for the cepstral stream.

Once the correction factors are estimated, we can perform recognition using the distributions of degraded speech, estimated as the distributions of clean speech corrected by the appropriate factors.
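Under these assumptions, STAR compensation amounts to shifting each Gaussian's mean and diagonal variance, stream by stream. The sketch below is our own illustration, not the authors' code; the dictionary layout and the variance floor are assumptions we introduce for the example.

```python
import numpy as np

def star_adapt(model, corrections):
    """Apply STAR-style corrections to every feature stream of a set of
    HMM Gaussians (cepstra, delta cepstra, double deltas, energy).

    model       : stream name -> (means K x D, diagonal variances K x D)
    corrections : stream name -> (mean shifts r, variance shifts R)
    The small floor keeps adapted variances positive."""
    adapted = {}
    for stream, (mu, var) in model.items():
        r, R = corrections[stream]
        adapted[stream] = (mu + r, np.maximum(var + R, 1e-8))
    return adapted
```

Recognition then proceeds unchanged on the original degraded cepstra, but scored against the adapted distributions.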

5.2. Experimental results

We performed several experiments to evaluate the performance of the STAR family of algorithms. We explored several dimensions of the algorithm, including the impact of the number of adaptation sentences on recognition accuracy and the effect of the availability of stereo data for learning the correction factors. We also compare the recognition accuracy obtained using the STAR algorithms with the corresponding accuracy obtained using RATZ. The procedures and databases used in these experiments are the same as those described in Section 4.2.


5.2.1. Effect of the number of adaptation sentences

Fig. 8 shows the effect of the number of adaptation sentences on the recognition accuracy obtained using STAR as a function of SNR. These experiments were performed using the semi-continuous HMM-based SPHINX-II system with 256 Gaussian mixtures. The STAR algorithm appears to capture all the information needed for environmental adaptation with only 40 sentences.

These results are in agreement with the results we obtained with the RATZ algorithm. Both algorithms require a minimum of 40 utterances to learn the correction factors. This can be explained by the fact that the number of parameters to be learned is not very large: only 256 correction factors for the mean vectors and diagonal covariances.

5.2.2. Stereo vs. non-stereo adaptation databases

When stereo data are unavailable, the correction factors must be learned iteratively. We explored different alternatives for bootstrapping the iterative learning procedure. Fig. 9 compares recognition accuracy for the original STAR and blind STAR algorithms obtained using 100 adaptation sentences. Ten iterations of the reestimation formulae were used for the blind STAR experiments.

To explore the effect that the initial parameter values have on the blind STAR algorithm, we initialized the reestimation procedure in two ways: (1) with initial correction factors set to zero, and (2) with initial correction factors set to the actual optimal factors learned with stereo data for an environment 5 dB higher in SNR. For example, to learn the correction factors for 25 dB data we would use the correction factors learned with stereo data at 30 dB; this process is repeated down to 0 dB. The goal of this experiment was to explore the sensitivity of blind STAR to initial conditions. We observe that the performance of the blind STAR algorithm depends a great deal on which distributions are chosen initially. While zero initialization gives some improvement in recognition accuracy compared to CMN, a better initial guess of the correction factors improves the performance of blind STAR to almost the level of stereo STAR. In fact, at the lowest SNRs blind STAR actually outperforms stereo-based STAR, perhaps because the assumption that the a posteriori probabilities P(s_t(k) | y_t) can be replaced by P(s_t(k) | x_t) is not valid at low SNRs. This dependency on initial conditions is a serious limitation of the blind STAR algorithm that needs to be addressed.

Fig. 8. Effect of the number of adaptation sentences used to learn the correction factors on the recognition accuracy obtained using the STAR algorithm. Stereo data were used to learn the correction factors.

Fig. 9. Comparison of recognition accuracy obtained using the blind STAR and stereo-based STAR algorithms. The line with star symbols represents the original stereo-based STAR algorithm. The line with diamond symbols represents the blind STAR algorithm bootstrapped from the distributions that were closest in SNR to the test environment.

5.2.3. Comparison of STAR with RATZ

Fig. 10 compares the recognition accuracy obtained using the STAR and RATZ algorithms, each in both the blind and stereo-based implementations. These results were all obtained using 100 adaptation sentences, an equivalent availability of stereo data to learn the correction factors, and comparable recognition systems with 256 Gaussian mixtures. In general, the STAR algorithms always outperform the RATZ algorithms. The STAR algorithm is able to produce almost the same performance as a fully retrained system for SNRs as low as 5 dB. At lower SNRs the a posteriori invariance assumption made by the algorithm is no longer appropriate and recognition accuracy suffers.

It can be shown that approaches that adapt the HMMs to the environmental conditions of the speech in the testing environment approach the structure of an optimal classifier and should provide improved recognition accuracy relative to techniques that modify the data. Several authors (Moreno, 1996; Mokbel and Chollet, 1995) provide explanations for this phenomenon. Hence, we expect the STAR technique, which modifies the classifier, to outperform the RATZ technique, which modifies the data. Given equivalent experimental conditions, the experimental results bear this hypothesis out.

6. Summary and conclusions

This paper addresses the problem of data-driven environmental robustness algorithms. Starting with a study of the effects of the environment on speech distributions, we proposed a mathematical framework based on the EM algorithm for environment compensation. Two generic data-driven approaches have been proposed. The first approach modifies incoming cepstral vectors, while the second modifies the mean vectors and covariance matrices of the acoustical distributions of the statistical representation of the cepstra of clean speech developed by the HMMs. We have shown how both approaches can be derived within a common unified mathematical framework.

We performed a series of simulations using artificially-corrupted data to study in a controlled manner the effects of the environment on speech-like log-spectral distributions. From these simulations we observed that, while the distributions of the log spectra of speech are no longer Gaussian when subjected to additive noise and linear channel distortions, the changes can be well captured by simple additive correction terms applied to the means and variances. Motivated by these observations, we modeled the effects of the environment on Gaussian speech distributions as correction factors applied to the mean vectors and covariance matrices.

Fig. 10. Comparison of recognition accuracy obtained using blind and stereo configurations of the STAR and RATZ algorithms. The same 100 sentences were used for all adaptation algorithms.

We developed two families of algorithms for data-driven environmental compensation. The first set of algorithms (referred to as the RATZ algorithms) applies correction factors to the incoming vector representation of degraded speech. The second family (referred to as the STAR algorithms) applies corrections to the speech distributions that represent degraded speech in the HMM classifier. The values of the correction factors were learned in two different ways: using simultaneously-recorded clean and degraded speech databases ("stereo" databases) to learn the correction factors directly from data (stereo RATZ and STAR), and iteratively learning the correction factors from the degraded data alone (blind RATZ and STAR). We presented a unified framework for the RATZ and STAR algorithms, showing that techniques that modify the incoming cepstral vectors and techniques that modify the parameters of the distributions of the HMMs can be described by the same theory. Given equivalent experimental conditions, the STAR techniques, which modify the parameters of the distributions that internally represent speech, generally outperform the RATZ techniques, which modify the incoming cepstral vectors of degraded speech.

We have also shown that these data-driven compensation techniques perform quite well even with only ten sentences of adaptation data. Compared with a fully retrained system, the proposed algorithms provide very similar recognition accuracy for SNRs as low as 15 dB for the RATZ family of algorithms and as low as 5 dB for the STAR family of algorithms.

Appendix A. EM solutions for the correction factors r_k and R_k

In this appendix we provide detailed solutions for the correction terms r_k and R_k. Consider the log-likelihood function L(Y):

L(Y) = \log l(Y) = \sum_{t=1}^{T} \log p(y_t)
     = \sum_{t=1}^{T} \log\left( \sum_{k} a_k(t)\, N_y(r_k + \mu_{x,k},\, R_k + \Sigma_{x,k}) \right).   (A.1)

Our goal is to find the complete set of K terms r_k and R_k that maximize the likelihood (or log likelihood). As it turns out, there is no direct solution to this problem and an indirect method is necessary. The Expectation-Maximization (EM) algorithm is one such method.

The EM algorithm defines an auxiliary function Q(\phi, \bar{\phi}) as

Q(\phi, \bar{\phi}) = E[L(Y, S | \phi) | Y, \bar{\phi}],   (A.2)

where the pair (Y, S) represents the complete data, composed of the observed data Y (the noisy vectors) and the unobserved data S (indicating which Gaussian/state produced each observed vector). This equation can easily be related to the Baum–Welch equations used in hidden Markov modeling. The symbol \phi represents the set of parameters (K correction vectors and K correction matrices) to be estimated,

\phi = \{r_1, \ldots, r_K, R_1, \ldots, R_K\}.   (A.3)

The symbol \bar{\phi} represents the same set of parameters with their current values. The basis of the EM algorithm lies in the fact that, given two sets of parameters \phi and \bar{\phi}, if Q(\phi, \bar{\phi}) \ge Q(\bar{\phi}, \bar{\phi}), then L(Y; \phi) \ge L(Y; \bar{\phi}). In other words, maximizing Q(\phi, \bar{\phi}) with respect to the parameters \phi is guaranteed not to decrease the likelihood L(Y; \phi).

If the unobserved data S are represented by the mixture index k, Eq. (A.2) can be expanded as

Q(\phi, \bar{\phi}) = E[L(Y, S | \phi) | Y, \bar{\phi}]
                   = \sum_{t=1}^{T} \sum_{k=1}^{K} \frac{p(y_t, s_t(k) | \bar{\phi})}{p(y_t | \bar{\phi})}\, \log p(y_t, s_t(k) | \phi),   (A.4)


hence

Q(\phi, \bar{\phi}) = \sum_{t=1}^{T} \sum_{k=1}^{K} P(s_t(k) | y_t, \bar{\phi}) \Big\{ \log a_k(t) - \tfrac{D}{2}\log(2\pi) - \tfrac{1}{2}\log|R_k + \Sigma_{x,k}| - \tfrac{1}{2}(y_t - \mu_{x,k} - r_k)^{\mathrm T}(\Sigma_{x,k} + R_k)^{-1}(y_t - \mu_{x,k} - r_k) \Big\},   (A.5)

where D is the dimensionality of the cepstrum vector. The expression can be further simplified to

Q(\phi, \bar{\phi}) = \text{constant} + \sum_{t=1}^{T} \sum_{k=1}^{K} P(s_t(k) | y_t, \bar{\phi}) \Big\{ -\tfrac{1}{2}\log|R_k + \Sigma_{x,k}| - \tfrac{1}{2}(y_t - \mu_{x,k} - r_k)^{\mathrm T}(\Sigma_{x,k} + R_k)^{-1}(y_t - \mu_{x,k} - r_k) \Big\}.   (A.6)

To find the \phi parameters we simply take derivatives and set them equal to zero. After some manipulation we obtain

\nabla_{r_k} Q(\phi, \bar{\phi}) = \sum_{t=1}^{T} P(s_t(k) | y_t, \bar{\phi})\, (\Sigma_{x,k} + R_k)^{-1}(y_t - \mu_{x,k} - r_k) = 0,

\nabla_{(\Sigma_{x,k} + R_k)^{-1}} Q(\phi, \bar{\phi}) = \sum_{t=1}^{T} P(s_t(k) | y_t, \bar{\phi}) \Big\{ (\Sigma_{x,k} + R_k) - (y_t - \mu_{x,k} - r_k)(y_t - \mu_{x,k} - r_k)^{\mathrm T} \Big\} = 0,   (A.7)

hence

r_k = \frac{\sum_{t=1}^{T} P(s_t(k) | y_t, \bar{\phi})\, y_t}{\sum_{t=1}^{T} P(s_t(k) | y_t, \bar{\phi})} - \mu_{x,k},   (A.8)

R_k = \frac{\sum_{t=1}^{T} P(s_t(k) | y_t, \bar{\phi})\, (y_t - \mu_{x,k} - r_k)(y_t - \mu_{x,k} - r_k)^{\mathrm T}}{\sum_{t=1}^{T} P(s_t(k) | y_t, \bar{\phi})} - \Sigma_{x,k}.   (A.9)

Eqs. (A.8) and (A.9) form the basis of an iterative algorithm. The EM algorithm guarantees that each iteration does not decrease the likelihood of the observed data.
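The reestimation loop implied by Eqs. (A.8) and (A.9) can be sketched as follows for the blind (no stereo data) Gaussian-mixture case. This is our own NumPy illustration with diagonal covariances, not the authors' implementation; the variance clamp is an added safeguard to keep the corrected variances positive.

```python
import numpy as np

def blind_em_corrections(Y, a, mu_x, var_x, n_iter=10):
    """Iterative EM estimation of r_k and R_k from degraded data alone,
    following Eqs. (A.8)-(A.9) with diagonal covariances.

    Each iteration recomputes posteriors under the currently corrected
    mixture, then reestimates the corrections."""
    K, D = mu_x.shape
    r = np.zeros((K, D))
    R = np.zeros((K, D))
    for _ in range(n_iter):
        # E-step: posteriors P(s_t(k) | y_t) under the corrected mixture
        mu_y, var_y = mu_x + r, var_x + R
        log_p = (-0.5 * (((Y[:, None, :] - mu_y) ** 2) / var_y
                         + np.log(2 * np.pi * var_y)).sum(axis=2)
                 + np.log(a))
        log_p -= log_p.max(axis=1, keepdims=True)
        post = np.exp(log_p)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: Eq. (A.8) for r_k, Eq. (A.9) for R_k
        Nk = post.sum(axis=0)
        r = (post.T @ Y) / Nk[:, None] - mu_x
        dev = Y[:, None, :] - mu_x - r
        R = (post[:, :, None] * dev**2).sum(axis=0) / Nk[:, None] - var_x
        R = np.maximum(R, -var_x + 1e-8)   # keep var_x + R positive
    return r, R
```

With a single Gaussian the update is exact after one iteration: r is the sample mean of the degraded data minus the clean mean, and R is the difference of the variances.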

References

Acero, A., 1991. Acoustical and Environmental Robustness in

Automatic Speech Recognition. Kluwer Academic Press,

Boston, MA.

Acero, A., Stern, R.M., 1990. Environmental robustness in

automatic speech recognition. In: Proceedings of IEEE

International Conference on Acoustics, Speech, and Signal

Processing, Albuquerque, NM, 1990, pp. 849±852.

Anastasakos, A., Kubala, F., Makhoul, J., Schwartz, R., 1994.

Adaptation to new microphones using tied-mixtures normalization. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, Adelaide, Australia, pp. 433–436.

Baum, L., 1972. An inequality and associated maximization technique in statistical estimation of probabilistic functions of Markov processes. Inequalities 3, 1–8.

Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Statist. Soc. Ser. B 39, 1–38.

Ephraim, Y., 1992. Statistical-model-based speech enhancement systems. Proc. IEEE 80, 1526–1555.

Flanagan, J., Johnston, J., Zahn, R., Elko, G., 1985. Computer-steered microphone arrays for sound transduction in large rooms. J. Acoust. Soc. Am. 78, 1508–1518.

Gales, M.J.F., 1995. Model-Based Techniques for Noise Robust Speech Recognition. Ph.D. Thesis, Engineering Department, Cambridge University, Cambridge, UK.

Gales, M.J.F., Young, S.J., 1993. Cepstral parameter compensation for HMM recognition in noise. Speech Commun. 12, 231–239.

Ghitza, O., 1986. Auditory nerve representation as a front-end for speech recognition in a noisy environment. Comput. Speech Language 1, 109–130.

Hermansky, H., 1990. Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Am. 87, 1738–1752.

Hermansky, H., Morgan, N., 1994. RASTA processing of speech. IEEE Trans. Speech Audio Process. 2 (4), 578–589.

Huang, X., Ariki, Y., Jack, M.A., 1993. Hidden Markov Models for Speech Recognition. Edinburgh University Press, Edinburgh, UK.

Huang, X., Lee, K.-F., Hon, H.-W., Hwang, M.-Y., 1991. Improved acoustic modeling with the SPHINX recognition system. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, Toronto, Canada, pp. 235–238.

Juang, B., 1991. Speech recognition in adverse environments. Comput. Speech Language 5, 275–294.

Junqua, J.-C., Haton, J.-P., 1996. Robustness in Automatic Speech Recognition. Kluwer Academic Press, Boston, MA.

Lee, K.-F., Hon, H.-W., Reddy, R., 1990. An overview of the SPHINX speech recognition system. IEEE Trans. Acoust. Speech Signal Process. 38 (1), 35–45.

Leggetter, C.J., Woodland, P.C., 1995. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Comput. Speech Language 9 (2), 171–185.

284 P.J. Moreno et al. / Speech Communication 24 (1998) 267–285

Liu, F.-H., Stern, R.M., Acero, A., Moreno, P.J., 1994. Environment normalization for robust speech recognition using direct cepstral comparison. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, Adelaide, Australia, pp. 61–64.

Mokbel, C., Chollet, G.F.A., 1995. Automatic word recognition in cars. IEEE Trans. Speech Audio Process. 3 (5), 346–356.

Moreno, P.J., 1996. Speech recognition in noisy environments. Ph.D. Thesis, Department of Electrical and Computer Engineering, Carnegie Mellon University.

Moreno, P.J., Raj, B., Gouvea, E., Stern, R.M., 1995a. Multivariate Gaussian-based cepstral normalization. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, Detroit, MI, pp. 137–140.

Moreno, P.J., Raj, B., Stern, R.M., 1995b. A unified approach to robust speech recognition. In: Proceedings of Eurospeech 1995, vol. 1, Madrid, Spain, pp. 481–484.

Moreno, P.J., Raj, B., Stern, R.M., 1996. A vector Taylor series approach for environment-independent speech recognition. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, Atlanta, GA, pp. 733–736.

Neumeyer, L., Weintraub, M., 1994. Probabilistic optimum filtering for robust speech recognition. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, Adelaide, Australia, pp. 417–420.

Paul, D., Baker, J., 1992. The design of the Wall Street Journal-based CSR corpus. In: Proceedings of ARPA Speech and Natural Language Workshop, pp. 357–362.

Sankar, A., Lee, C.-H., 1995. Robust speech recognition based on stochastic matching. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 1, Detroit, MI, pp. 121–124.

Seneff, S., 1988. A joint synchrony/mean-rate model of auditory speech processing. J. Phonetics 16, 55–76.

Stern, R.M., Liu, F.-H., Moreno, P.J., Acero, A., 1994. Signal processing for robust speech recognition. In: Proceedings of International Conference on Spoken Language Processing, vol. 3, pp. 1027–1030.

Stern, R.M., Acero, A., Liu, F.-H., Ohshima, Y., 1996. Signal processing for robust speech recognition. In: Lee, C.-H., Soong, F., Paliwal, K.K. (Eds.), Automatic Speech and Speaker Recognition. Kluwer Academic Publishers, Boston, MA, pp. 351–378.

Sullivan, T.M., Stern, R.M., 1993. Multi-microphone correlation-based processing for robust speech recognition. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, Minneapolis, MN, pp. 91–94.

Varga, A.P., Moore, R.K., 1990. Hidden Markov model decomposition of speech and noise. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 2, Albuquerque, NM, pp. 845–848.