
doi:10.1006/csla.2001.0168 Available online at http://www.idealibrary.com on

Computer Speech and Language (2001) 15, 257–285

Maximum likelihood stochastic transformation adaptation for medium and small data sets

Constantinos Boulis,† Vassilios Diakoloukas and Vassilios Digalakis

Department of Electronics and Computer Engineering, Technical University of Crete, Hania, Greece

Abstract

Speaker adaptation is recognized as an essential part of today's large-vocabulary automatic speech recognition systems. A family of techniques that has been extensively applied for limited adaptation data is transformation-based adaptation. In transformation-based adaptation we partition our parameter space into a set of classes, estimate a transform (usually linear) for each class and apply the same transform to all the components of the class. It is known, however, that additional gains can be made if we do not constrain the components of each class to use the same transform. In this paper two speaker adaptation algorithms are described. First, instead of estimating one linear transform for each class (as maximum likelihood linear regression (MLLR) does, for example) we estimate multiple linear transforms per class of models and a transform weights vector which is specific to each component (Gaussians in our case). This in effect means that each component receives its own transform without having to estimate each one of them independently. This scheme, termed maximum likelihood stochastic transformation (MLST), achieves a good trade-off between robustness and acoustic resolution. MLST is evaluated on the Wall Street Journal (WSJ) corpus for non-native speakers, and it is shown that in the case of 40 adaptation sentences the algorithm outperforms MLLR by more than 13%. In the second half of this paper, we introduce a variant of MLST designed to operate under sparsity of data. Since the majority of the adaptation parameters are the transformations, we estimate them on the training speakers and adapt to a new speaker by estimating the transform weights only. First we cluster the speakers into a number of sets and estimate the transformations on each cluster. The new speaker will use transformations from all clusters to perform adaptation. This method, termed basis transformation, can be seen as a speaker similarity scheme. Experimental results on the WSJ show that when basis transformation is cascaded with MLLR, marginal gains can be obtained over MLLR alone, for adaptation of native speakers.

© 2001 Academic Press

1. Introduction

Recent automatic speech recognizers are usually trained on a set of speakers that are considered to adequately model every future speaker who may use the system. However, even if we select a large number of training speakers, it is still possible to adapt the system to the current speaker and achieve superior performance. The need for adaptation becomes even higher when the testing conditions are significantly different from the training conditions. For example, using an automatic speech recognizer on non-native speakers where the training data were collected from native speakers, or recognizing noisy speech when the system was trained on noiseless data, are cases where the mismatch problem arises.

† Now with the Department of Electrical Engineering, University of Washington, Seattle, WA, U.S.A. E-mails: [email protected]; [email protected]; [email protected]

0885–2308/01/030257 + 29 $35.00/0 © 2001 Academic Press

Mismatch compensation techniques can be applied at the feature or model space level (Sankar & Lee, 1996). In the former we alter the speech feature vectors and leave the models unchanged, whereas in the latter we alter the parameter values of the models only.

Many model adaptation techniques have appeared in the literature, all of them belonging to one of two main categories (or combining them). The first category is the Bayesian methods, where the model parameters are estimated using the maximum a posteriori criterion, assuming a prior distribution for the parameters (Gauvain & Lee, 1994). The second category is the transformation-based methods, where a set of model parameters is adapted using the same transform (usually linear). This kind of model adaptation appeared simultaneously in Digalakis, Rtischev and Neumeyer (1995) and Leggetter and Woodland (1995) and became known as maximum likelihood linear regression or MLLR.

Numerous variations of the above methods have been presented in the past, some of them combining the two categories, as in Chesta, Siohan and Lee (1999), Wang and Liu (1999) and Gunawardana and Byrne (2000). For example, in Digalakis and Neumeyer (1996) MLLR is cascaded with MAP to give superior results to MLLR or MAP alone.

The main drawback of the Bayesian adaptation methods arises from the fact that, in order to have tractable analytical solutions, they assume independent prior distributions for different parameters. This necessitates the availability of a large set of adaptation data to have an effective change of the initial model parameters, since each parameter must have been explicitly observed in the adaptation data for its value to be altered. Alternatives have been proposed in Afify, Gong and Haton (1997) and Shahshahani (1997), where a joint prior distribution that takes into consideration the correlation between different states is presented, but these prior distributions are constructed empirically and so their effectiveness is limited.

Transformation-based approaches, on the other hand, try to overcome this limitation by estimating a transform on a class of models and applying this transform to all the models of the class, irrespective of whether a model was observed in the adaptation data or not. The main drawback of these methods is that we use the same transform for all models belonging to a class, although intuitively we know that a single transform cannot suit every model in the class. One way of compensating is to use a higher number of classes, but this is not a robust solution for limited adaptation data. A better approach is to estimate multiple transforms per class and a transform weights vector for each component (Gaussian densities in the case of large-vocabulary continuous-density systems). Since far fewer adaptation samples are needed for the robust estimation of transform weights than for the transforms themselves, this scheme manages to apply a unique transform per component and still achieve a reasonable level of robustness without estimating each transform independently.

The method presented in the first half of this paper was first introduced in Diakoloukas and Digalakis (1999) under the name maximum likelihood stochastic transformations (MLST). In this work we propose a number of important improvements that significantly affect the performance, and elaborate on some points that were not adequately covered in Diakoloukas and Digalakis (1999). In Section 2 we review the basic equation of MLST and introduce a new form of grouping the transformed Gaussians back to the original number. In Section 3 we formulate the estimation of transform weights and transforms using different tying levels for the two sets of adaptation parameters. In Section 4 we present an algorithm to automatically derive the optimum number of transforms per class, based on the number of samples that each class has received. In Section 5 we address many important implementation issues, such as the combination of different types of transforms for each class; we describe an improved backoff scheme for classes that do not have enough samples to estimate robust transforms, and use different initialization schemes for the estimation of transforms. In Section 6 we carry out a set of experiments to evaluate all of our newly introduced improvements. In Section 7 we introduce a novel variant of MLST specifically designed to work under sparsity of data. We drastically reduce the number of adaptation parameters by selecting transformations from other training speakers; adaptation is then carried out by updating the transform weights only. The proposed variant can be seen as a speaker similarity scheme, where the values of the transform weights show the preference for specific speakers. In Section 8 we describe the techniques used to cluster the speakers. In Section 9 the methodology used to generate the set of basis transforms is explained. In Section 10 we report experimental results for the variant of MLST. In Section 11 we comment on related work and compare the various approaches. Finally, in Section 12 we briefly state the contributions of this work and describe future work.

2. Multiple linear transforms

Let us now formulate our method. It is believed that, of the entire set of parameters of a modern automatic speech recognizer, the observation densities have the largest impact on recognition performance. In this paper, as in most prior papers, only the observation densities have been adapted. We assume that the observation density for each HMM state s is a mixture of continuous Gaussian densities having the form

$$P_{SI}(o_t \mid s) = \sum_{j=1}^{N_\omega} p(\omega_j \mid s)\, N(o_t; m_{sj}, S_{sj}) \qquad (1)$$

where p(ω_j | s) is the weight of the jth Gaussian density of state s, and m_{sj}, S_{sj} are the mean and covariance vectors, of dimension d each (assuming diagonal covariances), for the jth Gaussian density of state s. N_ω is the number of Gaussian densities that comprise the observation density of state s, and o_t is the tth observation vector.

MLLR adapts mixtures using the following equation:

$$P_{SA}(o_t \mid s) = \sum_{j=1}^{N_\omega} p(\omega_j \mid s)\, N(o_t; A_c m_{sj} + b_c, S_{sj}) \qquad (2)$$

where again p(ω_j | s) is the probability of the jth density given state s, and [A_c, b_c] are the transformation parameters used for adaptation of class c. Alternatively, we can choose to transform the covariances as well, but since it has not been sufficiently shown that this provides additional gains we can choose to transform only the means (Gales & Woodland, 1996). The rotation matrix A_c can be full, block-diagonal or diagonal, but experiments have shown (Neumeyer, Sankar & Digalakis, 1995) that gains are higher when it is chosen to be block-diagonal (usually three blocks: for cepstrum, delta and delta–delta coefficients). In this case the number of free adaptation parameters for the rotation matrix is (d × d)/3, and for the bias vector b_c it is d in all cases.
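The MLLR mean update of Equation (2) can be sketched as follows. This is an illustrative reimplementation, not the authors' code; the 13-dimensional blocks (one per cepstral stream) and the random transform values are assumptions for the example.

```python
import numpy as np

def block_diagonal(blocks):
    """Assemble a block-diagonal rotation matrix A_c from square blocks."""
    d = sum(b.shape[0] for b in blocks)
    A = np.zeros((d, d))
    i = 0
    for b in blocks:
        n = b.shape[0]
        A[i:i + n, i:i + n] = b
        i += n
    return A

def mllr_adapt_means(means, A_c, b_c):
    """Apply one class transform [A_c, b_c] to all means of the class.
    means: (num_gaussians, d) array; every row is mapped to A_c m + b_c."""
    return means @ A_c.T + b_c

rng = np.random.default_rng(0)
d_block = 13                      # assumed 13 coefficients per stream, d = 39
blocks = [np.eye(d_block) + 0.01 * rng.standard_normal((d_block, d_block))
          for _ in range(3)]      # cepstrum / delta / delta-delta blocks
A_c = block_diagonal(blocks)
b_c = 0.1 * rng.standard_normal(3 * d_block)

means = rng.standard_normal((32, 3 * d_block))   # one 32-Gaussian codebook
adapted = mllr_adapt_means(means, A_c, b_c)
print(adapted.shape)              # (32, 39)
```

With three d/3-sized blocks the rotation matrix indeed carries (d × d)/3 free parameters, plus d for the bias, matching the count in the text.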

On the other hand MLST transforms mixtures according to

$$P_{SA}(o_t \mid s) = \sum_{j=1}^{N_\omega} \sum_{k=1}^{N_\lambda} p(\lambda_k \mid s, \omega_j)\, p(\omega_j \mid s)\, N(o_t; A_{ck} m_{sj} + b_{ck}, S_{sj}) \qquad (3)$$

where p(λ_k | s, ω_j) is the transform weight for the kth transform of the jth Gaussian, N_λ is the number of transforms per class and [A_{ck}, b_{ck}] is the kth transform for class c. As can be seen from the above equation, MLST multiplies the total number of Gaussian components by N_λ, resulting in N_ω × N_λ total Gaussians. Of course, this is unacceptable in most situations, since the speed of the recognizer would greatly decrease, so we explore three different ways of grouping the components and returning to the original number of parameters of the system.

The first type of grouping consists of selecting the transform with the highest weight. That is,

$$P_{SA}(o_t \mid s) = \sum_{j=1}^{N_\omega} p(\omega_j \mid s)\, N(o_t; A_{ck'_j} m_{sj} + b_{ck'_j}, S_{sj}), \quad \text{where } k'_j = \arg\max_k \{p(\lambda_k \mid s, \omega_j)\}. \qquad (4)$$

The second type of grouping consists of taking the linear combination of the transforms:

$$P_{SA}(o_t \mid s) = \sum_{j=1}^{N_\omega} p(\omega_j \mid s)\, N(o_t; A'_{cj} m_{sj} + b'_{cj}, S_{sj}), \quad \text{where } A'_{cj} = \sum_{k=1}^{N_\lambda} p(\lambda_k \mid s, \omega_j)\, A_{ck} \ \text{ and } \ b'_{cj} = \sum_{k=1}^{N_\lambda} p(\lambda_k \mid s, \omega_j)\, b_{ck}. \qquad (5)$$

An advantage of the linear combination scheme is that it smooths the estimation errors of the transformation matrices. This means that we can now estimate more transforms, even if each one of them is less robust than an MLLR transform: by applying the linear combination scheme we still obtain an effectively robust transform.
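The linear-combination grouping of Equation (5) can be sketched as below; the dimensions, weights and near-identity transforms are illustrative assumptions, not values from the paper.

```python
import numpy as np

def combine_transforms(A_stack, b_stack, weights):
    """Build the effective per-Gaussian transform [A'_cj, b'_cj] of Eq. (5).
    A_stack: (N_lambda, d, d) class transforms A_ck,
    b_stack: (N_lambda, d) class biases b_ck,
    weights: (N_lambda,) transform weights p(lambda_k | s, omega_j), sum 1."""
    A_eff = np.tensordot(weights, A_stack, axes=1)   # sum_k w_k A_ck
    b_eff = weights @ b_stack                        # sum_k w_k b_ck
    return A_eff, b_eff

rng = np.random.default_rng(1)
N_lambda, d = 4, 6
A_stack = np.stack([np.eye(d) + 0.05 * rng.standard_normal((d, d))
                    for _ in range(N_lambda)])
b_stack = rng.standard_normal((N_lambda, d))
w = rng.random(N_lambda)
w /= w.sum()                                         # normalized weight vector

A_eff, b_eff = combine_transforms(A_stack, b_stack, w)
m_sj = rng.standard_normal(d)
adapted_mean = A_eff @ m_sj + b_eff                  # mean for Gaussian j
```

Because the weight vector differs per Gaussian, each component ends up with its own effective transform while only N_λ full transforms were ever estimated for the class.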

The third type of grouping consists of merging the N_λ transformed Gaussians into one. That is,

$$m^{(i)}_{sj} = \sum_{k=1}^{N_\lambda} p(\lambda_k \mid s, \omega_j)\, \mu^{(i)}_{sjk}, \qquad (6)$$

$$(\sigma^{(i)}_{sj})^2 = (\sigma^{(i)}_{sj})^2 + \sum_{k=1}^{N_\lambda} p(\lambda_k \mid s, \omega_j)\, (\mu^{(i)}_{sjk})^2 - (m^{(i)}_{sj})^2 \qquad (7)$$

where μ_{sjk} = A_{ck} m_{sj} + b_{ck} is the mean vector of state s, Gaussian j, as transformed by the kth transform of class c, and m^{(i)}_{sj} and (σ^{(i)}_{sj})^2 are the ith elements of the new mean and covariance vectors respectively (in Equation (7) the left-hand side denotes the merged variance, while the first term on the right is the original variance).

This scheme always results in broader covariances, which means increased recognition time, since a higher number of hypotheses will be active. On the other hand, it has the advantage that the covariances are also altered even though the adaptation equation adapts only the mean vectors. If we choose to adapt the covariances as well, with full or block-diagonal matrices, then we are faced with problems such as the numerical solution and increased recognition time (Gales & Woodland, 1996). This scheme provides us with a simple method to alter the covariances when using non-diagonal transforms. The first two grouping methods were introduced previously in Diakoloukas and Digalakis (1999), while the third is first applied in this work.
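The moment-matching merge of Equations (6) and (7) can be sketched as follows; shapes and weights are illustrative assumptions. Since only the means were transformed, every copy k shares the original diagonal covariance, and the merged variance widens by the spread of the transformed means.

```python
import numpy as np

def merge_transformed_gaussians(mu_k, var_orig, weights):
    """Collapse N_lambda transformed copies of a Gaussian into one.
    mu_k: (N_lambda, d) transformed means mu_sjk,
    var_orig: (d,) original diagonal covariance,
    weights: (N_lambda,) transform weights p(lambda_k | s, omega_j)."""
    m = weights @ mu_k                                   # Equation (6)
    var = var_orig + weights @ (mu_k ** 2) - m ** 2      # Equation (7)
    return m, var

rng = np.random.default_rng(2)
N_lambda, d = 3, 5
mu_k = rng.standard_normal((N_lambda, d))
var_orig = rng.random(d) + 0.5
w = np.array([0.5, 0.3, 0.2])

m, var = merge_transformed_gaussians(mu_k, var_orig, w)
assert np.all(var >= var_orig)   # merging always broadens the covariance
```

The final assertion mirrors the claim in the text: the weighted second moment minus the squared weighted mean is non-negative, so the merged covariance is never narrower than the original.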

In Gales (1997) a scheme related to MLST was introduced, where the adaptation parameters consist of the set of transforms and a weight vector. The transforms are then interpolated using the weight vector to produce enhanced estimates compared with simple MLLR. In mathematical notation, the scheme presented in Gales (1997) adapts output observation probabilities using

$$P_{SA}(o_t \mid s) = \sum_{j=1}^{N_\omega} p(\omega_j \mid s)\, N\!\left(o_t;\; \sum_{r=1}^{R} p(\lambda_r \mid s, \omega_j)\, A_r m_{sj} + \sum_{r=1}^{R} p(\lambda_r \mid s, \omega_j)\, b_r,\; S_{sj}\right) \qquad (8)$$

where R is the number of regression classes and p(λ_r | s, ω_j) is the weight of Gaussian j of state s for regression class r. This scheme can be seen as a special case of the MLST algorithm: MLST with linear combination of transforms can result in almost identical adaptation equations. However, there are a number of advantages of MLST over Gales (1997). First, the scheme in Gales (1997) estimates one linear transform per class, while MLST estimates multiple transforms per class. This has the following advantage. Suppose we have enough samples to estimate ten transforms. Then the weight vector of Gales (1997) will be comprised of ten elements, one for each class transform. In MLST, on the other hand, we are free to choose the number of transforms per class, so we can estimate two transforms per class for five classes. In this way we have limited the power of each transform, since they are estimated at a higher level of tying, but the transform weight vector is comprised of only two elements. This leads to many more components being able to estimate their own transform weight vector, which results in many more effective transforms. MLST can be seen as a density combination scheme, while the method presented in Gales (1997) can be seen as a transformation combination scheme.

3. Tying of transform weights and transforms

Reviewing Diakoloukas and Digalakis (1999), the transform weights formula for state s, Gaussian j and transform k is given by

$$p(\lambda_k \mid \omega_j, s) = \frac{\sum_t \gamma_{sjk}(t)}{\sum_k \sum_t \gamma_{sjk}(t)} \qquad (9)$$

where γ_{sjk}(t) = p(s_t = s, ω_t = j, λ_t = k | o_t), s_t is the state at time t, ω_t is the Gaussian component at time t, λ_t is the transform at time t and o_t is the input observation vector at time t. Notice that another hidden variable is now added in the MLST formulation. Along with the state and Gaussian component sequences, the transform sequence (that is, which specific transform among the many that exist for a class) must also be estimated. The expectation-maximization (EM) algorithm can be used to address such hidden-variable estimation problems.

For the quantity γ_{sjk}(t) we can write

$$\gamma_{sjk}(t) = p(s_t = s, \omega_t = j, \lambda_t = k \mid o_t) = p(s_t = s, \omega_t = j \mid o_t, \lambda_t = k) \cdot p(\lambda_t = k \mid o_t). \qquad (10)$$

The first term of Equation (10) is estimated from the standard forward–backward procedure, since we assume the hidden variable of the transform sequence to be known. The second term can be calculated using

$$p(\lambda_t = k \mid o_t) = \frac{p(\lambda_k \mid \omega_j, s)\, N(o_t; A_{ck} m_{sj} + b_{ck}, S_{sj})}{\sum_{k=1}^{N_\lambda} p(\lambda_k \mid \omega_j, s)\, N(o_t; A_{ck} m_{sj} + b_{ck}, S_{sj})}. \qquad (11)$$
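The transform posterior of Equation (11) can be sketched as below; diagonal covariances are assumed, as in the text, and the log-space computation is an implementation choice for numerical stability, not something prescribed by the paper.

```python
import numpy as np

def log_gauss_diag(o, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (o - mean) ** 2 / var)

def transform_posteriors(o, m_sj, var_sj, A_stack, b_stack, prior_w):
    """Return p(lambda_t = k | o_t) for all k, per Equation (11):
    prior transform weight times the likelihood under the k-th transformed
    mean, normalized over the N_lambda transforms."""
    log_post = np.array([
        np.log(prior_w[k])
        + log_gauss_diag(o, A_stack[k] @ m_sj + b_stack[k], var_sj)
        for k in range(len(prior_w))
    ])
    log_post -= log_post.max()                 # avoid underflow
    post = np.exp(log_post)
    return post / post.sum()

rng = np.random.default_rng(3)
N_lambda, d = 4, 6
A_stack = np.stack([np.eye(d) + 0.1 * rng.standard_normal((d, d))
                    for _ in range(N_lambda)])
b_stack = rng.standard_normal((N_lambda, d))
m_sj = rng.standard_normal(d)
var_sj = rng.random(d) + 0.5
prior_w = np.full(N_lambda, 1.0 / N_lambda)    # p(lambda_k | omega_j, s)

o_t = A_stack[2] @ m_sj + b_stack[2]           # observation at transform 2's mean
post = transform_posteriors(o_t, m_sj, var_sj, A_stack, b_stack, prior_w)
assert np.isclose(post.sum(), 1.0)
assert post.argmax() == 2                      # transform 2 best explains o_t
```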

We see from the above that the calculation of γ_{sjk}(t) requires more arithmetic operations than the analogous MLLR quantity γ_{sj}(t). However, we will show in Section 6 that, in practice, the degradation in speed is almost negligible.

Examining the transform weights equation as it was presented in Diakoloukas and Digalakis (1999), we can observe that each Gaussian component estimates its own transform weights vector. In large-vocabulary systems, when few adaptation data are used, it is not realistic to expect every Gaussian component to robustly estimate its own transform weights vector, so a form of tying is necessary.

We have set an empirically determined threshold for the total number of samples assigned to all the transforms of a Gaussian component. If the number of samples is below this threshold we estimate the transform weights at the state level; if the state did not receive enough samples we use the class-level transform weights vector. The transform weights equations are

$$p(\lambda_k \mid \omega_j, s) = \frac{\sum_t \gamma_{sjk}(t)}{\sum_k \sum_t \gamma_{sjk}(t)} \quad \text{(Gaussian level)} \qquad (12)$$

if $\sum_k \sum_t \gamma_{sjk}(t) > T$; else

$$p(\lambda_k \mid \omega_j, s) = p(\lambda_k \mid s) = \frac{\sum_j \sum_t \gamma_{sjk}(t)}{\sum_k \sum_j \sum_t \gamma_{sjk}(t)} \quad \text{(state level)} \qquad (13)$$

if $\sum_k \sum_j \sum_t \gamma_{sjk}(t) > T$; else

$$p(\lambda_k \mid \omega_j, s) = p(\lambda_k \mid c) = \frac{\sum_{s \in c} \sum_j \sum_t \gamma_{sjk}(t)}{\sum_{s \in c} \sum_k \sum_j \sum_t \gamma_{sjk}(t)} \quad \text{(class level)} \qquad (14)$$

if $\sum_{s \in c} \sum_k \sum_j \sum_t \gamma_{sjk}(t) > T$,

where c is the class that state s belongs to.

Similar tying schemes are used for the transforms themselves. The transforms, as stated before, are estimated on a set of states (class). If the total number of samples is below a threshold we can choose to estimate the transforms on a superset of the class (backoff class), thus achieving more robust estimation. The transforms are estimated using

$$\sum_{s \in c} \sum_j \sum_t \gamma_{sjk}(t)\, S_{sj}^{-1} o_t\, \hat{\mu}'_{sj} = \sum_{s \in c} \sum_j \sum_t \gamma_{sjk}(t)\, S_{sj}^{-1} W_{ck}\, \hat{\mu}_{sj} \hat{\mu}'_{sj} \qquad (15)$$

where μ̂_{sj} = [1 μ'_{sj}]' (μ'_{sj} is the transpose of the μ_{sj} vector) and W_{ck} = [b_{ck} | A_{ck}] is the kth transform for class c.

Finally, it is important to state that it has been experimentally observed that far fewer samples are needed for the transform weights than for the transforms themselves to achieve robust estimation. This means that the transform weights threshold must be an order of magnitude smaller than the transform threshold, which enables detailed acoustic resolution even in small adaptation data sets. Nevertheless, it always pays to introduce a form of tying even in transform weights, as we will show in the following section.
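The tying cascade of Equations (12)-(14) can be sketched as follows. The per-state count arrays, state names, and the threshold T are illustrative assumptions; `gamma[s]` stands for the time-summed counts Σ_t γ_{sjk}(t) for all Gaussians and transforms of state s.

```python
import numpy as np

def transform_weights(gamma, s, j, class_states, T):
    """Back off from Gaussian- to state- to class-level transform weights.
    gamma: dict state -> (N_omega, N_lambda) array of summed counts."""
    g = gamma[s][j]                            # counts for Gaussian j of state s
    if g.sum() > T:                            # Equation (12): Gaussian level
        return g / g.sum()
    g_state = gamma[s].sum(axis=0)             # pool the Gaussians of state s
    if g_state.sum() > T:                      # Equation (13): state level
        return g_state / g_state.sum()
    g_class = sum(gamma[q].sum(axis=0) for q in class_states)
    return g_class / g_class.sum()             # Equation (14): class level

rng = np.random.default_rng(4)
N_omega, N_lambda, T = 3, 2, 10.0
gamma = {"s1": rng.random((N_omega, N_lambda)) * 20,   # well-observed state
         "s2": rng.random((N_omega, N_lambda)) * 0.1}  # data-starved state
w_rich = transform_weights(gamma, "s1", 0, ["s1", "s2"], T)
w_poor = transform_weights(gamma, "s2", 0, ["s1", "s2"], T)
```

The starved state's Gaussian falls through both thresholds and ends up with the class-level weights, which is exactly the robustness the cascade buys.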


4. Dynamic number of transforms

In the main MLST Equation (3) it was implicitly assumed that the same number of transforms (N_λ) is estimated for all classes, as in Diakoloukas and Digalakis (1999). However, since the number of samples associated with each class is clearly non-uniform, it is not realistic to expect every class to robustly estimate the same number of transforms. The problem is even more intense when we use the necessary threshold to decide whether to estimate a transform. Let us give a concrete example.

Suppose a class has received enough samples to estimate three transforms (of any transform type) but we have chosen a value of N_λ = 10. It is likely that no transform will be estimated, since the training samples will be distributed across all ten transforms and none will reach the predefined threshold. In effect, this means that many samples are not used.

We could set the number of transforms per class to be the minimum number of transforms any class can estimate, but this greatly reduces the acoustic resolution of the adaptation method, since the other classes would be able to estimate more transforms but are not allowed to.

A better approach is to dynamically determine the optimum number of transforms each class can estimate. Such an algorithm is described below.

Step 0: Initially set the same number of transforms for all classes, or use an initial estimate of the number of transforms for each class.
Step 1: Run the forward–backward or the Viterbi training algorithm to collect sufficient statistics.
Step 2: Count the number of samples for each transform of each class. If a transform has received fewer samples than a predetermined threshold, delete the transform and all memory requirements associated with it. If a class was found to have two or more non-robust transforms, select any one (but only one) of the non-robust transforms and delete it.
Step 3: For all the classes that were found to have at least one non-robust transform, decrement the number of transforms for these classes by one and do not perform any adaptation for this EM iteration.
Step 4: Go to Step 1 until all classes have only robust transforms.

The main point of the algorithm is that if a class was found to have two or more non-robust transforms¹ it deletes one of them and then re-runs the forward–backward or Viterbi training. If a class can estimate three transforms but was set to estimate ten, it will probably estimate none; so if we reduced the number of transforms for this class by nine or more we would have limited the acoustic resolution of the adaptation algorithm. Gradually reducing it by one at each EM iteration guarantees that we eventually reach the optimum number of transforms for a class.
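The pruning loop above can be sketched as follows. The `collect_counts` function is a toy stand-in for a real forward-backward or Viterbi pass (it just spreads each class's samples evenly over its transforms), and the class totals and threshold are assumptions for the example.

```python
def prune_until_robust(collect_counts, n_transforms, threshold, max_iter=50):
    """Iteratively drop one non-robust transform per offending class
    until every surviving transform clears the sample threshold.
    n_transforms: dict class -> current N_lambda for that class."""
    for _ in range(max_iter):
        counts = collect_counts(n_transforms)     # Step 1: sufficient statistics
        changed = False
        for c, k in list(n_transforms.items()):
            if k > 1 and min(counts[c]) < threshold:
                n_transforms[c] = k - 1           # Steps 2-3: delete one transform
                changed = True
        if not changed:                           # Step 4: all robust, done
            return n_transforms
    return n_transforms

def collect_counts(n_transforms, totals={"c1": 300.0, "c2": 45.0}):
    # Toy model: each class's samples spread evenly over its transforms.
    return {c: [totals[c] / k] * k for c, k in n_transforms.items()}

result = prune_until_robust(collect_counts, {"c1": 10, "c2": 10}, threshold=50.0)
print(result)   # the well-observed c1 keeps more transforms than the starved c2
```

Starting both classes at ten transforms, the loop walks c1 down one transform per pass until each of its transforms clears the threshold, while c2 collapses to a single transform, mirroring the per-class behaviour the text describes.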

However, with this algorithm more EM iterations are needed, since some of them will be used to determine the optimum number of transforms and the others to perform the actual adaptation. This is not a serious drawback, for three reasons. First, we can determine a good initial estimate of the number of transforms for each class, resulting in few EM iterations to determine the optimum number of transforms. Second, we can increase the sophistication of the algorithm by allowing it to reduce the number of transforms by two or more when a class was found with many non-robust transforms. Third, the adaptation process is considered to be off-line in most cases, so adaptation time is not an issue.

¹ A non-robust transform is a transform with a small number of samples associated with it. This means that the variance of the transform parameters given the specific data is high and therefore their values are not reliable. The threshold below which a transform is considered non-robust is determined empirically.

5. Implementation issues

As was described earlier, there exist three choices for the transform type: full, block-diagonal or diagonal. In Diakoloukas and Digalakis (1999), Digalakis et al. (1995) and Leggetter and Woodland (1995) the same transform type is assumed for all transforms. It is interesting, though, to explore different types of transforms within a class. For example, it may be more beneficial to use four block-diagonal and five diagonal transforms than five block-diagonal or 20 diagonal transforms for a given class.

The issue of mixing transform types does not arise in the MLLR case, since only one transform is estimated per class; it does arise in MLST. Given the number of samples a class has received, we can investigate different ways of assigning them to transforms.

Another issue is when to back off a class. In MLLR, if a class did not receive enough samples, the transform for a superset of the class was estimated. In Diakoloukas and Digalakis (1999) an analogous scheme is adopted: if a transform of a class did not receive enough samples, it is equated with the MLLR transform for a superset of the class.

In MLST, however, the backoff mechanism can be different. We can apply the same MLST framework to the backoff classes as well as the primary classes. The question that arises is when to back off. A class may have enough samples to estimate one transform, but the backoff class may be more suitable since it can estimate many more. In this work a class backs off to MLST backoff classes when no transform can be estimated for this class.

Another issue is the initialization of the transforms. Since we have many transforms for a class, these transforms must be initialized differently for the following EM iterations. One way of initialization was suggested in Diakoloukas and Digalakis (1999), where a slightly perturbed identity matrix was used. Specifically, for all states s and for j = 1, ..., N_λ,

$$A_{sj} = I, \qquad b_{sj} = h_j \otimes S_s \qquad (16)$$

where I is the identity matrix and ⊗ represents the element-wise product of two vectors. S_s is the covariance vector of any Gaussian of the output probability of state s, and h_j is the jth column of a d × d matrix with every element randomly selected to be 1 or −1, where d is the dimension of the offset vector b_{sj}. Usually the result is scaled by a coefficient much smaller than unity. The initial transforms of each class derived from this scheme are different but close to each other, so many EM iterations are needed in order to have an effective change of the means of the observation probabilities.
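The perturbed-identity initialization of Equation (16) can be sketched as below; the scale 0.01, the dimensions and the example covariance vector are illustrative assumptions.

```python
import numpy as np

def init_transforms(S_s, N_lambda, scale=0.01, seed=0):
    """Initialize N_lambda transforms per Equation (16): A_sj = I and
    b_sj = h_j (element-wise product) S_s, with h_j a column of random
    +/-1 signs, scaled well below unity.
    S_s: (d,) covariance vector of any Gaussian of the state's output density."""
    rng = np.random.default_rng(seed)
    d = len(S_s)
    H = rng.choice([-1.0, 1.0], size=(d, N_lambda))   # random-sign matrix
    A_init = [np.eye(d) for _ in range(N_lambda)]     # A_sj = I
    b_init = [scale * H[:, j] * S_s for j in range(N_lambda)]
    return A_init, b_init

S_s = np.array([0.8, 1.2, 0.5, 2.0])
A_init, b_init = init_transforms(S_s, N_lambda=3)
# Each bias is small and different, so the N_lambda transforms start distinct
# but close to the identity, as the text describes.
```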

Another approach is to perturb not the identity matrix but the MLLR transform for thisclass. With this scheme we may achieve a better local optimum although the same number ofEM iterations may be needed. We first estimate the MLLR transform for each class and thenadd it to the Hadamard initialization equations to generate an improved initial transform.

A third approach, which requires fewer EM iterations to converge, is to split the initial class into N_λ subclasses and estimate an MLLR transform for each subclass. Each of these MLLR transforms is then used to initialize one of the N_λ transforms of MLST. This scheme requires fewer EM iterations than the previous two, since the initial transforms will be substantially different and will be estimated using regression techniques such as MLLR rather than heuristic schemes.

Another issue that needs to be examined is the memory and speed requirements of the new method. MLST estimates multiple transforms per class, so it is required to store the Gaussians as transformed by each one of the transforms. Supposing that each class has N_λ transforms and N_ω Gaussians, we need to keep in memory (primary or secondary) N_λ × N_ω Gaussians. For each Gaussian we need to store its mean and covariance vector along with zero-, first- and second-order statistics; thus the total storage requirement for each mixture is N_λ × N_ω × 4 × (d + 1) floats. This can be reduced by half if we choose to adapt only the means, since only the mean vector and the first-order statistics need to be stored. In practice, it was shown that in all cases no more than 60 MB needed to be allocated in excess of those used for MLLR. By today's standards these memory requirements are not considered prohibitive. Alternatively, we could use a caching mechanism rather than storing all the transformed Gaussians in primary memory. We could keep in memory the most frequently requested transformed Gaussians and compute the rest online. Since most of the Gaussians are not used at all during the forward–backward algorithm (because of the pruning strategy applied) and some others are scarcely used, this caching scheme can provide a viable solution for systems with limited memory resources.
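A quick check of the storage figure can be sketched as follows. The system sizes (d = 39, 32-Gaussian codebooks, about 500 codebooks) follow the experimental setup of Section 6; N_λ = 4 and 4-byte floats are assumptions for the example.

```python
def mixture_storage_floats(n_lambda, n_omega, d, means_only=False):
    """Floats per mixture: N_lambda x N_omega Gaussians, each holding mean,
    covariance and zero-/first-/second-order statistics, i.e. 4 x (d + 1)."""
    per_gaussian = 4 * (d + 1)
    if means_only:                # only the mean vector + first-order statistics
        per_gaussian //= 2
    return n_lambda * n_omega * per_gaussian

d, n_omega, n_codebooks, n_lambda = 39, 32, 500, 4
total_bytes = n_codebooks * mixture_storage_floats(n_lambda, n_omega, d) * 4
print(f"{total_bytes / 2**20:.1f} MB")   # tens of MB, consistent with the text
```

Under these assumptions the extra storage comes to roughly 39 MB, in line with the at-most-60 MB overhead reported in the text.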

In addition, as can be seen from Equations (9) and (10), the forward–backward procedure for MLST involves more arithmetic operations than MLLR, since we add another hidden variable in our formulation. In practice, this did not cause a serious degradation in the speed of the forward–backward algorithm, since we apply a pruning strategy. With pruning we do not expand all possible paths, but only those that are within a range of values from the best path at each time frame. Since with MLST we achieve a better acoustic match with the current speaker than with MLLR, more paths will be pruned with MLST than with MLLR.

For example, MLLR requires 1·14 s of processing time for each of the adaptation sentences on an Intel Pentium-III 450 MHz computer with 256 MB of memory, while MLST requires at most 1·27 s per sentence. This figure corresponds to the highest number of transforms per class that we experimented with; choosing a lower number of transforms reduces the processing time accordingly.

6. Experiments for MLST

We evaluated our algorithm using SRI's DECIPHER™ system on the "spoke 3" task of the large-vocabulary Wall Street Journal (WSJ) corpus (Paul & Baker, 1992). The "spoke 3" task consists of outlier or non-native speakers, for whom adaptation is truly needed.

The system's front-end was configured to output 39 coefficients per speech frame: cepstrum, delta cepstrum, delta–delta cepstrum and their respective energies. The cepstral features are computed from an FFT filter bank, and cepstral-mean normalization is subsequently performed on a sentence basis.

The speaker-independent, continuous HMM models that were used as seed models for adaptation were gender dependent, trained on 140 speakers and 17 000 sentences for each gender. Each of the two systems used about 500 codebooks of Gaussian densities of size 32 each, resulting in about 15 000 Gaussians in total. The Gaussian codebooks were shared among 12 000 context-dependent phonetic models. We used the 5000-word closed-vocabulary bigram language model provided by MIT Lincoln Laboratory. The test set consisted of six female and five male speakers with 20 sentences each (3843 words).

The speaker-independent word-error rate for this test set is 27·4%. The same system tested on native speakers resulted in a 12·0% word-error rate. The degradation in performance for the non-native speakers clearly shows that an adaptation algorithm is necessary in order to allow the system to be used by them as well.


266 C. Boulis et al.

TABLE I. Comparison of different initialization methods

                    Initialization method
# EM iterations      I       II      III     IV
 1                 23·8    21·9    20·1    19·0
 5                 18·0    16·9    16·5    15·9
 8                 17·2    16·3    16·0    15·6
10                 16·6    15·8    15·8    15·7
12                 16·2    15·7    15·8    15·6
15                 15·8    15·5    15·7    15·5

All the adaptation experiments were performed only on the means, due to the memory constraints explained in Section 5, but also because altering the covariances as well did not provide significant gains in the past.

We first evaluated the different initialization schemes. We used 40 adaptation sentences for each speaker, with a single class (all states belong to the same class) and ten transforms per class. The transform weight threshold was set to five samples, and the transforms estimated were solely block-diagonal, with three submatrices each (cepstrum, delta and delta–delta). The grouping method was the linear combination. The results are shown in Table I.

Method I is the initialization with the Hadamard matrix. Method II is the initialization combining MLLR and Hadamard, with the MLLR matrix estimated with one EM iteration; that is, prior to these adaptation experiments we estimated the global class MLLR transform for one EM iteration and then added it to the Hadamard matrix. Method III is the initialization scheme with the MLLR transforms for each subclass estimated using one EM iteration, and method IV is the same as method III but using five EM iterations. For methods III and IV we estimated MLLR transforms for ten classes using one and five EM iterations respectively, and used these transforms as initial values for the MLST transforms.

The results show that after many EM iterations all the schemes tend to converge. However, the rate of convergence differs markedly between the initialization methods: we observe from Table I that method I needs 15 EM iterations to achieve the performance that the other methods reach with ten. Although methods III and IV tend to perform a little better than method II, we used method II in our subsequent experiments. This is because method II needs to be run once for each number of classes, irrespective of the number of transforms for those classes, whereas methods III and IV need to be run for each combination of number of classes and number of transforms per class, making the adaptation procedure slower.
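A minimal sketch of the two cheaper initializations can be given as follows. The Sylvester construction of the Hadamard matrix is standard; how exactly the Hadamard rows perturb each initial transform is our assumption here (the paper does not spell it out), so `eps`, `mllr_A` and the outer-product perturbation are illustrative placeholders only.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

# Method II, sketched: add a one-EM-iteration MLLR estimate to a scaled
# Hadamard-derived perturbation so that the multiple transforms of a class
# start from distinct, non-identical points.
d = 4                                   # toy feature dimension
mllr_A = np.eye(d)                      # placeholder for the estimated MLLR matrix
eps = 0.01                              # small scale: stay close to the MLLR start
init_transforms = [mllr_A + eps * np.outer(hadamard(d)[k], np.ones(d))
                   for k in range(d)]
```

The mutually orthogonal Hadamard rows guarantee that no two initial transforms coincide, which is the point of the perturbation: identical starting transforms would remain identical under EM.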

We then evaluated the effect of transform weight tying. We used 40 adaptation sentences for each speaker, ten acoustic classes for the transforms with ten transforms per class, block-diagonal transforms initialized with method II, and the linear combination of the means of the transformed Gaussians. All the experiments were run for ten EM iterations. The results are summarized in Table II.

From Table II we can clearly observe that it is beneficial to introduce tying between transform weights, although the best results were produced with a small threshold value. The first entry (zero threshold) corresponds to no tying, that is, Gaussian-specific transform weights are used in all cases; this is the setting used in Diakoloukas and Digalakis (1999). The last entry (infinite threshold) corresponds to always estimating class-level transform weights: that is, apply the same transform to all the Gaussians of the class, but estimate a linear combina-


TABLE II. Effect of transform weight tying

Transform weight threshold     0       5      10      50      ∞
Word-error rate (%)          16·5    14·1    14·3    15·0    17·2

TABLE III. Comparing transform types

                   # Transforms per class
Transform type       6       8      10      12
F, S, D            14·4    14·7    13·9    14·0
F                  14·7    14·6    16·1    15·9
S                  14·9    14·7    14·1    14·4
D                  17·0    15·7    16·1    15·8

tion of transforms rather than a single transform. This case is similar in notion to MLLR. The optimum value of the transform weight threshold appears to be five samples.
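The linear-combination grouping and the weight-tying backoff just described can be sketched in a few lines. This is a schematic reading of the scheme, not the paper's implementation; the function names and the (A, b) transform representation are our assumptions.

```python
import numpy as np

def mlst_mean(mu, transforms, weights):
    """Linear-combination grouping: the adapted mean is the weighted sum
    of the individually transformed means, mu' = sum_k w_k (A_k mu + b_k)."""
    return sum(w * (A @ mu + b) for w, (A, b) in zip(weights, transforms))

def pick_weights(n_samples, gauss_weights, class_weights, threshold=5):
    """Transform weight tying: use Gaussian-specific weights when the
    Gaussian has enough samples, otherwise back off to the class-level
    weights (a threshold of five samples was best in Table II)."""
    return gauss_weights if n_samples >= threshold else class_weights
```

For example, with two transforms (identity and a doubling of the mean) and equal weights, a unit mean is mapped to 1·5 times itself, i.e. halfway between the two transformed means.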

We then evaluated the effect of the transform type. We used 40 adaptation sentences with ten classes and a varying number of transforms per class. The grouping method was the linear combination. The results are presented in Table III.

The first transform type (F, S, D) estimates full, structured or diagonal transforms. That is, three different thresholds are set, one for each transform type, and the following simple algorithm is used.

Denote by n_ck the number of samples that the kth transform of the cth class has received.

    if n_ck > threshold for full transform:
        estimate full transform
    else if n_ck > threshold for structured transform:
        estimate structured transform
    else if n_ck > threshold for diagonal transform:
        estimate diagonal transform

Under this scheme, if a class has enough samples to estimate any kind of transform then it will estimate one, in contrast with the other three schemes, where only full, only structured and only diagonal transforms, respectively, are estimated. It should be noted that the above figures are obtained with the same number of effective EM iterations. In the case of the full transforms not every class can estimate the initial number of transforms, so the algorithm takes some EM iterations to converge to the optimum number of transforms for each class; in these cases more EM iterations are used, so that all experiments are compared at the same number of effective EM iterations. Table III shows that there are only marginal differences among the transform types, except for the diagonal-only scheme.
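The threshold cascade above is directly executable; the sketch below restates it as a function, with illustrative threshold values (the paper sets the three thresholds empirically, so these numbers are placeholders).

```python
def choose_transform_type(n_ck, full_thr=1000, structured_thr=300, diag_thr=50):
    """Select the richest transform type that n_ck samples can support.
    n_ck is the number of samples received by the kth transform of the
    cth class; the three thresholds are illustrative, not the paper's."""
    if n_ck > full_thr:
        return "full"
    if n_ck > structured_thr:
        return "structured"
    if n_ck > diag_thr:
        return "diagonal"
    return None  # too few samples: the transform is not estimated
```

Richer transform types need more data (a full d × d matrix has far more free parameters than a diagonal one), which is why the cascade orders the tests from full down to diagonal.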

Next we evaluated the different grouping methods. We used 40 adaptation sentences,


TABLE IV. Linear combination of transforms for 40 adaptation sentences

                           # Acoustic classes
# Transforms per class      2       5      10      20      30
 1                        20·6    17·8    16·4    15·6    16·4
 2                        18·7    15·1    14·7    15·4    15·3
 4                        17·0    14·4    14·5    15·2    14·7
 6                        16·4    14·3    14·4    15·0    14·8
 8                        15·7    14·1    14·7    15·2    15·9
10                        15·6    13·5    13·9    14·7    17·3
12                        15·4    14·1    14·0    14·9    19·6

TABLE V. Selecting the transform with the highest weight for 40 adaptation sentences

                           # Acoustic classes
# Transforms per class      2       5      10      20      30
 1                        20·6    17·8    16·4    15·6    16·4
 2                        18·7    15·2    15·1    16·1    16·3
 4                        16·6    15·3    15·2    16·8    16·5
 6                        16·5    14·9    16·9    16·3    17·0
 8                        16·2    15·0    16·7    16·5    16·1
10                        16·1    15·4    16·2    16·4    16·4
12                        16·0    16·0    16·3    16·3    16·1

allowed estimation of any kind of transform (F, S, D) and initialized with the MLLR + Hadamard matrix. The results using the linear combination scheme are summarized in Table IV.

The first row of Table IV shows the MLLR results. We can clearly see a gain from introducing multiple transforms, with as few as two transforms per class. It is interesting to note that by using as few as two classes for the transforms we can outperform MLLR with 20 classes. This clearly shows the importance of estimating transform weights on subsets of the transforms' classes.

The results using the transform with the highest weight for 40 adaptation sentences are summarized in Table V. The results using the merging of the transformed Gaussians are summarized in Table VI.

From the above we observe that the linear combination of transforms produces the best results. We can also observe that the merging scheme deteriorates quickly as the number of transforms per class increases, although as few as two transforms per class produced better results than selecting the transform with the highest weight.

We also evaluated our algorithm using 20 and 10 adaptation sentences. Since the linear combination scheme seems to outperform the others, we ran experiments only for this case. For the case of 20 adaptation sentences the results are summarized in Table VII; for the case of ten adaptation sentences the results are summarized in Table VIII.

In Table IX we show the best MLLR result for each number of adaptation sentences, compared with the best MLST result. From Table IX we can observe the superiority of the MLST algorithm even for small adaptation sets. For the case of 40 adaptation sentences the gain


TABLE VI. Merging the transform Gaussians for 40 adaptation sentences

                           # Acoustic classes
# Transforms per class      5      10      20      30
 1                        17·8    16·4    15·6    16·4
 2                        14·9    15·0    15·8    15·9
 4                        15·5    17·2    18·6    19·7
 6                        18·0    18·8    19·9    20·7
 8                        19·1    22·1    22·9    22·7
10                        19·1    22·1    23·1    23·9
12                        19·3    22·8    23·3    23·6

TABLE VII. Linear combination of transforms for 20 adaptation sentences

                           # Acoustic classes
# Transforms per class      1       2       5      10      20
 1                        21·0    19·1    18·8    17·3    18·9
 2                        19·6    17·2    16·3    16·6    18·1
 4                        17·3    16·9    16·4    16·8    18·1
 6                        17·4    16·5    16·3    16·6    17·6
 8                        16·9    16·8    16·4    17·7    17·6
10                        16·9    16·2    17·1    17·3    18·1
12                        16·8    16·9    17·6    16·6    18·1

TABLE VIII. Linear combination of transforms for 10 adaptation sentences

                           # Acoustic classes
# Transforms per class      1       2       5      10
 1                        21·4    20·6    19·6    20·3
 2                        20·0    18·0    18·9    23·1
 4                        18·1    19·0    19·0    21·0
 6                        18·7    19·0    20·1    20·5
 8                        18·3    18·9    20·0    21·0
10                        17·8    19·7    20·5    20·4
12                        17·5    18·8    19·5    20·0

TABLE IX. Best results for MLLR and MLST under different amounts of adaptation data

# Sentences    MLLR    MLST
10             19·6    17·5
20             17·3    16·2
40             15·6    13·5


over MLLR is 2·1% absolute, or 13·5% relative. The improvements we have introduced give MLST significantly better performance than that reported in Diakoloukas and Digalakis (1999): the original MLST on the same task achieved a WER of 16·8% for 40 adaptation sentences.

7. The basis transformation approach

MLST is a more general method than MLLR, but as the amount of adaptation data decreases both methods will exhibit the same performance: when there are only enough data to estimate one transformation, MLST degenerates to MLLR. In this section we introduce a variant of the original MLST algorithm, specifically tailored for use with very few adaptation sentences.

It is widely known that the main problem with maximum likelihood techniques (such as MLST) is the number of parameters used: if the number of parameters is high, then we need many adaptation sentences to obtain robust estimates. The majority of the adaptation parameters in MLST are the transformations' elements. We can drastically reduce the number of adaptation parameters by estimating the transformations on other speakers and adapting to a new speaker by estimating only the transform weights. This can be seen as a speaker similarity scheme that exploits the similarities that exist between speakers. Because the adaptation parameters are now reduced by an order of magnitude, we can use this variant of MLST, termed basis transformation (BT), with very few data.
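The parameter-count argument is simple arithmetic and can be made concrete. The sketch below compares the two counts under illustrative settings (ten weight classes, 27 basis transforms, a 39-dimensional feature vector); the helper names are ours.

```python
def mlst_params(n_classes, n_transforms, d):
    """Full MLST: every transform (a d x d matrix plus a d-dimensional
    offset, i.e. d * (d + 1) free parameters) is estimated from the
    new speaker's data."""
    return n_classes * n_transforms * d * (d + 1)

def bt_params(n_weight_classes, n_transforms):
    """Basis transformation: the transforms are pre-trained on other
    speakers; only one weight per transform, per weight class, is
    estimated for the new speaker."""
    return n_weight_classes * n_transforms
```

With ten classes, three 39-dimensional transforms per class, full MLST estimates 46 800 parameters per speaker, while BT with ten weight classes and 27 basis transforms estimates only 270 — a reduction of more than two orders of magnitude, consistent with the claim above.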

A first version of BT is presented in Boulis and Digalakis (2000). The algorithm was evaluated on the Swedish ATIS corpus, where the speaker-independent system is trained on non-dialect speakers. We also had available a moderate number (39) of dialect (Scania) speakers; we used 31 of them to generate the basis transforms and the rest for evaluation. Because of the moderate number of dialect speakers, we estimated MLLR transforms for each one of them and then used all the transforms as the basis transforms for the new speakers. When compared with MLLR, the gains were almost 40%. However, the comparison was not entirely fair, since experiments showed that the transformations were already capturing an important part of the mismatch, leaving little for the transform weights: keeping the transform weights equal during adaptation still offered significant gains. The transformations can be seen as capturing the dialect part of the mismatch, while updating the transform weights can be seen as performing speaker adaptation.

In this section we evaluate BT on native speakers of the WSJ corpus, in order to assess the performance of our method when the transformations are estimated on statistically the same data as were used during the training of the speaker-independent system. Because the number of training speakers in the WSJ corpus is very high (245), we cannot repeat the methodology used in ATIS, mainly because of the memory constraints of the method. But even if the memory problem were not present, this approach would not be optimal: with 245 transformations per class, the adaptation method would have so many parameters that it could not be used with limited amounts of data. If we use more compact adaptation models, not only will we lower the memory requirements, but we will probably also improve performance.

With the MLST algorithm we can specify the number of transformations per class, so we could use MLST to generate the desired number of basis transformations. But since the data from the training speakers have also been used to construct the SI system, this would yield identity (or very nearly identity) transformations.

Instead, we decided to cluster the speakers into sets that contain acoustically similar speakers


and then use MLST to generate the BTs on each cluster. In this way the transformations will be far from identity, and the number of BTs that can be generated is still our choice. From this perspective we can see BT as a cluster-similarity method: the estimation of the transform weights, which is done with the data of the new speaker, reflects the similarity of the new speaker to the predefined clusters.

8. Clustering the training speakers

The first step in our procedure is to apply a speaker-clustering algorithm to create an initial set of clusters. For this purpose we used a modified version of the algorithm of Sankar, Beaufays and Digalakis (1995). In Sankar et al. (1995) the inter-speaker distances are first calculated according to

D(m, n) = (1/2) [d(m, n) + d(n, m)]    (17)

where

d(m, n) = log p(x_m | λ_m) − log p(x_m | λ_n)    (18)

d(n, m) = log p(x_n | λ_n) − log p(x_n | λ_m)    (19)

where log p(x_m | λ_n) is the log-probability of observing the data x_m of speaker m using the models λ_n characterizing speaker n. After calculating the inter-speaker distances, the clustering proceeds using the following algorithm:

Step 0: Set the initial number of clusters to the number of speakers.
Step 1: If the number of clusters is the desired one, exit; else go to step 2.
Step 2: Find the pair of clusters with the minimum inter-cluster distance and merge their components into one cluster.
Step 3: Update the distance of the newly formed cluster with the other clusters. For example, the inter-cluster distance between clusters c and k can be calculated using

D′(c, k) = avg_{m,n} D(m, n)   ∀ m ∈ c, n ∈ k    (20)

Step 4: Decrement the number of clusters by one and go to step 1.

Other choices exist for the inter-cluster distance, but the metric in Equation (20) was shown in Sankar et al. (1995) to create the most balanced clusters; that is, the clusters created have, as nearly as possible, the same size. However, in our experiments the clusters were far from balanced, even with the metric of Equation (20).

We felt intuitively that roughly the same number of speakers should be present in each cluster to achieve the best results. We therefore altered Equation (20) and introduced a penalty term for clusters with many speakers. The new inter-cluster distance is written as

D′(c, k) = avg_{m,n} D(m, n) + a log(l_c + l_k)   ∀ m ∈ c, n ∈ k    (21)

where l_c and l_k are the sizes of clusters c and k respectively, and the factor a is set empirically. We can see from Equation (21) that a merge producing a cluster with many speakers is penalized more than one producing a cluster with fewer speakers.

First, we used the 40 sentences common to all speakers to create speaker-adapted models for each speaker using MLLR. We then ran the forward–backward algorithm on these 40 sentences to calculate the probabilities log p(x_m | λ_n). Having calculated the inter-speaker distances, we ran the clustering algorithm to generate a varying number of clusters. The BTs were generated using the remaining 150 sentences of each speaker.
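The distance computation of Equations (17)–(19) and the penalized bottom-up merging of Equation (21) can be sketched as follows. This is an O(n³)-style illustrative implementation, not the paper's code; the log-likelihood matrix is assumed to be precomputed by the forward–backward pass described above.

```python
import math
import numpy as np

def pairwise_distance(loglik):
    """Equations (17)-(19). loglik[m][n] = log p(x_m | lambda_n), the
    log-probability of speaker m's data under speaker n's adapted models.
    Returns the symmetric matrix D(m, n) = (1/2)[d(m, n) + d(n, m)]."""
    L = np.asarray(loglik, dtype=float)
    d = L.diagonal()[:, None] - L     # d(m, n) = log p(x_m|l_m) - log p(x_m|l_n)
    return 0.5 * (d + d.T)

def cluster_speakers(D, n_clusters, a=0.0):
    """Bottom-up merging with the size penalty of Equation (21):
    score(c, k) = avg_{m,n} D(m, n) + a * log(l_c + l_k)."""
    clusters = [[m] for m in range(len(D))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                c, k = clusters[i], clusters[j]
                avg = sum(D[m][n] for m in c for n in k) / (len(c) * len(k))
                score = avg + a * math.log(len(c) + len(k))
                if best is None or score < best[0]:
                    best = (score, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge the closest pair
    return clusters
```

With a = 0 this reduces to the original metric of Equation (20); a positive a discourages merges that would create oversized clusters, which is the balancing behaviour motivated above.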


9. Generating basis transformations

We experimented with two different ways of generating the BTs. The first approach is to pool the data of all the speakers of a cluster and generate transformations for each cluster; all the transformations from all the clusters are then used during adaptation. In this way we have enough data to estimate as many transformations as we wish. However, the transformations are cluster-specific and do not represent well the adaptation functions for specific speakers.

The second approach was introduced to better match the conditions under which we intend to use our adaptation algorithm. Since we perform speaker adaptation, specific speakers are adapted, and therefore cluster-specific transformations may not be a very good selection for the BTs. In addition, the speakers used for the evaluation of the method are not included in the training of the SI system, while the basis transformations in the first case are generated using speakers included in the training process. Thus, having calculated an initial set of clusters, we continued by selecting a centroid speaker for each cluster. We then retrained a system without the centroid speakers of the clusters and generated speaker-specific transformations on the new system. The centroid for each cluster is calculated by pooling the data of all speakers of a cluster and using MLLR to estimate cluster-specific models. The forward–backward algorithm is then run on the data of each speaker of the cluster, and the speaker with the highest probability p(x_m | λ_n) (where λ_n are the cluster-specific models) is selected as the centroid speaker.
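The centroid selection step itself is a simple arg-max over cluster members. The sketch below shows only that final step, assuming the per-speaker log-likelihoods under the cluster-specific models have already been computed; the function name and the dictionary layout are ours.

```python
def centroid_speaker(cluster, loglik_cluster):
    """Pick the cluster member whose data has the highest probability
    under the cluster-specific (MLLR-adapted) models.
    cluster        : list of speaker ids belonging to the cluster
    loglik_cluster : dict mapping speaker id -> log p(x_m | lambda_c)"""
    return max(cluster, key=lambda m: loglik_cluster[m])
```

Working in the log domain makes the comparison numerically safe: the speaker maximizing log p(x_m | λ_c) also maximizes p(x_m | λ_c).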

Each approach has a number of advantages and disadvantages. The second approach attempts to better match the conditions under which we will evaluate our algorithm, by estimating speaker-specific transformations on speakers that were not present during the training phase. However, since we remove some of the speakers during the construction of the new SI system, it will be less robust than the initial one; and since a high number of clusters is expected to be used, the fraction of the data removed may be substantial. This approach also takes much more computing time than the first, since a different SI system needs to be built for each number of clusters.

The first approach does not require the construction of a new SI system for each number of clusters, but it estimates the BTs under conditions different from those of adaptation to a new speaker: the BTs are now cluster-specific, which means that the transformations are averaged over many speakers and that those speakers are included in the training of the SI system. An advantage of this approach is that more data are available for each transformation, so more robust estimates can be obtained.

10. Experiments for basis transformations

We evaluated our algorithm using SRI's DECIPHER™ system built for the WSJ task; the configuration of the system was described in Section 6. The test set is the generic development and testing set included in the 1993 distribution of the WSJ. We had available 513 sentences from ten journalist speakers, almost 52 sentences per speaker. The first 20 sentences from each speaker were left out to be used for adaptation, and the remaining 313 sentences constituted the actual test set (5420 words). The speaker-independent word-error rate for the 313-sentence test set is 11·9% using a bigram language model. All subsequent adaptation experiments are supervised: that is, the true transcription of each sentence is known beforehand.

First we evaluated the MLLR performance using 1–20 adaptation sentences. The best MLLR results are summarized in Table X. Note that different numbers of EM iterations and numbers


TABLE X. MLLR performance for different numbers of adaptation sentences

Number of sentences     1       2       5      10      20
WER (%)               11·2    10·7    10·6    10·2    10·2

TABLE XI. Cluster-specific transformations for 9 clusters, 3 transformations per class

                     Number of sentences
# Weight classes      1       2       5      10      20
 1                  11·8    11·7    11·2    11·3    11·3
10                  11·7    11·6    11·4    11·2    11·3
30                  11·7    11·6    11·5    11·2    11·4

of classes are needed to achieve the best MLLR results for a given number of sentences. The best MLLR results are obtained using block-diagonal transformations in all cases.

From Table X we observe that MLLR works very well for this specific adaptation and test set. With as little as one adaptation sentence it achieves a 6% relative improvement over the SI system. MLLR also seems to saturate very quickly for this test set: with 10 and 20 adaptation sentences we see no difference, and with as few as two adaptation sentences we have already obtained 70% of the total WER reduction.

For the first experiment with the basis transformation method we generated nine clusters per gender. We then used the MLST algorithm with ten regression classes for the transformations and three transformations per class. During the generation of the basis transformations using MLST, the number of regression classes for the transform weights was set to ten for all experiments. The transformations were cluster-specific; that is, the data of all the speakers of a cluster were pooled together to estimate the transformations. During adaptation the new speaker uses 9 × 3 = 27 transformations per class. We used the linear combination grouping to return to the original number of Gaussians in all of the experiments. We also used a varying number of regression classes for the transform weights during adaptation: if the number of samples a weight class received was below a threshold, the backoff transform weight vector was used (one backoff weight class in all the experiments). The threshold was determined empirically at 40 samples for this experiment. The results are shown in Table XI.

These first results are quite poor: little or no adaptation can be seen for limited adaptation data, and the method saturates quickly. Next we generated basis transformations for nine clusters per gender but one transformation per class (the MLLR case). The transform weight threshold was set to ten samples; all the other parameters were kept the same as in the first experiment. The results are shown in Table XII. Again we observe the same picture as in the first experiment: little or no adaptation for limited data, and a small improvement as the amount of adaptation data increases.

Next we used the second approach for generating the BTs. We used nine clusters per gender, determined the centroid speaker for each cluster, retrained a system without the nine centroid speakers (18 for both genders) and generated speaker-specific transforma-


TABLE XII. Cluster-specific transformations for 9 clusters, 1 transformation per class

                     Number of sentences
# Weight classes      1       2       5      10      20
 1                  11·7    11·6    11·4    11·3    11·3
10                  11·7    11·6    11·4    11·4    11·3
30                  11·7    11·7    11·4    11·3    11·2

TABLE XIII. Speaker-specific transformations for 9 clusters, 1 transformation per class

                     Number of sentences
# Weight classes      1       2       5      10      20
 1                  12·0    12·3    11·4    11·5    11·4
10                  12·3    12·0    11·6    11·5    11·4
30                  11·8    11·9    11·6    11·6    11·5

tions. We used one transformation per class (the MLLR case). The new SI system has a WER of 12·0%, which shows no degradation with the 18 speakers removed from the training data. The results are shown in Table XIII.

These results are directly comparable with those in Table XII, and Table XIII shows even poorer results for the second approach. Even though the WER of the new SI system was not significantly worse, we noticed that during adaptation the acoustic probability of the adaptation sentences was significantly lower under the new SI system than under the original one. Although subsequent EM iterations improved the acoustic probability as expected, it never reached the acoustic probability obtained with the cluster-specific transformations. Thus, we conclude that removing speakers from the training data (even a moderate number such as 18) results in a worse initial system for our adaptation. The rest of our experiments were conducted using the original SI system.

Before proceeding to the next experiments, we ran a set of diagnostic tests to obtain a more complete view of the method. First, we used as adaptation data the data of a training speaker for whom a BT had been estimated. Using the second approach, where we have speaker-specific transformations, we noticed that during adaptation the transform weights corresponding to that speaker's own transformations were much higher than all the other weights. This is encouraging, since it confirms that the method behaves as a speaker similarity scheme. Second, we kept all the transform weights equal, to see whether any adaptation is performed at all without updating the transform weights. Using the cluster-specific transformations with three transformations per class we obtained a 12·1% WER; note that with ten adaptation sentences we achieve an 11·2% WER. This result shows that adaptation is indeed performed by allowing the updating of the transform weights, and also that the transformations themselves do not capture any of the mismatch, as opposed to the ATIS experiments with dialect speakers. Third, using the speaker-specific transformations, we selected as reference speakers not the centroids but random speakers. The result was benchmarked at 13·0%, which shows that centroid speakers are more suitable to constitute our basis.

Next we wanted to explore the influence of adding more transformations per class. We


TABLE XIV. Cluster-specific transformations for 9 clusters, 7 transformations per class

                     Number of sentences
# Weight classes      1       2       5      10      20
 1                  11·5    11·3    11·1    11·1    10·9
10                  11·4    11·2    11·0    11·0    10·9
30                  11·4    11·2    11·0    11·0    11·1

TABLE XV. Speaker-specific transformations for 9 clusters, 7 transformations per class

                     Number of sentences
# Weight classes      1       2       5      10      20
 1                  11·3    11·6    11·3    11·5    11·2
10                  11·4    11·4    11·2    11·4    11·2
30                  11·4    11·4    11·0    11·2    11·1

TABLE XVI. Speaker-specific transformations for 23 clusters, 3 transformations per class

                     Number of sentences
# Weight classes      1       2       5      10      20
 1                  11·2    11·1    11·1    11·0    11·0
10                  11·2    11·1    10·9    11·1    10·8
30                  11·2    11·2    11·0    11·2    10·9

used cluster-specific transformations on nine clusters per gender, two regression classes for the transformations and seven transformations per class. The transform weight threshold was set to 100 samples. During adaptation the transform weight vector now has 9 × 7 = 63 elements. The results are shown in Table XIV.

The results of Table XIV are improved compared with the previous tables: adaptation now becomes more evident under any amount of adaptation data. Next, to compare the performance of cluster-specific transformations with speaker-specific transformations, we generated speaker-specific transformations on the original SI system, without retraining. All the settings are the same as for the experiments in Table XIV, but now the centroid speakers are used instead of all the speakers of a cluster. The results are shown in Table XV.

Comparing Tables XV and XIV, we notice no significant difference in performance. It seems that our method is quite insensitive to the methodology used to generate the transformations.

Next we wanted to explore the effect of increasing the number of clusters. To have results comparable with Tables XV and XIV, we should keep roughly the same total number of transformations per class. We generated speaker-specific transformations and used 23 clusters per gender, two regression classes for the transformations and three transformations per class (the total number of transformations per class during adaptation is 23 × 3 = 69). The results are shown in Table XVI.


TABLE XVII. Speaker-specific transformations for 23 clusters, 4 transformations per class

                     Number of sentences
# Weight classes      1       2       5      10      20
 1                  11·3    11·2    11·1    10·9    10·9
10                  11·2    11·1    11·0    11·0    11·1
30                  11·2    11·1    11·0    11·2    11·2

TABLE XVIII. Speaker-specific transformations for 23 clusters, 3 transformations per class

                     Number of sentences
# Weight classes      1       2       5      10      20
 1                  11·2    11·5    11·1    11·3    10·9
10                  11·2    11·4    11·0    11·3    11·2
30                  11·1    11·0    11·2    11·3    11·4

Again we do not notice any clear difference between the two configurations. We can see that our method saturates very quickly and that the performance is essentially the same for five sentences and more. Comparing Table XVI with Table XV, we conclude that the number of clusters is not as important as the total number of transformations used in adaptation.

In the next experiment we increased the number of transformations even further. We generated speaker-specific transformations for 23 clusters per gender, two regression classes for the transformations and four transformations per class (the total number of transformations per class during adaptation is 23 × 4 = 92). The results are shown in Table XVII. A slight decrease in performance is observed, which can be attributed to the fact that we have a high number of adaptation parameters and robust estimates cannot be obtained.

All the above experiments were conducted using block-diagonal transformations. We also ran an experiment with the same settings as Table XVI (speaker-specific transformations, 23 clusters per gender, two regression classes for transformations, three transformations per class) but with full rotation matrices. The results are shown in Table XVIII.

Again the same picture is present. The transform type seems to play no major role in the performance of our method. We notice a slight decrease for many adaptation sentences, while the results for few adaptation sentences are marginally better.

Next we wanted to explore the effect of increasing the number of regression classes for transformations. We used a system with ten regression classes for transformations instead of two (speaker-specific transformations, 23 clusters per gender, ten regression classes for transformations, three transformations per class). The results are shown in Table XIX.

Almost the same results are obtained as in Table XVI, so the number of regression classes for the transformations seems to be insignificant too.

Having completed this set of experiments we can draw some conclusions. First, the BT method is insensitive to many factors. The number of clusters per gender, the number of regression classes for the transformations, the transform type (block-diagonal or full), the methodology used (cluster-specific or speaker-specific) and the number of transform weight classes during adaptation seem to have only a marginal impact on performance. Second, the only factor that was observed to play an important role is the number of transformations



TABLE XIX. Speaker-specific transformations for 23 clusters, 10 regression classes for transformations, 3 transformations per class

                         Number of sentences
# weight classes      1      2      5     10     20
        1          11·5   11·5   11·4   11·0   10·9
       10          11·2   11·1   11·0   10·9   11·0
       30          11·4   11·2   11·1   11·4   11·3

TABLE XX. Best results for MLLR and BT under different amounts of adaptation data

# sentences    MLLR     BT
     1         11·2   11·1
     2         10·7   11·0
     5         10·6   10·9
    10         10·2   10·9
    20         10·2   10·8

per class. Comparing Tables XVI and XI or XII we see a clear gain by increasing the number of transformations per class. This is expected, since during adaptation the new speaker will be able to choose or interpolate from a wider range of suitable values.

In Table XX we summarize the best MLLR results and the best BT results. From Table XX we observe that for limited adaptation data the two methods achieve the same performance, but as the number of adaptation sentences increases MLLR continues to improve, in contrast with the BT method, which saturates very quickly. An interesting point to note is that we have essentially the same performance in the BT method for 1–20 adaptation sentences. This was partly expected from the start, since this method was designed to work under very limited adaptation data.

A disadvantage of our method, as it was applied, is that it interpolates transformations on a predefined set of clusters. This generic set of clusters may not be optimal for every speaker. Of course, a new speaker using the BT method is able to determine its neighbours via the transform weights, but perhaps we can improve the performance if we allow the initial set of clusters to be specific to the new speaker. An implementation of this could be to have speaker-adapted models for each new speaker (i.e. using MLLR). Then we run the forward–backward algorithm for each one of the training speakers and calculate the log probability of observing the data of the training speaker using the speaker-adapted models of the new speaker. We then select the training speakers with the top N log probabilities and use them as our basis. However, this implementation is very time consuming and therefore unlikely to be used in practice.
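The ranking step described above can be sketched as follows. This is a minimal illustration only: `log_likelihood` stands in for a full forward–backward pass over a training speaker's data under the new speaker's adapted models, and is a hypothetical helper.

```python
import heapq

def top_n_speakers(train_speakers, log_likelihood, n=5):
    """Rank training speakers by the log probability of their data under
    the new speaker's adapted models and keep the top n as the basis.

    log_likelihood(speaker) stands in for a forward-backward pass over
    that speaker's data (hypothetical helper, not a real API)."""
    scored = [(log_likelihood(s), s) for s in train_speakers]
    return [s for _, s in heapq.nlargest(n, scored)]

# Toy usage with made-up log probabilities:
scores = {"spk1": -120.0, "spk2": -95.5, "spk3": -110.2, "spk4": -300.1}
basis = top_n_speakers(scores.keys(), scores.get, n=2)
# basis now holds the two closest training speakers by this measure.
```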

Another approach is to combine MLLR and BT. The method described in the previous paragraph can be seen as applying first MLLR and then BT. If we reverse their order, that is, apply first BT and then MLLR, we have a manageable way of applying the compound method. First the BT method is used to obtain an initial adaptation to the new speaker. Then, the transformed models are used as the initial system for MLLR. In this way, we can apply



TABLE XXI. Combination of BT and MLLR

# sentences    MLLR    BT + MLLR
     1         11·2      11·0
     2         10·7      10·6
     5         10·6      10·2
    10         10·2      10·2
    20         10·2       9·9

MLLR from a better starting point. The results using the combination of BT and MLLR are shown in Table XXI.

The results in Table XXI show that although the combination of the two methods results in consistently lower WER compared with Table XX, the numbers are not statistically different (Gillick & Cox, 1989). This can be attributed to the fact that we perform adaptation to native speakers. This means that the mismatch between the speaker-independent models and each speaker is not as profound as for non-native speakers, and so any adaptation method is not expected to offer very high gains. Nevertheless, a 15% reduction in WER is observed by using five adaptation sentences from each speaker.

11. Related work on basis transformations

Some adaptation schemes that attempt to exploit similarities between speakers have appeared in the literature. In Padmanabhan, Bahl, Nahamoo and Picheny (1995) speaker-dependent models for each one of the training speakers are constructed. Because the number of sentences available for each speaker is small, robust ML estimates cannot be obtained. To overcome this problem, a single Gaussian PDF is used for each context-dependent state and the MAP training algorithm is used for the estimation of the system parameters. When a new test speaker is available for adaptation, the forward–backward algorithm is performed for all the speaker-dependent systems, using the adaptation data as input. The log probability of observing the adaptation data given each speaker-dependent system is calculated. The training speakers are then ranked in order of this probability and the top N speakers are selected. For each one of the selected speakers an MLLR transformation is estimated that maps the training speaker's data to the new speaker. The counts that are necessary to estimate the transformation are accumulated using not the rough speaker-dependent models but the more detailed speaker-independent system. After the data of each one of the selected training speakers have been transformed, they are pooled together and standard re-estimation techniques follow. Since we use data from many speakers, there are enough data to apply MAP or ML techniques. With this method we can augment the data of each speaker by transforming the data of similar training speakers. In this way, we can achieve more detailed adaptation.

This method was tested on the WSJ task and proved to be superior to MLLR under little adaptation data (three and nine adaptation sentences). The method presented in Padmanabhan et al. (1995) is extremely computationally intensive and has added storage requirements. We see that prior to adaptation, a forward–backward algorithm is run for all speaker-dependent systems, then transformations are estimated for every selected speaker and finally standard re-estimation techniques follow. All these steps require excessive computing power and thus the method can only be applied offline. Also, since the speaker-dependent systems are stored and



we have a high number of training speakers, there are added storage requirements associated with this method.

In Gao, Padmanabhan and Picheny (1997) a variant of Padmanabhan et al. (1995) is presented. The work focuses on the time and space disadvantages of Padmanabhan et al. (1995) and introduces a sequence of improvements. In Gao et al. (1997) there are predefined clusters, and during adaptation the closest N clusters are chosen using the Euclidean or Mahalanobis distance. The cluster models are then transformed to better fit the adaptation data, and the re-estimation techniques result in a more suitable set of models for the new speaker. With the predefined clusters the time-consuming forward–backward step is alleviated. Also, the storage requirements are significantly decreased, since there are now cluster-dependent models, which are considerably smaller than speaker-dependent ones. In addition, the cluster-dependent models can be estimated more robustly than speaker-dependent ones, since there are many more adaptation data associated with a cluster than with a speaker. The results using this variant show a small improvement over Padmanabhan et al. (1995).

Work similar to the BT method is introduced in Kuhn et al. (1999). The principle that underlies both methods is that there exists a basis of speakers whose linear combination can adequately characterize any other speaker. In Kuhn et al. (1999) T speaker-dependent systems are first trained. The parameters of each speaker-dependent system are grouped into vectors of dimension D. Because T is usually high, projection techniques to lower-dimensional spaces are used. Using PCA we can select the first K eigenvectors of dimension D of the T × D matrix and use them as the basis. These K eigenvectors are called eigenvoices. An ML technique to estimate the eigenvalues during adaptation is given, similar to the Baum–Welch algorithm. The algorithm is applied to the ISOLET speech database, where isolated letters are spoken by many speakers, and it is shown to operate satisfactorily. However, the major disadvantage of Kuhn et al. (1999) is that it cannot scale to more complex tasks. A high number of speaker-dependent systems is needed, which is impossible to train robustly in tasks such as the WSJ. Also, the dimension D is small in tasks like ISOLET, but in tasks such as WSJ it would be huge. The memory requirements associated with this method would become unbearable for large-vocabulary continuous-speech systems.
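The eigenvoice construction can be sketched with standard PCA machinery. This is a minimal sketch under toy assumptions: the T model supervectors are simply stacked into a T × D matrix and the basis is taken from its SVD; in Kuhn et al. (1999) the adaptation weights are estimated by maximum likelihood from speech data rather than by the direct projection shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, K = 8, 20, 3   # toy sizes: 8 training speakers, 20-dim supervectors, 3 eigenvoices

X = rng.normal(size=(T, D))        # one supervector per speaker-dependent system
mean_voice = X.mean(axis=0)

# PCA via SVD of the mean-centred speaker matrix; rows of Vt are the
# principal directions, so the first K rows form the eigenvoice basis.
_, _, Vt = np.linalg.svd(X - mean_voice, full_matrices=False)
eigenvoices = Vt[:K]               # K x D orthonormal basis ("eigenvoices")

# A new speaker is constrained to mean_voice + sum_k w_k * eigenvoice_k.
# For a known target supervector the least-squares weights are projections:
target = rng.normal(size=D)
weights = eigenvoices @ (target - mean_voice)
reconstructed = mean_voice + weights @ eigenvoices
```

The point of the construction is that only the K weights (K ≪ D) need to be estimated from adaptation data.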

Also, various techniques based on dependencies between system parameters (Cox, 1995; Afify et al., 1997; Shahsahani, 1997) have been introduced in the past. These methods approach the problem of rapid speaker adaptation by estimating a set of correlations between the system parameters using the training set. By establishing a dependency between the parameters, a smaller set of adaptation data is needed in order to achieve the same level of robustness of the parameter values. These techniques have been shown to have performance comparable with MLLR, but the dependencies estimated are usually simple (correlation ratios) and the adaptation equations essentially perform smoothing between many parameters.

12. Conclusions

Two speaker adaptation methods have been introduced in this paper. First we introduced numerous improvements to the MLST algorithm that led to a method with significant improvement over MLLR for adaptation to non-native speakers on WSJ. MLST estimates multiple linear transforms per class and a transform weight vector per component. Because the transform weights comprise many fewer elements than the transformations, we can robustly estimate them using far fewer data. In this way, we can apply many effective transformations



since each component will receive its own transformation by estimating a transform weight vector and using the transformations that are shared by many components. Thus, we achieve increased adaptation resolution without sacrificing robustness. MLST removes a limitation of MLLR, that every component of a class must be transformed identically, yet retains a pleasant characteristic of MLLR, that the transformation of a component will be influenced by its neighbours.

A variant of the original MLST algorithm was also introduced, to operate under sparsity of data. The variant, BT, selects transformations estimated on other speakers and adapts to a new speaker by estimating the transform weights. BT can be seen as a speaker-similarity scheme exploiting the similarities that exist across different speakers. BT was evaluated on native speakers on the WSJ task. We applied clustering techniques to cluster the training speakers into sets and then used two different ways of generating the basis transformations. The method was shown to be insensitive to a number of factors, and the only factor that seems to influence the performance is the number of transformations per class. It was experimentally shown that when BT is cascaded with MLLR, marginal gains can be achieved in comparison with using MLLR only.

The authors would like to thank Ananth Sankar for having the original concept of the basis transformation method. This work is a continuation of the summer 1998 research team on "Rapid Speaker Adaptation to New Speakers" held at CLSP at Johns Hopkins University.

References

Afify, M., Gong, Y. & Haton, J. (1997). Correlation based predictive adaptation of hidden Markov models. Proceedings of the European Conference on Speech Communication and Technology 97, Rhodes, Greece, pp. 2059–2062.

Baum, L. E. (1972). An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities, 3, 1–8.

Boulis, C. & Digalakis, V. (2000). Fast speaker adaptation of continuous density HMM speech recognizer using a basis transform approach. Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Istanbul, June 2000.

Chesta, C., Siohan, O. & Lee, C.-H. (1999). Maximum a posteriori linear regression for hidden Markov model adaptation. Proceedings of the European Conference on Speech Communication and Technology, volume 1, pp. 211–214.

Cox, S. J. (1995). Predictive speaker adaptation in speech recognition. Computer Speech and Language, 9, 1–17.

Diakoloukas, V. & Digalakis, V. (March 1999). Maximum likelihood stochastic transformation adaptation of hidden Markov models. IEEE Transactions on Speech and Audio Processing.

Digalakis, V. & Neumeyer, L. (July 1996). Speaker adaptation using combined transformation and Bayesian methods. IEEE Transactions on Speech and Audio Processing, 294–300.

Digalakis, V., Rtichev, D. & Neumeyer, L. (September 1995). Speaker adaptation using constrained reestimation of Gaussian mixtures. IEEE Transactions on Speech and Audio Processing, 357–366.

Gales, M. (1997). Transformation smoothing for speaker and environmental adaptation. Proceedings of the European Conference on Speech Communication and Technology, 2067–2070.

Gales, M. & Woodland, P. C. (1996). Mean and variance adaptation within the MLLR framework. Computer Speech and Language, 10, 249–264.

Gao, Y., Padmanabhan, M. & Picheny, M. (1997). Speaker adaptation based on pre-clustering training speakers. Proceedings of the International Conference on Acoustics, Speech and Signal Processing 1997.

Gauvain, J.-L. & Lee, C.-H. (January 1994). Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Transactions on Speech and Audio Processing, 2, Part 1, 63–69.

Gillick, L. & Cox, S. J. (1989). Some statistical issues in the comparison of speech recognition algorithms. Proceedings of the IEEE Conference on Acoustics, Speech and Signal Processing, Glasgow, pp. 532–535. May 1989.

Gunawardana, A. & Byrne, W. (2000). Robust estimation for rapid speaker adaptation using discounted likelihood techniques. Proceedings of the International Conference on Acoustics, Speech and Signal Processing 2000.

Kuhn, R., Nguyen, P., Junqua, J. C., Boman, R., Niedzielski, N., Fincke, S., Field, K. & Contolini, M. (1999). Fast speaker adaptation using a priori knowledge. International Conference on Acoustics, Speech and Signal Processing 1999.

Leggetter, C. J. & Woodland, P. C. (1995). Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech and Language, 171–185.

Liporace, L. A. (September 1982). Maximum likelihood estimation for multivariate observations of Markov sources. IEEE Transactions on Information Theory, IT-28, 729–734.

Neumeyer, L., Sankar, A. & Digalakis, V. (1995). A comparative study of speaker adaptation techniques. Proceedings of the European Conference on Speech Communication and Technology 1995, pp. 1127–1130.

Padmanabhan, M., Bahl, L. R., Nahamoo, D. & Picheny, M. (1995). Speaker clustering and transformation for speaker adaptation in large vocabulary speech recognition systems. Proceedings of the International Conference on Acoustics, Speech and Signal Processing 1995, pp. 701–704.

Paul, D. & Baker, J. (1992). The design for the Wall Street Journal-based CSR corpus. Proceedings of the DARPA Speech and Natural Language Workshop, pp. 357–362. February 1992.

Sankar, A., Beaufays, F. & Digalakis, V. (1995). Training data clustering for improved speech recognition. Proceedings of the European Conference on Speech Communication and Technology 1995.

Sankar, A. & Lee, C.-H. (May 1996). A maximum likelihood approach to stochastic matching for robust speech recognition. IEEE Transactions on Speech and Audio Processing, 190–202.

Shahsahani, B. (March 1997). A Markov random field approach to Bayesian speaker adaptation. IEEE Transactions on Speech and Audio Processing.

Wang, Z. & Liu, F. (1999). Speaker adaptation using maximum likelihood model interpolation. Proceedings of the European Conference on Speech Communication and Technology 1999.

(Received 8 January 2001 and accepted for publication 10 May 2001)

Appendix

In this section we derive the estimation formulae for both the transform weights and the transforms under arbitrary tying.

Assume the adaptation data is a series of T observations generated by a stochastic process O,

\[
O = o_1 o_2 \cdots o_T. \tag{A1}
\]

Denote the current set of model parameters by \(\pi\) and a re-estimated set of model parameters by \(\bar{\pi}\). The sequences of states, Gaussian components and transforms used to generate O are given respectively by

\[
\theta = (\theta_0 \theta_1 \cdots \theta_T) \tag{A2}
\]
\[
\omega = (\omega_0 \omega_1 \cdots \omega_T) \tag{A3}
\]
\[
\lambda = (\lambda_0 \lambda_1 \cdots \lambda_T) \tag{A4}
\]

where \((\theta_0, \omega_0, \lambda_0) = (1, 1, 1)\). The likelihood of generating the observed speech frames while following the state sequence \(\theta\), Gaussian component sequence \(\omega\) and transform sequence \(\lambda\) is given by

\[
F(O, \theta, \omega, \lambda \mid \pi) = a_{\theta_T N} \prod_{t=1}^{T} a_{\theta_{t-1}\theta_t}\, p_{\theta_t \omega_t}\, w_{\theta_t \omega_t \lambda_t}\, b_{\theta_t \omega_t \lambda_t}(o_t) \tag{A5}
\]

where \(a_{ij}\) is the transition probability from state i to state j, N is the exit state, \(p_{ij} = p(\omega_t = j \mid \theta_t = i)\) is the probability of selecting Gaussian component j given state i, and \(w_{ijk} = p(\lambda_t = k \mid \theta_t = i, \omega_t = j)\) is the probability of selecting transform k



given state i and Gaussian component j. The term \(b_{ijk}(o_t)\) is the output probability of state i, Gaussian component j and transform k for frame \(o_t\). Assuming Gaussian densities for the output observation probabilities, we have

\[
b_{ijk}(o_t) = \frac{1}{(2\pi)^{n/2}\, |S_{ij}|^{1/2}}\, e^{-\frac{1}{2}(o_t - W_{ijk}\hat{\mu}_{ij})' S_{ij}^{-1} (o_t - W_{ijk}\hat{\mu}_{ij})} \tag{A6}
\]

where n is the dimension of each speech frame, \(S_{ij}\) is the covariance matrix for state i and Gaussian component j, \(\hat{\mu}_{ij} = [1\ \mu'_{ij}]'\) (\(\mu'_{ij}\) is the transpose of the mean vector of state i, Gaussian component j) and \(W_{ijk} = [b_{ijk} \mid A_{ijk}]\) is the kth transform for state i and Gaussian component j.

If all possible state sequences, Gaussian component sequences and transform sequences of length T are denoted by the sets \(\Theta, \Omega, \Lambda\) respectively, the total likelihood of the model set generating the observation sequence is

\[
F(O \mid \pi) = \sum_{\theta \in \Theta} \sum_{\omega \in \Omega} \sum_{\lambda \in \Lambda} F(O, \theta, \omega, \lambda \mid \pi). \tag{A7}
\]

This is the objective function to be maximized during adaptation. It is convenient to define an auxiliary function \(Q(\pi, \bar{\pi})\):

\[
Q(\pi, \bar{\pi}) = \sum_{\theta \in \Theta} \sum_{\omega \in \Omega} \sum_{\lambda \in \Lambda} F(O, \theta, \omega, \lambda \mid \pi) \log F(O, \theta, \omega, \lambda \mid \bar{\pi}). \tag{A8}
\]

Choosing model parameters to maximize the auxiliary function increases the value of the objective function (unless it is at a maximum). Therefore, successively forming a new auxiliary function with improved parameters iteratively maximizes the objective function. A proof of this is given in Baum (1972) and extended to mixture distributions and vector observations in Liporace (1982).

Using the re-estimated parameters in the output density function,

\[
\begin{aligned}
\log F(O, \theta, \omega, \lambda \mid \bar{\pi}) &= \log\left[\bar{a}_{\theta_T N} \prod_{t=1}^{T} \bar{a}_{\theta_{t-1}\theta_t}\, \bar{p}_{\theta_t\omega_t}\, \bar{w}_{\theta_t\omega_t\lambda_t}\, \bar{b}_{\theta_t\omega_t\lambda_t}(o_t)\right] \\
&= \log \bar{a}_{\theta_T N} + \sum_{t=1}^{T} \log \bar{a}_{\theta_{t-1}\theta_t} + \sum_{t=1}^{T} \log \bar{p}_{\theta_t\omega_t} + \sum_{t=1}^{T} \log \bar{w}_{\theta_t\omega_t\lambda_t} \\
&\quad + \sum_{t=1}^{T} \log \bar{b}_{\theta_t\omega_t\lambda_t}(o_t). \tag{A9}
\end{aligned}
\]

So we can write

\[
Q(\pi, \bar{\pi}) = \sum_{i=1}^{N} Q_{a_i}\!\left[\pi, \{\bar{a}_{ij}\}_{j=1}^{N}\right] + \sum_{i=1}^{N} \sum_{j=1}^{N_\omega} \sum_{k=1}^{N_\lambda} Q_b(\pi, \bar{b}_{ijk}) \tag{A10}
\]

where \(Q_{a_i}[\pi, \{\bar{a}_{ij}\}_{j=1}^{N}]\) depends only on the transition probabilities \(a_{ij}\), and since they are not adapted this term can be ignored. The second term can be written as

\[
Q_b(\pi, \bar{b}_{ijk}, \bar{w}_{ijk}, \bar{p}_{ij}) = \sum_{t=1}^{T} F(O, \theta_t = i, \omega_t = j, \lambda_t = k \mid \pi)\left(\log \bar{p}_{ij} + \log \bar{w}_{ijk} + \log \bar{b}_{ijk}(o_t)\right). \tag{A11}
\]



We can now define the quantity

\[
\gamma_{ijk}(t) = \frac{F(O, \theta_t = i, \omega_t = j, \lambda_t = k \mid \pi)}{F(O \mid \pi)}. \tag{A12}
\]
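In practice \(\gamma_{ijk}(t)\) is a posterior: the joint likelihood of frame t being aligned with (i, j, k), normalized by the total likelihood. A minimal numpy sketch of that normalization in the log domain (toy numbers standing in for forward–backward output, not a real decoder):

```python
import numpy as np

# Toy joint log-likelihoods log F(O, theta_t=i, omega_t=j, lambda_t=k | pi)
# for a single frame t: 2 states x 2 Gaussians x 3 transforms.
log_joint = np.log(np.array([
    [[0.10, 0.05, 0.05], [0.20, 0.10, 0.10]],
    [[0.05, 0.05, 0.05], [0.10, 0.10, 0.05]],
]))

# log F(O | pi) is the log-sum over all (i, j, k); subtracting it in the
# log domain yields the posteriors gamma_ijk(t) of Equation (A12).
log_total = np.logaddexp.reduce(log_joint.ravel())
gamma_t = np.exp(log_joint - log_total)
```

By construction the posteriors for a frame sum to one, which is what makes the count accumulations later in the derivation behave like soft occupancies.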

So Equation (A11) can be written as

\[
Q_b(\pi, \bar{b}_{ijk}, \bar{w}_{ijk}, \bar{p}_{ij}) = F(O \mid \pi)\left(\sum_{t=1}^{T} \gamma_{ijk}(t) \log \bar{p}_{ij} + \sum_{t=1}^{T} \gamma_{ijk}(t) \log \bar{w}_{ijk} + \sum_{t=1}^{T} \gamma_{ijk}(t) \log \bar{b}_{ijk}(o_t)\right). \tag{A13}
\]

The first term is independent of the adaptation parameters, so it can be ignored. Thus, the objective function to be maximized is

\[
Q(\pi, \bar{\pi}) \propto \sum_{i}^{N} \sum_{j}^{N_\omega} \sum_{k}^{N_\lambda} \sum_{t}^{T} \gamma_{ijk}(t) \log \bar{w}_{ijk} + \sum_{i}^{N} \sum_{j}^{N_\omega} \sum_{k}^{N_\lambda} \sum_{t}^{T} \gamma_{ijk}(t) \log \bar{b}_{ijk}(o_t). \tag{A14}
\]

To have sensible estimates of both \(\bar{w}_{ijk}\) and \(\bar{b}_{ijk}(o_t)\) we must introduce some form of tying. We observe from Equation (A14) that the transform weights are decoupled from the transforms, so they can be estimated independently. We assume \(R_1\) regression classes and a mapping scheme \(g_1(i) = r_1\) which maps state i to class \(r_1\) for the transform weights. Accordingly, we assume \(R_2\) regression classes and a mapping scheme \(g_2(i) = r_2\) for the transforms. Under the presence of tying, Equation (A14) is written as

\[
Q(\pi, \bar{\pi}) \propto \sum_{r_1}^{R_1} \sum_{i: g_1(i) = r_1} \sum_{j}^{N_\omega} \sum_{k}^{N_\lambda} \sum_{t}^{T} \gamma_{ijk}(t) \log \bar{w}_{r_1 k} + \sum_{r_2}^{R_2} \sum_{i: g_2(i) = r_2} \sum_{j}^{N_\omega} \sum_{k}^{N_\lambda} \sum_{t}^{T} \gamma_{ijk}(t) \log \bar{b}_{ijk}(o_t) \tag{A15}
\]

where

\[
\bar{b}_{ijk}(o_t) = \frac{1}{(2\pi)^{n/2}\, |S_{ij}|^{1/2}}\, e^{-\frac{1}{2}(o_t - W_{r_2 k}\hat{\mu}_{ij})' S_{ij}^{-1} (o_t - W_{r_2 k}\hat{\mu}_{ij})} \tag{A16}
\]

and \(W_{r_2 k} = [b_{r_2 k} \mid A_{r_2 k}]\), \(\hat{\mu}_{ij} = [1\ \mu'_{ij}]'\) (\(\mu'_{ij}\) is the transpose of the \(\mu_{ij}\) vector).

In addition, \(\bar{w}_{r_1 k}\) is subject to the constraint

\[
\sum_{k}^{N_\lambda} \bar{w}_{r_1 k} = 1. \tag{A17}
\]

The method of Lagrange multipliers can be used to find a maximum of a function under constraints. We define the augmented term

\[
Q' = \sum_{r_1}^{R_1} \left[ \sum_{i: g_1(i) = r_1} \sum_{j}^{N_\omega} \sum_{k}^{N_\lambda} \sum_{t}^{T} \gamma_{ijk}(t) \log \bar{w}_{r_1 k} + L_{r_1}\left(\sum_{k}^{N_\lambda} \bar{w}_{r_1 k} - 1\right) \right]. \tag{A18}
\]



To find the maximum of \(\bar{w}_{r_1 k}\) we take the derivative of \(Q'\) and equate it to zero:

\[
\frac{dQ'}{d\bar{w}_{r_1 k}} = \frac{1}{\bar{w}_{r_1 k}} \sum_{i: g_1(i) = r_1} \sum_{j}^{N_\omega} \sum_{t}^{T} \gamma_{ijk}(t) + L_{r_1} = 0 \;\Leftrightarrow\; \bar{w}_{r_1 k} = -\frac{\displaystyle\sum_{i: g_1(i) = r_1} \sum_{j}^{N_\omega} \sum_{t}^{T} \gamma_{ijk}(t)}{L_{r_1}}. \tag{A19}
\]

Substituting Equation (A19) into Equation (A17) we find

\[
L_{r_1} = -\sum_{i: g_1(i) = r_1} \sum_{j}^{N_\omega} \sum_{k}^{N_\lambda} \sum_{t}^{T} \gamma_{ijk}(t) \tag{A20}
\]

and substituting Equation (A20) into Equation (A19) we finally conclude the following formula for the estimation of the transform weights under tying:

\[
\bar{w}_{r_1 k} = \frac{\displaystyle\sum_{i: g_1(i) = r_1} \sum_{j}^{N_\omega} \sum_{t}^{T} \gamma_{ijk}(t)}{\displaystyle\sum_{i: g_1(i) = r_1} \sum_{j}^{N_\omega} \sum_{k}^{N_\lambda} \sum_{t}^{T} \gamma_{ijk}(t)}. \tag{A21}
\]
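Computationally, Equation (A21) amounts to summing occupancy statistics over states, Gaussians and frames for each transform k and normalizing over k. A minimal numpy sketch, with random counts standing in for the accumulated \(\gamma\) statistics of one weight class:

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy occupancy statistics gamma[i, j, k, t] for one weight class r1:
# 4 states mapped to the class, 2 Gaussians, 3 transforms, 5 frames.
gamma = rng.random((4, 2, 3, 5))

# Equation (A21): sum gamma over states, Gaussians and frames for each
# transform k (numerator), then normalise over k (denominator).
counts = gamma.sum(axis=(0, 1, 3))   # one accumulated count per transform k
weights = counts / counts.sum()
```

The denominator is exactly the sum of the numerators over k, so the constraint of Equation (A17) holds by construction.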

For the second term of Equation (A14) we have

\[
\log \bar{b}_{ijk}(o_t) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log|S_{ij}| - \frac{1}{2}(o_t - W_{r_2 k}\hat{\mu}_{ij})' S_{ij}^{-1} (o_t - W_{r_2 k}\hat{\mu}_{ij}). \tag{A22}
\]

To maximize Equation (A14) with respect to the transform \(W_{r_2 k}\), it is necessary to differentiate (A14) with respect to \(W_{r_2 k}\) and equate it to zero. Thus, for a maximum,

\[
\frac{dQ(\pi, \bar{\pi})}{dW_{r_2 k}} = \sum_{i: g_2(i) = r_2} \sum_{j}^{N_\omega} \sum_{t}^{T} \gamma_{ijk}(t) \frac{d}{dW_{r_2 k}}\left\{-\frac{1}{2}(o_t - W_{r_2 k}\hat{\mu}_{ij})' S_{ij}^{-1} (o_t - W_{r_2 k}\hat{\mu}_{ij})\right\} = 0
\]
\[
\Leftrightarrow\; \sum_{i: g_2(i) = r_2} \sum_{j}^{N_\omega} \sum_{t}^{T} \gamma_{ijk}(t)\, S_{ij}^{-1}(o_t - W_{r_2 k}\hat{\mu}_{ij})\,\hat{\mu}'_{ij} = 0. \tag{A23}
\]

Finally, we conclude the following equation for the estimation of the transform \(W_{r_2 k}\):

\[
\sum_{i: g_2(i) = r_2} \sum_{j}^{N_\omega} \sum_{t}^{T} \gamma_{ijk}(t)\, S_{ij}^{-1} o_t \hat{\mu}'_{ij} = \sum_{i: g_2(i) = r_2} \sum_{j}^{N_\omega} \sum_{t}^{T} \gamma_{ijk}(t)\, S_{ij}^{-1} W_{r_2 k} \hat{\mu}_{ij} \hat{\mu}'_{ij}. \tag{A24}
\]

Assuming that \(S_{ij}\ \forall i, j\) is in diagonal form (every non-diagonal element is zero) and that \(W_{r_2 k}\) is full, to estimate \(W_{r_2 k}\) we must solve n + 1 linear systems (where n is the dimension of each speech frame) of n + 1 equations each. The solution of each linear system corresponds to a row of \(W_{r_2 k}\). That is, \(\forall \rho \in [1, n + 1]\) we solve the linear system

\[
G^{(\rho)} w^{(\rho)} = z^{(\rho)} \tag{A25}
\]

where \(G^{(\rho)}\) is an \((n + 1) \times (n + 1)\) matrix with elements

\[
G^{(\rho)}(l, m) = G^{(\rho)}(m, l) = \sum_{i: g_2(i) = r_2} \sum_{j}^{N_\omega} \hat{\mu}^{(l)}_{ij} \hat{\mu}^{(m)}_{ij} \frac{1}{S^{(\rho)}_{ij}} \sum_{t}^{T} \gamma_{ijk}(t) \quad \forall\, l, m \tag{A26}
\]



where \(z^{(\rho)}\) is an \((n + 1) \times 1\) vector with elements

\[
z^{(\rho)}(l) = \sum_{i: g_2(i) = r_2} \sum_{j}^{N_\omega} \hat{\mu}^{(l)}_{ij} \frac{1}{S^{(\rho)}_{ij}} \sum_{t}^{T} \gamma_{ijk}(t)\, o^{(\rho)}_t \quad \forall\, l \tag{A27}
\]

and \(w^{(\rho)}\) is the \(\rho\)th row of the matrix \(W_{r_2 k}\). In Equations (A26) and (A27) a superscript l denotes the lth element of the corresponding vector.

The set of linear systems can be solved using singular value decomposition (SVD), which is a robust method for solving linear systems. If a matrix \(G^{(\rho)}\) is close to singular, SVD will diagnose the problem and in most cases sufficiently remedy it. The superiority of SVD over other numerical algorithms (such as LU decomposition) was proven in practice.
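The per-row solve can be sketched with numpy's SVD-based least-squares routine. This is a minimal sketch with toy dimensions; the G and z values are random stand-ins for the statistics accumulated by Equations (A26) and (A27):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4                                   # toy frame dimension

# One symmetric system per row rho of the transform; random stand-in here,
# made well conditioned so the solve is exact.
A = rng.random((n + 1, n + 1))
G = A @ A.T + np.eye(n + 1)             # symmetric positive definite
z = rng.random(n + 1)

# np.linalg.lstsq uses an SVD-based solver; the rcond cutoff discards tiny
# singular values, which is precisely the remedy for a near-singular G.
w_row, *_ = np.linalg.lstsq(G, z, rcond=1e-10)
```

Repeating this solve for every \(\rho\) assembles the rows of \(W_{r_2 k}\).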