
Speech Communication 52 (2010) 213–222

Multi-stream speech recognition based on Dempster–Shafer combination rule

Fabio Valente

IDIAP Research Institute, CH-1920 Martigny, Switzerland

Received 4 February 2009; received in revised form 23 August 2009; accepted 16 October 2009

Abstract

This paper aims at investigating the use of the Dempster–Shafer (DS) combination rule for multi-stream automatic speech recognition. The DS combination is based on a generalization of the conventional Bayesian framework. The main motivation for this work is the similarity between the DS combination and the findings of Fletcher on human speech recognition. Experiments are based on the combination of several Multi Layer Perceptron (MLP) classifiers trained on different representations of the speech signal. The TANDEM framework is adopted in order to use the MLP outputs in conventional speech recognition systems. We exhaustively investigate several methods for applying the DS combination to multi-stream ASR. Experiments are run on small and large vocabulary speech recognition tasks and aim at comparing the proposed technique with other frame-based combination rules (e.g. inverse entropy). Results reveal that the proposed method outperforms conventional combination rules in both tasks. Furthermore, we verify that the performance of the combined feature stream is never inferior to the performance of the best individual feature stream. We conclude the paper discussing other applications of the DS combination and possible extensions.

© 2009 Elsevier B.V. All rights reserved.

Keywords: TANDEM features; Multi Layer Perceptron; Multi-stream speech recognition; Inverse-entropy combination

1. Introduction

Multi-band and multi-stream (Bourlard and Dupont, 1996; Hermansky et al., 1996) speech recognition are based on the combination of information obtained from several bands or feature streams. The rationale behind multi-stream is that noise or mismatch do not affect the different feature sets in similar ways. The use of many features obtained according to different representations of the speech signal can significantly enhance recognition robustness in adverse or mismatched acoustic conditions. The combination can happen at different levels, e.g. feature level, Hidden Markov Model (HMM) states, syllables or sentences.

Multi-band and multi-stream ASR are motivated by the work of Fletcher (1953) on human perception. According to Fletcher (1953) and Allen (2005), humans process speech in different frequency sub-bands independently and make a decision by combining information coming from each sub-band. In (Fletcher, 1953), it is argued that the total error is equal to the product of the errors in each independent sub-band. This means that to recognize a speech sound correctly, it is enough to recognize it correctly in one of the available sub-bands. Extended to the multi-stream framework, the ideal classifier combination should classify a speech sound correctly if it is recognized correctly in at least one of the streams, even if the others are corrupted or unreliable.

0167-6393/$ - see front matter © 2009 Elsevier B.V. All rights reserved.
doi:10.1016/j.specom.2009.10.002
E-mail address: [email protected]

This paper is based on previous works (Bourlard and Dupont, 1996; Hermansky et al., 1996) that have addressed the problem of combining the outputs of classifiers (typically a Multi Layer Perceptron, MLP) trained using different input features. MLPs are trained in order to discriminate between phonemes (Bourlard and Morgan, 1994) and their output can be interpreted as the posterior probability of a phoneme conditioned on the acoustic observation. Posterior estimates from multiple classifiers can be combined according to probabilistic rules. Ideally, the combined posterior stream should produce results never inferior to those of the best individual feature stream.

In the literature a large number of combination rules have been proposed: sum, product, maximum and minimum rules (see Kittler et al., 1998) and inverse entropy weighting (Misra et al., 2003). Sum and product combinations are discussed and motivated in (Kittler et al., 1998) and are based on the fact that classifier outputs may or may not be independent. In (Misra et al., 2003), a combination based on the entropy of the classifier output has been proposed, referred to as inverse entropy combination.

If enough in-domain training data are available, the combination can also happen through the use of another classifier, trained to produce a single posterior estimate (Hermansky and Malayath, 1998), or through a classifier that selects the most robust streams (Kleinschmidt, 2002). The main weakness of classifier-based combination is its lack of robustness to mismatches. In fact, if the testing conditions do not match the conditions in which the merger classifier has been trained, the combination typically produces poor results. Hence the interest in frame-based rules that combine streams without any further training and dynamically adjust the feature weights.

Extensions of the multi-stream framework have been proposed for large vocabulary applications. In particular, the combination of classifiers trained on short-term and long-term speech features has proven effective in several LVCSR tasks (Stolcke et al., 2006; Morgan et al., 2004; Plahl et al., 2009).

This paper aims at investigating a combination rule which is inspired by the human way of processing separate feature streams (Fletcher, 1953) and cannot be reduced to a linear or log-linear weighting of phoneme posterior estimates. The combination is based on the Dempster–Shafer (DS) theory of evidence (see Shafer et al., 1976), which can be considered an extension of conventional Bayesian probability. In DS theory, the combination of information happens through the orthogonal sum of basic probability assignments. The main motivation for this work is the similarity between the rule derived by Fletcher and the DS orthogonal sum.

We discuss here several ways of applying the DS combination rule to the output of MLP classifiers. Expanding our previous work (Valente and Hermansky, 2007; Valente et al., 2007), we study its performance extensively on small and large vocabulary speech recognition tasks and compare it with conventional frame-based rules proposed in the literature. Two different sets of experiments on small and large vocabulary tasks are proposed for this purpose. Experiments will reveal that the DS combination produces error rates never higher than those of the best feature stream, even when the performance of the individual streams is significantly different. This is not always verified for conventional frame-based combination rules.

The remainder of the paper is organized as follows: Section 2 briefly introduces multi-stream ASR and conventional combination rules. Section 3 presents Dempster–Shafer theory and the Dempster–Shafer orthogonal sum. Section 4 draws a parallel with the work of Fletcher on human speech recognition. In Section 5 we describe several methods for applying the DS combination to the output of MLP classifiers. Section 7 describes experiments using the proposed rules and Section 8 concludes the paper.

2. Multi-stream ASR

Multi-stream speech recognition takes advantage of information coming from different feature streams. This work is based on the same setting as (Bourlard and Dupont, 1996; Hermansky et al., 1996).

Whenever multiple feature sets are available, an independent MLP is trained for each of them in order to discriminate between phonemes. The training is done using back-propagation, minimizing the cross entropy between the MLP outputs and the phonetic targets (Bourlard and Morgan, 1994). The MLP output can thus be considered an estimate of the phoneme posterior probability conditioned on the acoustic observation vector. We will limit our investigation to just two feature streams without loss of generality.

Let us denote with $\Theta = \{\theta_1, \dots, \theta_k\}$ a set of $k$ phonemes and with $X_a = \{x_{at}\}$ and $X_b = \{x_{bt}\}$ two different feature sets, where $t$ denotes a time index. Given two MLPs trained on $X_a$ and $X_b$, they will produce phoneme posterior probabilities $p^i_{at} = p(\theta_i|x_{at})$ and $p^i_{bt} = p(\theta_i|x_{bt})$, i.e. the posterior distribution of phoneme $\theta_i$ at time $t$ given the input vectors $x_{at}$ and $x_{bt}$.

Sum and product rules previously proposed in the literature produce the following posterior estimates $p^i_{ct}$:

$p^i_{ct} = (p^i_{at} + p^i_{bt})/Z_t$   (1)
$p^i_{ct} = (p^i_{at} \cdot p^i_{bt})/Z_t$   (2)

where $Z_t$ is a normalization constant.

In (Misra et al., 2003), it was observed that the entropy of the MLP output, $H(p) = -\sum_i p_i \log(p_i)$, increases with the level of noise, meaning that the posterior estimate $p(\Theta|X)$ converges towards a uniform, non-informative distribution over the phonemes. Thus the entropy values $H(p_{at}) = -\sum_i p^i_{at} \log p^i_{at}$ and $H(p_{bt}) = -\sum_i p^i_{bt} \log p^i_{bt}$ can provide a confidence measure related to how the feature streams $X_a$ and $X_b$ are affected by noise or mismatch. Those findings inspired two linear weighting schemes referred to as minimum entropy and inverse entropy combinations (Misra et al., 2003).

Linear combinations produce the following posterior estimate:

$p^i_{ct} = w_{at} \cdot p^i_{at} + w_{bt} \cdot p^i_{bt}$   (3)


where $w_{at}, w_{bt} \geq 0$ and $w_{at} + w_{bt} = 1$. In minimum entropy combination (Misra et al., 2003), the stream with the minimum entropy receives weight one, i.e.

$w_{at} = 1,\; w_{bt} = 0$ if $H(p_{at}) < H(p_{bt})$
$w_{at} = 0,\; w_{bt} = 1$ if $H(p_{at}) > H(p_{bt})$   (4)

This is equivalent to selecting the feature stream with the lowest entropy, thus the more confident one. If $H(p_{at}) = H(p_{bt})$, the method randomly selects one of the streams. In inverse entropy combination, the weights are set inversely proportional to the value of the entropy, i.e.

$w_{at} = \dfrac{1/H(p_{at})}{1/H(p_{at}) + 1/H(p_{bt})}, \quad w_{bt} = \dfrac{1/H(p_{bt})}{1/H(p_{at}) + 1/H(p_{bt})}.$   (5)

Contrary to minimum entropy, which makes a "hard" decision, inverse entropy gives the highest weight to low-entropy distributions in a "soft" way.

Sum and inverse entropy are linear combination rules. The product rule can be considered a log-linear average of phoneme posteriors.

If $w_{at} = w_{bt} = 0.5$, inverse entropy is equivalent to the sum rule, i.e. the combination simply consists in averaging the posterior estimates. The performance of inverse entropy combination can be further improved using a set of static or dynamic thresholds, as described in (Misra et al., 2003). Inverse entropy has proven effective both in small and large vocabulary systems (Morgan et al., 2004). The combined posterior estimates $p^i_t$ can be directly used for speech recognition using hybrid decoding (Bourlard and Morgan, 1994) or used as features in conventional Hidden Markov Model/Gaussian Mixture Model (HMM/GMM) systems through the TANDEM scheme (Hermansky et al., 2000). In the upcoming derivations, we will drop the time index $t$ to simplify the notation.
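For concreteness, the frame-based rules of Eqs. (1)–(5) can be sketched as follows. This is an illustrative implementation, not the authors' code; the function names are ours and the inputs are assumed to be normalized posterior vectors.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a posterior vector (natural log)."""
    p = np.clip(p, 1e-12, 1.0)  # avoid log(0)
    return float(-np.sum(p * np.log(p)))

def combine(pa, pb, rule="inverse_entropy"):
    """Frame-level combination of two phoneme posterior vectors
    following Eqs. (1)-(5): sum, product, minimum-entropy and
    inverse-entropy rules."""
    pa, pb = np.asarray(pa, float), np.asarray(pb, float)
    if rule == "sum":
        pc = pa + pb                                   # Eq. (1), up to Z_t
    elif rule == "product":
        pc = pa * pb                                   # Eq. (2), up to Z_t
    elif rule == "min_entropy":
        pc = pa if entropy(pa) < entropy(pb) else pb   # Eq. (4)
    elif rule == "inverse_entropy":
        ia, ib = 1.0 / entropy(pa), 1.0 / entropy(pb)
        wa, wb = ia / (ia + ib), ib / (ia + ib)        # Eq. (5)
        pc = wa * pa + wb * pb                         # Eq. (3)
    else:
        raise ValueError(rule)
    return pc / pc.sum()                               # normalization Z_t

# A confident stream and a flatter (less confident) one:
pa = [0.8, 0.1, 0.1]
pb = [0.4, 0.3, 0.3]
print(combine(pa, pb, "inverse_entropy"))
```

With these inputs the inverse-entropy weights favor the low-entropy stream, so the combined posterior lies closer to `pa` than a plain average would.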

In the following we briefly introduce DS theory and the related combination rules, which cannot be reduced to any of the previous linear or log-linear combinations.

3. The Dempster–Shafer theory of evidence

The Dempster–Shafer (DS) theory of evidence (see Shafer et al., 1976) allows the representation and combination of different measures of evidence. It can be considered a generalization of the Bayesian framework and permits the characterization of uncertainty and ignorance.

Let $\Theta = \{\theta_1, \dots, \theta_k\}$ be a finite set of mutually exclusive and exhaustive hypotheses, referred to as singletons. $\Theta$ is referred to as the frame of discernment. Let $2^\Theta$ be the power set of $\Theta$, i.e. the set of all subsets of $\Theta$. A basic probability assignment (BPA) is a function $m$ from $2^\Theta$ to $[0,1]$ such that

$m: 2^\Theta \to [0,1], \quad \sum_{A \subseteq \Theta} m(A) = 1 \quad \text{and} \quad m(\emptyset) = 0$   (6)

$m(A)$ can be interpreted as the amount of belief that is assigned exactly to $A$ and not to any of its subsets. In probability theory, a measure is assigned only to atomic hypotheses $m(\theta_i)$, while in DS theory it can be assigned to a set $A$ without any further commitment on the atomic hypotheses that compose $A$.

Let $\neg A$ be the complementary set of $A$, i.e. the set $\Theta - A$. In DS theory, $m(A) + m(\neg A)$ can be smaller than 1 (contrary to probability theory), which means that we can consider an amount of belief that is attributed neither to a hypothesis nor to its negation. In other words, "we don't need to over-commit when we are ignorant".

The situation of total ignorance is represented by $m(\Theta) = 1$, where all the mass is assigned to the entire hypothesis set without commitment on its subsets. On the other hand, if we set $m(\theta_i) \neq 0$ only for singletons $\theta_i$ and $m(A) = 0$ for all $A \neq \theta_i$, we recover the conventional Bayesian framework.

The belief function or credibility of $A$ is the function that assigns to each subset $A$ the sum of the basic probability numbers of all its subsets:

$Bel(A) = \sum_{B \subseteq A} m(B)$   (7)

Subsets $A$ for which $m(A) > 0$ are called focal elements and their union is called the core. A belief function is called vacuous if it has only $\Theta$ as a focal element, a simple support function if it has only one focal element in addition to $\Theta$, and Bayesian if its focal elements are singletons.

Another interesting point in DS theory is how two different belief functions $Bel_1$ and $Bel_2$ over the same frame of discernment are combined into a single belief function. Dempster's rule states that $Bel_1$ and $Bel_2$ are combinable if their cores are not disjoint. Given the BPAs $m_1$ and $m_2$ associated with $Bel_1$ and $Bel_2$, this condition can be expressed as $\sum_{A \cap B = \emptyset} m_1(A) m_2(B) < 1$. In this case $m_1$ and $m_2$ can be combined as:

$m(\emptyset) = 0, \quad m(\theta) = \dfrac{\sum_{A \cap B = \theta} m_1(A) m_2(B)}{1 - \sum_{A \cap B = \emptyset} m_1(A) m_2(B)}$   (8)

The combination $m(\theta)$ is also a BPA. The belief function given by $m$ is called the orthogonal sum of $Bel_1$ and $Bel_2$, denoted $Bel_1 \oplus Bel_2$ ($m$ is likewise denoted $m_1 \oplus m_2$). The DS orthogonal sum is both associative and commutative. Given two belief functions $Bel_1$ and $Bel_2$, if $Bel_1$ is vacuous, then $Bel_1 \oplus Bel_2 = Bel_2$; if $Bel_1$ is Bayesian, then $Bel_1 \oplus Bel_2$ is also Bayesian.

Let us now consider the orthogonal sum of two simple support belief functions $Bel_1$ and $Bel_2$ with focus $A \neq \Theta$, i.e. $m_1(A) = s_1$, $m_1(\Theta) = 1 - s_1$, $m_2(A) = s_2$, $m_2(\Theta) = 1 - s_2$. Applying the DS orthogonal sum (8), we obtain:

$m(\Theta) = (1 - s_1)(1 - s_2), \quad m(A) = 1 - (1 - s_1)(1 - s_2)$   (9)

In words, in the case of simple support belief functions, the "ignorance" $m(\Theta)$ is the product of the ignorances of the single beliefs. In the next section, we draw a parallel with the product of errors.
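As a worked check of Eq. (9), the orthogonal sum of two simple support BPAs with a common focus $A$ can be computed by enumerating the four intersections in Eq. (8). This is a minimal sketch of our own; since $A \cap \Theta = A$, there is no conflict and the normalizer is one.

```python
def orthogonal_sum_simple(s1, s2):
    """DS orthogonal sum of two simple support BPAs focused on the
    same set A: m(A) = s, m(Theta) = 1 - s. Returns (m(A), m(Theta))
    as in Eq. (9)."""
    # Intersections: A∩A = A, A∩Theta = A, Theta∩A = A, Theta∩Theta = Theta.
    m_A = s1 * s2 + s1 * (1 - s2) + (1 - s1) * s2
    m_Theta = (1 - s1) * (1 - s2)   # ignorance = product of ignorances
    return m_A, m_Theta

m_A, m_Theta = orthogonal_sum_simple(0.9, 0.7)
print(m_A, m_Theta)
# Closed form of Eq. (9): m(A) = 1 - (1 - s1)(1 - s2)
assert abs(m_A - (1 - (1 - 0.9) * (1 - 0.7))) < 1e-12
```

Note that a strong belief in either source already yields a strong combined belief, mirroring the "one good sub-band suffices" behavior discussed in the next section.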

4. Product of errors

In (Fletcher, 1953), Fletcher investigated human recognition of nonsense syllables and defined the term articulation as the probability of correct recognition for nonsense speech sounds. It was verified that after high-pass and low-pass filtering, the respective articulations $s_L$ and $s_H$ do not sum to the articulation of the wide-band signal $s$, i.e. $s \neq s_L + s_H$. Thus the existence of a non-linear transform $F$ was hypothesized such that the additive relation

$F(s) = F(s_L) + F(s_H)$   (10)

is verified. The cut-off frequency $f^*$ at which the articulations are equal, i.e. $s_L(f^*) = s_H(f^*)$, was empirically found through experiments on human speech recognition. At this frequency the articulations are similar, thus it is possible to write:

$0.5\,F(s(f^*)) = F(s_L(f^*)) = F(s_H(f^*))$   (11)

Fletcher empirically identified the functional form of $F(\cdot)$ as given by:

$F(s) = \dfrac{\log(1-s)}{\log(1-0.985)} = \gamma \log(1-s) \quad \text{with} \quad \gamma = \dfrac{1}{\log(1-0.985)}$   (12)

where 0.985 is the maximum observable articulation in ideal conditions. The function $F(\cdot)$ was referred to as the articulation index. Using expression (12) in expression (11), it is possible to derive the following relation:

$(1 - s) = (1 - s_L)(1 - s_H)$   (13)

i.e. the total articulation error is equal to the product of the articulation errors in each individual sub-band. This expression is also known as the "product of errors". For instance, if the error in one band is equal to 0.9 and 0.1 in the other band, the total recognition error will be $0.9 \times 0.1 = 0.09$. In other words, the total error is close to the lowest error obtained through the individual bands.
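Spelled out, Eq. (13) follows by substituting the articulation index (12) into the additivity relation (10):

```latex
F(s) = F(s_L) + F(s_H)
  \;\Rightarrow\; \gamma \log(1-s) = \gamma \log(1-s_L) + \gamma \log(1-s_H)
  \;\Rightarrow\; \log(1-s) = \log\big[(1-s_L)(1-s_H)\big]
  \;\Rightarrow\; (1-s) = (1-s_L)(1-s_H).
```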

Later, Allen (2005) interpreted this as each phoneme being independently recognized in each sub-band, with the independent phone estimates then combined to yield the final recognition. Thus, in order to recognize a phoneme correctly, it is enough to recognize it correctly in one of the available sub-bands. Those relations were later extended to more than two sub-bands.

It is possible to notice the similarity between expressions (13) (derived from human studies) and (9) (derived through the Dempster–Shafer combination rule).

Related works on multi-stream and multi-band ASR have typically considered linear and log-linear weighting of the different streams. If the DS combination is used under the assumption of simple support belief functions, the rule is similar to what has been observed in human perception. The appeal of using such a rule lies in achieving performance always better than or equal to that of the best individual stream. This work is mainly motivated by the similarity between the DS combination rule and Fletcher's product of errors rule. In the next section we discuss how to transform the MLP output into a Basic Probability Assignment (BPA).

5. From posteriors to belief

The DS combination has already been largely studied in the machine learning community for combining the probabilistic output of neural networks (e.g., see Xu et al., 1992; Mandler and Schurman, 1988).

As before, let us consider the output of an MLP trained to estimate phoneme posteriors $\{p^1_t = p(\theta_1|x_t), \dots, p^k_t = p(\theta_k|x_t)\}$ with $\sum_i p^i = 1$, where $x_t$ is an observation vector at time $t$. In order to apply the DS combination, the MLP outputs must be converted into Basic Probability Assignments (BPA).

If we consider each MLP output $p^i$ as a source of information on phoneme $\theta_i$, the focal elements of the corresponding BPA are: $m_i(\theta_i)$, i.e. the belief we have in the hypothesis associated with the $i$th phoneme; $m_i(\neg\theta_i)$, i.e. the belief we have in the complementary set of this hypothesis; and $m_i(\Theta)$, i.e. the ignorance associated with this hypothesis. The time index $t$ is dropped to simplify the notation. Furthermore, the following relation must be satisfied:

$m_i(\theta_i) + m_i(\neg\theta_i) + m_i(\Theta) = 1$   (14)

As in the case of inverse entropy combination, we make the hypothesis here that the BPAs are related to the confidence of the classifier, measured as a function of the MLP output entropy $H = -\sum_i^k p^i \log(p^i)$ at a given time instant. Intuitively, we can expect the ignorance to be total (i.e. $m(\Theta) = 1$) when the entropy of the output achieves its maximum value $H_{max} = -\sum_i^k \frac{1}{k} \log\frac{1}{k}$, i.e. when the phoneme posterior distribution is uniform over all the phonemes. Similarly, we can expect the ignorance to be zero ($m(\Theta) = 0$) when the entropy is zero, $H = 0$, i.e. when the classifier is confident in the classification of a given phoneme.

Under those considerations, three possible choices of BPAs arise:

1. A first set of BPAs can be obtained by setting:

$m_i(\theta_i) = \alpha p^i, \quad m_i(\neg\theta_i) = \alpha \sum_{j \neq i} p^j$   (15)
$m_i(\Theta) = 1 - m_i(\theta_i) - m_i(\neg\theta_i) = 1 - \alpha$   (16)
with $\alpha = \left(1 - \dfrac{H}{H_{max}}\right)^{\gamma}$   (17)


Contrary to probability theory, $m_i(\theta_i) + m_i(\neg\theta_i) = \alpha \neq 1$ because a certain amount of belief is supposed to be assigned to the entire phoneme set $\Theta$. If the entropy of the classifier is zero, we obtain $m_i(\theta_i) = p^i$, $m_i(\neg\theta_i) = \sum_{j \neq i} p^j$ and $m(\Theta) = 0$. On the other hand, if $H = H_{max}$, we have $m_i(\theta_i) = m_i(\neg\theta_i) = 0$, i.e. $m(\Theta) = 1$, which is the situation of total ignorance on the phoneme set $\Theta$. The exponent factor $\gamma$ is supposed to introduce an extra degree of freedom because the relation between the related arguments may not be linear (see Section 7.3). It can be interpreted as a multiplicative factor in the log-domain like the one in expression (12). We will verify in the experiments section that the combination is effective for all $\gamma < 1$.
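A minimal sketch of this first BPA construction (Eqs. (15)–(17)) follows; this is our own illustration, and the clipping of the posteriors and the guard against a slightly negative entropy ratio are numerical assumptions, not part of the formulation.

```python
import numpy as np

def bpa_from_posteriors(p, gamma=1e-3):
    """Convert an MLP posterior vector p into the first BPA set
    (Eqs. (15)-(17)):
      m_i(theta_i)  = alpha * p_i
      m_i(~theta_i) = alpha * sum_{j != i} p_j
      m_i(Theta)    = 1 - alpha
    with alpha = (1 - H / Hmax) ** gamma."""
    p = np.clip(np.asarray(p, float), 1e-12, 1.0)
    H = float(-np.sum(p * np.log(p)))
    Hmax = np.log(len(p))                     # entropy of the uniform distribution
    alpha = max(0.0, 1.0 - H / Hmax) ** gamma # confidence, Eq. (17)
    m_theta = alpha * p                       # m_i(theta_i), Eq. (15)
    m_not = alpha * (1.0 - p)                 # m_i(~theta_i) = alpha * sum_{j!=i} p_j
    m_ignorance = 1.0 - alpha                 # m_i(Theta), Eq. (16)
    return m_theta, m_not, m_ignorance

# A confident output yields small ignorance; a flatter output, larger ignorance.
_, _, ig_confident = bpa_from_posteriors([0.98, 0.01, 0.01])
_, _, ig_flat = bpa_from_posteriors([0.34, 0.33, 0.33])
print(ig_confident, ig_flat)
```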

2. A second set of BPAs can be obtained by considering two simple support belief functions, i.e.:

$m_i(\theta_i) = \alpha p^i, \quad m_i(\Theta) = 1 - \alpha p^i = 1 - m_i(\theta_i)$   (18)
with $\alpha = \left(1 - \dfrac{H}{H_{max}}\right)^{\gamma}$   (19)

In this case the BPA does not use the information obtained on the complementary set $\neg\theta_i$. The combination of simple support belief functions leads to a "product of errors"-like combination, as shown in Eq. (9).

3. A third set of BPAs can be directly derived from the orthogonal sum of the BPAs (18) coming from the same MLP output. In fact, the BPAs from each MLP output as defined in (18) are combinable; applying the orthogonal sum (8) ($\oplus_i m_i$), a new set of BPAs can be directly obtained:

$m_i(\theta_i) = \alpha p^i$   (20)
$m(\theta_i) = m_i(\theta_i) \prod_{j \neq i} (1 - m_j(\theta_j)) / Z$   (21)
$m(\neg\theta_i) = (1 - m_i(\theta_i)) \prod_{j \neq i} (1 - m_j(\theta_j)) / Z$   (22)
$m(\Theta) = \prod_j (1 - m_j(\theta_j)) / Z$   (23)
$Z = 1 - m_i(\theta_i) \left(1 - \prod_{j \neq i} (1 - m_j(\theta_j))\right)$   (24)
with $\alpha = \left(1 - \dfrac{H}{H_{max}}\right)^{\gamma}$   (25)

This combination has been proposed in (Galina, 1994) for handwriting recognition applications.

6. Combination

Let us now consider the combination of two independent MLPs trained on two different feature streams $x_a$ and $x_b$. Their corresponding BPAs $m_a$ and $m_b$ can be obtained in the three different ways described by expressions (15)-(16), (18) and (20)-(24). The BPAs $m_a$ and $m_b$ can now be combined by applying the Dempster–Shafer orthogonal sum (8). Let us consider the three cases separately.

1. In the case of BPAs (15)-(16), the combination rule gives:

$m(\theta_i) = \{m_a(\theta_i) m_b(\theta_i) + m_a(\theta_i) m_b(\Theta) + m_b(\theta_i) m_a(\Theta)\}/Z$   (26)
$m(\neg\theta_i) = \{m_a(\Theta) m_b(\neg\theta_i) + m_b(\Theta) m_a(\neg\theta_i)\}/Z$   (27)
$m(\Theta) = \{m_a(\Theta) m_b(\Theta)\}/Z$   (28)
$Z = 1 - m_a(\neg\theta_i) m_b(\theta_i) - m_b(\neg\theta_i) m_a(\theta_i)$   (29)

Expanding expression (26), we obtain:

$m(\theta_i) = \{\alpha_a \alpha_b (p^i_a p^i_b - p^i_a - p^i_b) + \alpha_a p^i_a + \alpha_b p^i_b\}/Z$   (30)

where $\alpha_a = (1 - H(p_a)/H_{max})^{\gamma}$ and $\alpha_b = (1 - H(p_b)/H_{max})^{\gamma}$. It is straightforward to verify that this rule cannot be reduced to a linear or log-linear weighting of the individual streams. In the following we will refer to this combination simply as the DS combination.

2. Let us now consider BPAs defined as in expression (18); applying the orthogonal sum we obtain:

$m(\Theta) = m_a(\Theta) m_b(\Theta) = (1 - \alpha_a p^i_a)(1 - \alpha_b p^i_b)$   (31)
$m(\theta_i) = m_a(\theta_i) m_b(\theta_i) + m_a(\theta_i) m_b(\Theta) + m_b(\theta_i) m_a(\Theta) = 1 - (1 - \alpha_a p^i_a)(1 - \alpha_b p^i_b)$   (32)

the normalization constant $Z$ being equal to one. This expression leads to a "product of errors"-like combination, as in Eq. (9); thus in the following we will refer to it as the "product of errors" (PoE) combination.

3. In the case of BPAs defined as in (20)-(24), the combination can be obtained by applying the DS orthogonal sum as in expressions (26)-(29). Given that this method makes double use of the orthogonal sum (once to combine BPAs from the same MLP and once to combine BPAs from two different MLPs), we will refer to it as the Double combination.

The values $m(\theta_i)$ can be directly used in the TANDEM system for performing speech recognition.
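The expanded DS combination of Eq. (30) can be sketched as follows. This is illustrative only: for simplicity we renormalize the final vector over the classes rather than applying the per-class normalizer $Z$ of Eq. (29), and the numerical guards are our own assumptions.

```python
import numpy as np

def ds_combine(pa, pb, gamma=1e-3):
    """Dempster-Shafer combination of two phoneme posterior vectors
    using the first BPA set, expanded form of Eq. (30):
      m(theta_i) ∝ a_a*a_b*(pa_i*pb_i - pa_i - pb_i) + a_a*pa_i + a_b*pb_i
    The per-class Z of Eq. (29) is replaced by a global renormalization
    (a simplification for illustration)."""
    def alpha(p):
        p = np.clip(np.asarray(p, float), 1e-12, 1.0)
        H = float(-np.sum(p * np.log(p)))
        return max(0.0, 1.0 - H / np.log(len(p))) ** gamma  # Eq. (17)
    pa, pb = np.asarray(pa, float), np.asarray(pb, float)
    aa, ab = alpha(pa), alpha(pb)
    m = aa * ab * (pa * pb - pa - pb) + aa * pa + ab * pb   # Eq. (30) numerator
    return m / m.sum()

# Stream b is flat (low confidence): the combination stays close to
# the confident stream a.
print(ds_combine([0.8, 0.1, 0.1], [0.34, 0.33, 0.33]))
```

Note that each component can be rewritten as $\alpha_a p^i_a (1-\alpha_b) + \alpha_b p^i_b (1-\alpha_a) + \alpha_a \alpha_b p^i_a p^i_b$, which is non-negative and reduces to the product rule when both classifiers are fully confident ($\alpha_a = \alpha_b = 1$).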

7. Experiments

In the following, we investigate the use of DS theory for combining the output of MLP classifiers in small and large vocabulary speech recognition tasks. The experiments aim at comparing DS with other rules like the sum, product, minimum entropy and inverse entropy combinations. If the combination scheme is effective, the results should not be worse than the performance of the best individual stream, similarly to human processing of separate sub-bands.

The feature streams used for testing the combination are obtained by training a Multi Layer Perceptron on different speech signal representations. As described in several works in the literature (Zhu et al., 2004; Plahl et al., 2009), we decided to combine classifiers trained on conventional short-term spectral features and on long-term features that capture the temporal modulations. The complementarity of those two feature types has proven effective in small and large vocabulary tasks (Zhu et al., 2004; Plahl et al., 2009). The MLP is trained using back-propagation in order to minimize the cross entropy between its output and the phonetic targets (Bourlard and Morgan, 1994). Two separate MLPs are trained using different temporal contexts: a short temporal context (9-frames PLP, Hermansky et al., 2000) and a long temporal context (a.k.a. MRASTA, Hermansky and Fousek, 2005).

MRASTA processing consists in filtering one-second critical band energy trajectories with a set of zero-mean multiple-resolution filters. Feature extraction is composed of the following parts: the critical band auditory spectrum is extracted from the short-time Fourier transform of the signal every 10 ms. A one-second-long temporal trajectory in each critical band is filtered with a bank of band-pass filters. Fifteen critical bands are used. Those filters represent first derivatives $G1 = [g1_{\sigma_i}]$ (Eq. (33)) and second derivatives $G2 = [g2_{\sigma_i}]$ (Eq. (34)) of Gaussian functions with variance $\sigma_i$ varying in the range 8-130 ms (see Fig. 1). In effect, the MRASTA filters are multi-resolution band-pass filters on modulation frequency, dividing the available modulation frequency range into individual sub-bands:

$g1_{\sigma_i}(x) \propto -\dfrac{x}{\sigma_i^2} \exp(-x^2/(2\sigma_i^2))$   (33)
$g2_{\sigma_i}(x) \propto \left(\dfrac{x^2}{\sigma_i^4} - \dfrac{1}{\sigma_i^2}\right) \exp(-x^2/(2\sigma_i^2))$   (34)

with $\sigma_i = \{0.8, 1.2, 1.8, 2.7, 4, 6\}$.

In the modulation frequency domain, they correspond to a filter-bank with equally spaced filters on a logarithmic scale. Identical filters are used for all critical bands. Thus, they provide a multiple-resolution representation of the time-frequency plane.
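As an illustration (not the authors' code), a filter bank of the form of Eqs. (33) and (34) can be generated as follows; the 101-frame window, the interpretation of $\sigma$ in 10-ms frame units and the explicit zero-mean normalization are our assumptions.

```python
import numpy as np

def mrasta_filters(sigmas=(0.8, 1.2, 1.8, 2.7, 4.0, 6.0), half_span=50):
    """Sketch of an MRASTA-style filter bank (Eqs. (33)-(34)): first (g1)
    and second (g2) derivatives of Gaussians over a one-second window
    (101 frames at 10 ms); sigma is in frames. The equations fix the
    filters only up to a proportionality factor, so the scale here is
    illustrative; each filter is explicitly forced to zero mean."""
    x = np.arange(-half_span, half_span + 1, dtype=float)
    g1, g2 = [], []
    for s in sigmas:
        env = np.exp(-x**2 / (2 * s**2))
        f1 = -x / s**2 * env                      # Eq. (33); odd, hence zero-mean
        f2 = (x**2 / s**4 - 1.0 / s**2) * env     # Eq. (34)
        g1.append(f1 - f1.mean())                 # enforce zero mean explicitly
        g2.append(f2 - f2.mean())
    return np.array(g1), np.array(g2)

G1, G2 = mrasta_filters()
print(G1.shape, G2.shape)   # (6, 101) (6, 101)
```

Each row is one band-pass filter applied along the temporal trajectory of a critical band; larger $\sigma_i$ values select slower modulations.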

Those two posterior estimates are then combined together, transformed according to TANDEM processing (Hermansky et al., 2000) and used as features in a conventional HMM/GMM system. TANDEM consists in applying a Log/KLT transform on the phoneme posteriors for gaussianization and decorrelation: the transformed posteriors can be used as features in conventional HMM/GMM systems. The process is depicted in Fig. 2.

Fig. 1. Set of temporal filters obtained by first (G1, left picture) and second (G2, right picture) order derivation of a Gaussian function, as from (Hermansky and Fousek, 2005).

The investigation is limited to only two feature streams, as typically done in LVCSR systems. In Section 8 we will discuss the extension to many feature streams.

We propose two sets of experiments for comparing the proposed rules with conventional combination frameworks. The first set of experiments is done on a small vocabulary digit recognition task in which the performance of one of the streams is artificially lowered by applying pre-emphasis to the speech signal. The second set of experiments is run on a large vocabulary meeting transcription task in which noises affect the performance of the individual feature streams in different ways.

7.1. Experiment I

The database used for the recognition experiments is OGI Numbers 95, while the MLPs are trained on 3 h of hand-labeled speech from the OGI-Stories database in order to discriminate between phonemes. To artificially lower the performance of one feature stream, a first-order pre-emphasis filter with coefficient 0.95 is applied to the test data set. MRASTA features are inherently robust to linear distortions of the signal, thus they hold the same performance even after pre-emphasis, while PLP features are strongly affected (see Hermansky and Fousek, 2005). In the case of the DS combination, the value of $\gamma$ is set to $10^{-3}$, as obtained from tuning on an independent data subset in clean conditions (see Section 7.3). Results for matched and mismatched test conditions as well as for the different combination schemes are reported in Table 1. We compare the three Dempster–Shafer based combinations with the other rules: sum, product, inverse entropy and minimum entropy combinations.

In matched conditions, the combination of posteriors always gives better results than each posterior stream independently. Among the combination rules based on traditional probability theory, the product holds the best performance.


Fig. 2. Multi-stream speech recognition diagram. Two separate MLPs are trained using 9-frames PLP and MRASTA input features. Phoneme posterior probabilities are then combined according to the proposed rules. Final posterior estimates are then modified according to a Log/KLT transform and used in conventional HMM/GMM systems as features.

Table 1
WER for different combination rules in matched and mismatched conditions. Sum, Prod (product), IEW (inverse entropy weighting), Min-entropy (minimum entropy), PoE (product of errors), Dempster–Shafer (DS) and double combination.

            MRASTA  9-frames PLP  Sum  Prod  IEW  Min-entropy  PoE  DS   Double
Matched     3.5     3.7           3.0  2.8   2.9  3.0          2.8  2.6  2.8
Mismatched  3.5     9.7           4.1  3.5   3.8  3.5          3.5  3.2  3.5

Table 2
WER for RT05 evaluation data. Sum, Prod (product), IEW (inverse entropy weighting), minimum entropy, PoE (product of errors), Dempster–Shafer (DS) and double combination.

Features      TOT   AMI   CMU   ICSI  NIST  VT
9-frames PLP  46.6  41.4  43.7  31.3  54.5  64.9
MRASTA        45.9  48.0  41.9  37.1  54.4  48.8
Sum           41.5  41.1  37.6  30.4  50.2  49.8
Prod          40.2  39.4  37.1  29.5  48.4  47.7
Min-entropy   41.3  40.4  37.9  29.6  49.1  52.3
Inv-entropy   40.4  39.8  37.0  29.6  48.3  48.7
PoE           40.5  40.1  36.7  29.6  48.8  48.5
DS            39.0  39.0  36.8  28.1  45.8  46.6
Double        40.0  40.2  37.2  29.3  47.4  47.3

F. Valente / Speech Communication 52 (2010) 213–222 219

Among the three proposed combination frameworks, DS performs best, giving a 7% relative improvement in matched conditions over the product rule. The product-of-errors rule and the double combination produce results similar to the product rule.

In mismatched conditions, the inverse-entropy and sum combinations produce an increase in WER w.r.t. the best feature stream, meaning that they are not effective in this case. The DS combination gives a 9% relative improvement w.r.t. the product rule. PoE and the double combination produce WERs similar to the product rule.

In summary, the DS combination outperforms all previously proposed combination rules and shows no degradation even when there is a significant difference in performance between the individual streams.

7.2. Experiment II

In this section, we pursue the investigation in a Large Vocabulary Continuous Speech Recognition (LVCSR) system. Experiments are run on a meeting transcription task and evaluated on the RT05 evaluation data (Rich transcription, 2005). The training data for this system comprise individual headset microphone (IHM) data from four meeting corpora: NIST (13 h), ISL (10 h), ICSI (73 h) and a preliminary part of the AMI corpus (16 h). These data are used for training the MLPs and the HMM/GMM models. Acoustic models are phonetically state-tied triphone models trained using standard HTK maximum likelihood training procedures. The recognition experiments are conducted on the NIST Rich Transcription 05 (RT05) evaluation data, which consist of data coming from the same meeting rooms available in the training set plus Virginia Tech (VT) recordings. We use the reference speech segments provided by NIST for decoding. The pronunciation dictionary is the same

as the one used in the AMI NIST RT05 system (Hain et al., 2005). The challenge of this data set lies in the variety of acoustic environments in which the data have been collected. In the case of the DS combination, the value of γ is set to 1E-3 as in the previous section, i.e. from tuning on clean data from the OGI Numbers95 database.

Table 2 shows results for 9-frames PLP features, MRASTA features and the different combination frameworks. The overall WERs for the two individual streams (MRASTA and 9-frames PLP) are similar, but results change depending on the evaluation data subset: on VT data, MRASTA shows a much lower WER than the 9-frames PLP features; on the other hand, on AMI and ICSI data the situation is reversed.

As before, we compare the three proposed combination rules with the sum, product, inverse-entropy and minimum-entropy combinations.

All combination rules show significantly lower WER with respect to the individual streams. Among the conventional combinations, the product provides the lowest overall


WER, although on the CMU and NIST data sets inverse entropy performs better.

The DS combination yields the best overall performance w.r.t. the other combination techniques. This is verified on all the RT05 data subsets. The improvement ranges from 1.2% absolute w.r.t. the product combination to 2.5% w.r.t. the sum combination. The double combination outperforms the sum, product and inverse-entropy rules but does not provide any improvement w.r.t. the DS combination.

The most interesting case is the VT data, on which one feature stream is much weaker than the other (64.9% against 48.8%). The product and inverse-entropy combinations produce results close to those obtained using the best feature stream, while the sum rule shows performance degradation. On the other hand, the DS combination benefits from both feature streams, providing a WER reduction of 2% w.r.t. the best stream.

These results confirm the observations of the previous section, even in the case of an LVCSR system and real-world data.


Fig. 3. WER as a function of the heuristic factor γ for the Dempster–Shafer combination, plotted on a logarithmic scale, for the test set of the OGI Numbers95 database. Constant lines are WERs for the sum, product and inverse-entropy combination rules.


Fig. 4. WER as a function of the heuristic factor γ for the Dempster–Shafer combination, plotted on a logarithmic scale, for the RT05 evaluation data. Constant lines are WERs for the sum, product and inverse-entropy combination rules.

7.3. Relation between belief and WER

In (Misra et al., 2003), the authors show that the entropy of the MLP output is effective for determining the confidence of a given stream. They heuristically set the stream weights as inversely proportional to the entropy of the MLP output (see Eq. (5)).

We defined the function that maps the entropy to the belief function as:

α = (1 − H/Hmax)^γ                                                        (35)

where H is the entropy of the MLP output at a given time instant, Hmax is the maximum possible entropy and γ is a constant. This expression is meant to give α = 0 if the MLP output is a uniform, non-informative distribution (H = Hmax) and α = 1 if H = 0. Whenever the value of γ is different from one, the relation between the entropy and the belief is non-linear.
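A minimal sketch of this mapping, assuming natural-log entropies and a 40-phoneme output (both illustrative choices):

```python
import numpy as np

def belief_from_entropy(p, gamma=1e-3, eps=1e-12):
    """Belief of Eq. (35): alpha = (1 - H/Hmax)**gamma, where H is the
    entropy of the posterior vector p and Hmax = log(len(p))."""
    h = -np.sum(p * np.log(p + eps))
    h_max = np.log(len(p))
    # clip guards against tiny numerical overshoot around the limits
    return np.clip(1.0 - h / h_max, 0.0, 1.0) ** gamma

# Limit cases, shown with gamma = 1 (linear mapping):
p_uniform = np.ones(40) / 40                  # non-informative output
p_onehot = np.zeros(40); p_onehot[0] = 1.0    # fully confident output
print(belief_from_entropy(p_uniform, gamma=1.0))   # ~0
print(belief_from_entropy(p_onehot, gamma=1.0))    # 1.0
```

With the paper's tuned value γ = 1E-3 the mapping is strongly flattened: α stays close to 1 unless the output is extremely close to uniform, which is exactly the non-linearity discussed here.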

The use of the non-linearity is motivated by the relation between the articulation index F(s) and the probability of error s (see Fletcher, 1953), which has the form:

F(s) = log(1 − s) / log(1 − 0.985) = c · log(1 − s),  with  c = 1/log(1 − 0.985)      (36)

as described in Section 4. The value of the constant c was found by Fletcher through measurements under ideal conditions.

Based on the same considerations, we hypothesize that the relation between the entropy (thus a measure of the MLP error) and the belief may not be linear. We tuned the value of γ on an independent clean data set of the OGI Numbers95 database and kept it fixed over all the proposed experiments. The optimal value of γ was found to be 1E-3.

Fig. 3 plots the WER for the DS combination as a function of the parameter γ, together with the WERs for the other rules. The task is OGI Numbers95 digit recognition (test set) in clean conditions. For values of γ < 1, Dempster–Shafer outperforms all other rules. Similarly, Fig. 4 plots the WER as a function of the factor γ for the RT05 evaluation data on the meeting recognition task. The lowest WER is again achieved for γ equal to 1E-3. Thus we conclude that this value is effective for clean and noisy data, in both small and large vocabulary systems.

In order to investigate whether the improvement with respect to the inverse-entropy rule comes from the use of the

Table 3
WER for inverse-entropy and α-entropy combinations on OGI Numbers (matched and mismatched conditions) and on RT05.

Features     OGI Numbers matched  OGI Numbers mismatched  RT05
Inv-entropy  2.9                  3.8                     40.4
α-entropy    3.0                  3.8                     40.5


function α (Eq. (35)) or from the Dempster–Shafer combination itself, we propose the following experiment. Rather than performing the linear combination using an estimate of the inverse entropy, the linear weighting is done according to the value of α, i.e.:

p_i = w_a · p_i^a + w_b · p_i^b                                           (37)

with w_a = α_a/(α_a + α_b), w_b = α_b/(α_a + α_b) and

α_a = (1 − H(p^a)/Hmax)^γ,  α_b = (1 − H(p^b)/Hmax)^γ

We refer to this rule as the α-entropy combination. As in the inverse-entropy combination, the stream with the lower entropy receives the largest weight and vice versa. Experiments are run with the same setups as Sections 7.1 and 7.2 and reported in Table 3. Inverse entropy and α-entropy provide similar results. Both are worse than those achieved with the Dempster–Shafer combination. This suggests that the improvements come from the combination rule rather than from the weighting scheme.
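As an illustrative sketch (not the authors' code), the α-entropy rule of Eq. (37) differs from the inverse-entropy rule only in how the weights are computed; the example vectors below are hypothetical:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Natural-log entropy of a posterior vector."""
    return -np.sum(p * np.log(p + eps))

def alpha_entropy_rule(p_a, p_b, gamma=1e-3):
    """Linear combination of Eq. (37): weights proportional to the
    belief values alpha of Eq. (35) instead of the inverse entropies."""
    h_max = np.log(len(p_a))
    a_a = np.clip(1.0 - entropy(p_a) / h_max, 0.0, 1.0) ** gamma
    a_b = np.clip(1.0 - entropy(p_b) / h_max, 0.0, 1.0) ** gamma
    w_a = a_a / (a_a + a_b)
    w_b = a_b / (a_a + a_b)
    return w_a * p_a + w_b * p_b

# Hypothetical posteriors: a confident stream and an uncertain one
p_low = np.array([0.8, 0.1, 0.1])    # low entropy
p_high = np.array([0.4, 0.3, 0.3])   # high entropy
print(alpha_entropy_rule(p_low, p_high))
```

As in the inverse-entropy rule, the lower-entropy stream receives the larger weight; only the weighting function changes.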

7.4. Other experiments

Recently, the Dempster–Shafer combination has been tested in several systems that make use of MLP-based features. In (Hwang et al., 2007), the DS rule has been compared with inverse entropy for combining MLP features in a large vocabulary system for Mandarin Broadcast data transcription. In contrast to meeting data, there is no significant channel mismatch. Results in (Hwang et al., 2007) show that DS outperforms inverse entropy on Broadcast data as well.

Among other related applications, we report (Thomas et al., 2008), where the authors used the DS rule for combining MLP classifiers trained on different spectro-temporal speech representations. The system results in improved speech recognition over telephone channels.

8. Conclusion

In this paper, we extensively studied the use of the Dempster–Shafer combination rule for multi-stream speech recognition. The initial motivation lies in the similarity between the DS orthogonal sum and observations on how humans recognize speech in different sub-bands (Fletcher, 1953; Allen, 2005). Combination schemes based on the DS rule cannot be reduced to any of the linear or log-linear feature stream weightings previously proposed in the literature.

The paper briefly presents the Dempster–Shafer theory of evidence and proposes how to transform the probabilistic MLP output into a Basic Probability Assignment (BPA). We exhaustively studied the combination of Basic Probability Assignments. Three different combination schemes have been proposed, referred to as:

- DS combination, as it naturally arises from the Dempster–Shafer theory of evidence.
- "Product of Errors" combination, as it is obtained by combining simple support functions.
- Double combination, as it is obtained by applying the DS orthogonal sum twice. This combination was first proposed in (Galina, 1994).

Experiments were run on small and large vocabulary tasks and involved the combination of two feature streams, one obtained from a short-term and one obtained from a long-term representation of the signal, as currently employed in several LVCSR systems (Zhu et al., 2004; Plahl et al., 2009). The considered tasks used real or artificial mismatch between the two feature sets. Results reveal that:

- The DS combination is the most effective of the three proposed rules.
- The DS combination outperforms other conventional combination schemes like inverse entropy, sum and product.
- The Double combination and Product of Errors combinations do not provide any advantage w.r.t. the DS combination.
- In contrast to other combination schemes, DS never shows degradation w.r.t. the best feature stream, and it takes advantage of both streams even when one of them is considerably weaker than the other. This is verified for all the proposed tests.

These results have also been confirmed in other LVCSR systems (Hwang et al., 2007).

Here we limited the investigation to the combination of two feature streams, one derived from a short-term and one from a long-term representation of the speech signal. The choice of those particular feature sets comes from their application in LVCSR systems (see Zhu et al., 2004; Plahl et al., 2009). However, given that the DS combination is both associative and commutative, it can easily be extended to more than two feature streams, and thus to the combination of a large number of classifiers trained on different speech representations, as in (Kleinschmidt, 2002).
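Because the orthogonal sum is associative and commutative, the extension to N streams is a simple fold over per-stream BPAs. The sketch below assumes the singleton-plus-Θ BPA form used in this paper (mass α·p(q) on each phoneme singleton {q}, and 1 − α on the whole frame Θ); the final step, which spreads the residual Θ mass uniformly to recover a posterior, is one simple choice for illustration and not necessarily the paper's exact mapping:

```python
import numpy as np
from functools import reduce

def posterior_to_bpa(p, alpha):
    """BPA sketch: mass alpha*p(q) on each singleton {q},
    remaining mass 1 - alpha on the full frame Theta (ignorance)."""
    return alpha * p, 1.0 - alpha

def orthogonal_sum(m1, m2):
    """Dempster's orthogonal sum for two BPAs of the singleton+Theta form.
    Conflicting intersections ({q} with {q'}, q != q') are normalized out."""
    s1, t1 = m1
    s2, t2 = m2
    singles = s1 * s2 + s1 * t2 + t1 * s2   # intersections yielding {q}
    theta = t1 * t2                         # Theta with Theta
    norm = singles.sum() + theta            # 1 - conflict mass
    return singles / norm, theta / norm

def ds_combine(posteriors, alphas):
    """Combine N posterior streams; the orthogonal sum is associative
    and commutative, so the order of the fold does not matter."""
    bpas = [posterior_to_bpa(p, a) for p, a in zip(posteriors, alphas)]
    singles, theta = reduce(orthogonal_sum, bpas)
    # spread residual ignorance uniformly to obtain a posterior estimate
    return singles + theta / len(singles)

# Three hypothetical streams with their beliefs
p1, p2, p3 = (np.array([0.7, 0.2, 0.1]),
              np.array([0.5, 0.3, 0.2]),
              np.array([0.6, 0.2, 0.2]))
print(ds_combine([p1, p2, p3], alphas=[0.9, 0.8, 0.7]))
```

Reversing the stream order leaves the result unchanged, which is what makes the extension to many classifiers straightforward.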

Acknowledgments

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-06-C-0023. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not


necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA). The author thanks Jithendra Vepa, Thomas Hain and the AMI ASR team for their help with the meeting system. The author also thanks the anonymous reviewers for their comments.

References

Allen, J.B., 2005. Articulation and Intelligibility. Morgan and Claypool.
Bourlard, H., Dupont, S., 1996. A new ASR approach based on independent processing and re-combination of partial frequency bands. In: Proc. ICSLP 96.
Bourlard, H., Morgan, N., 1994. Connectionist Speech Recognition – A Hybrid Approach. Kluwer Academic Publishers.
Fletcher, H., 1953. Speech and Hearing in Communication. Krieger, New York.
Galina, L.R., 1994. Combining the results of several neural network classifiers. Neural Networks 7 (5), 777–781.
Hain, T. et al., 2005. The 2005 AMI system for the transcription of speech in meetings. In: NIST RT05 Workshop, Edinburgh, UK.
Hermansky, H., Fousek, P., 2005. Multi-resolution RASTA filtering for TANDEM-based ASR. In: Proc. Interspeech 2005.
Hermansky, H., Malayath, N., 1998. Spectral basis functions from discriminant analysis. In: Proc. ICSLP'98, Sydney, Australia.
Hermansky, H. et al., 1996. Towards ASR on partially corrupted speech. In: Proc. ICSLP.
Hermansky, H. et al., 2000. Connectionist feature extraction for conventional HMM systems. In: Proc. ICASSP.
Hwang, M.-Y. et al., 2007. Building a highly accurate Mandarin speech recognizer. In: Proc. IEEE Workshop on Automatic Speech Recognition and Understanding.
Kittler, J. et al., 1998. On combining classifiers. IEEE Trans. PAMI 20.
Kleinschmidt, M., 2002. Methods for capturing spectro-temporal modulations in automatic speech recognition. Acustica united with Acta Acustica 88 (3), 416–422.
Mandler, E.J., Schurman, J., 1988. Combining the classification results of independent classifiers based on Dempster/Shafer theory of evidence. Pattern Recognition and Artificial Intelligence X, 381–393.
Misra, H. et al., 2003. Entropy-based multi-stream combination. In: Proc. ICASSP.
Morgan, N. et al., 2004. Trapping conversational speech: extending TRAP/TANDEM approaches to conversational telephone speech recognition. In: Proc. ICASSP.
Plahl, C. et al., 2009. Recent improvements of the RWTH GALE Mandarin LVCSR system. In: Proc. Interspeech, Brisbane, Australia.
Rich transcription evaluation campaign. <http://www.nist.gov/speech/tests/rt/rt2005/spring/>.
Shafer, G., 1976. A Mathematical Theory of Evidence. Princeton University Press, Princeton.
Stolcke, A. et al., 2006. Cross-domain and cross-language portability of acoustic features estimated by multilayer perceptrons. In: Proc. ICASSP.
Thomas, S. et al., 2008. Hilbert envelope based spectro-temporal features for phoneme recognition in telephone speech. In: Proc. Interspeech.
Valente, F., Hermansky, H., 2007. Combination of acoustic classifiers based on Dempster–Shafer theory of evidence. In: Proc. ICASSP.
Valente, F. et al., 2007. Multi-stream features combination based on Dempster–Shafer rule for LVCSR system. In: Proc. Interspeech.
Xu, L. et al., 1992. Methods of combining multiple classifiers and their applications to handwriting recognition. IEEE Trans. Systems, Man and Cybernetics 22 (3), 418–435.
Zhu, Q., Chen, B., Morgan, N., Stolcke, A., 2004. On using MLP features in LVCSR. In: Proc. ICSLP.