
Page 1: [IEEE Proceedings of ICASSP '02 - Orlando, FL, USA (2002.05.13-2002.05.17)] IEEE International Conference on Acoustics Speech and Signal Processing - Causal analysis of Speech Recognition

CAUSAL ANALYSIS OF SPEECH RECOGNITION FAILURE

IN ADVERSE ENVIRONMENTS

Guojun Zhou, Michael E. Deisher, and Sangita Sharma

Intel Corporation, MIS JF2-86, 2111 NE 25th Ave, Hillsboro, OR 97124 Emails:{Guojun.Zhou.Michael.Deisher.Sangita.Sharma}@intel.com

ABSTRACT

A common approach to measuring the impact of noise and the effectiveness of noise mitigation (NM) algorithms for Automatic Speech Recognition (ASR) systems is to compare the word error rates (WERs). However, the WER measure does not give much insight into how an NM algorithm affects phoneme-level acoustic characteristics. Such insight can help in tuning the NM parameters and may also lead to reduced research time because the impact of an NM algorithm on ASR can first be investigated on smaller corpora. In this paper, two measures, phoneme error rate (PER) and phoneme confidence score (PCS), are investigated to assess the impact of NM algorithms on ASR performance. Experimental results using the TIMIT corpus show that both PER and PCS can help identify where the degradation from noise occurs as well as give a useful indication of how an NM algorithm may impact ASR performance. A diagnostic method based on these two measures is also proposed to assess the NM impact on ASR and help improve NM algorithm performance.

1. INTRODUCTION

Background noise can cause the performance of Automatic Speech Recognition (ASR) systems to degrade significantly. The degradation is mainly caused by the mismatch of the acoustic characteristics between the training and test data. One approach to reducing the mismatch is to simply retrain the ASR system under the test environment. This method, however, works only if the test environment is known and remains constant. There are many situations (e.g., in mobile applications) where the acoustic environment is changing and unpredictable, and thus it is not practical to retrain the ASR system. Another approach to addressing the mismatch issue is to pre-process the noisy speech signal using Noise Mitigation (NM) algorithms such that the pre-processed speech more closely matches the noise-free speech used in the training of the acoustic models. This approach, when achievable, is more practical than the retraining approach in solving the mismatch problem. Even when NM algorithms fail to

produce speech that matches the noise-free data used for training the acoustic models, they often produce speech whose statistics vary less than those of unprocessed speech across a range of acoustic environments. Therefore, significant improvement across a range of acoustic environments may be achieved by retraining only once on speech processed by an NM algorithm.

One direct and often-used approach to measuring the effectiveness of NM algorithms on speech recognition is to compare the word error rates (WERs) with and without pre-processing. Although this method reveals the benefit to the end-user, it yields few insights into the root cause of the errors. In particular, it is impossible to distinguish between errors due to poor articulation, environmental noise, or over/under-aggressive noise removal. These insights are useful since NM algorithms have a direct impact on the sub-word acoustic characteristics that in turn impact the WER. Hence, studying the effect of NM algorithms on sub-word recognition performance can lead to insightful (as opposed to ad hoc) tuning of the NM parameters and may also reduce development time because tuning can be conducted on smaller corpora.

This paper describes two phoneme-based measures to assess the impact of noise and NM algorithms on ASR performance. The first measure, phoneme error rate (PER), quantifies the misclassification rate of phonemes with respect to a known phonetic transcript. The PER indicates the average impact of noise and NM processing. It can also reveal the degree of confusion across different phoneme classes. The second measure, phoneme confidence score (PCS), provides a way of measuring the severity of errors. Averaged over the entire test corpus, the PCS provides an indication of the overall classifier performance. For a particular utterance, the PCS can identify "hot spots" where the recognizer is having difficulty. The diagnostic information provided by PER and PCS allows the noise mitigation developer to study performance trends at the phonemic level and detect damage done by over-aggressive noise mitigation algorithms. Subsequently, noise mitigation algorithms can be modified or "tuned" to avoid such damage and improve speech recognition performance.

0-7803-7402-9/02/$17.00 ©2002 IEEE IV - 3816


2. NOISE MITIGATION ASSESSMENT MEASURES

2.1. Phoneme Error Rate (PER)

The PER measures the phoneme classification accuracy. It is calculated as follows,

PER = (#insertions + #deletions + #substitutions) / (#phonemes).

This formula can be extended to compute the phoneme-class classification error rate by replacing each triphone in the transcripts with a phoneme class name based on the base phone in the triphone. In addition, phoneme or phoneme-class confusion matrices can be analyzed.
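As an illustrative sketch (the paper itself uses the NIST scoring tool for this step), the three error counts can be obtained from a standard Levenshtein alignment of the hypothesized phoneme sequence against the reference transcript; the function names here are hypothetical.

```python
def edit_counts(ref, hyp):
    """Return (#insertions, #deletions, #substitutions) for the minimal-edit
    alignment of hypothesis `hyp` against reference `ref`."""
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0)
    for i in range(len(ref) + 1):
        for j in range(len(hyp) + 1):
            if i == 0 and j == 0:
                continue
            cands = []
            if j > 0:                      # extra phoneme in hypothesis: insertion
                ins, dele, sub = dp[i][j - 1]
                cands.append((ins + 1, dele, sub))
            if i > 0:                      # missing phoneme: deletion
                ins, dele, sub = dp[i - 1][j]
                cands.append((ins, dele + 1, sub))
            if i > 0 and j > 0:            # match or substitution
                ins, dele, sub = dp[i - 1][j - 1]
                cands.append((ins, dele, sub + (ref[i - 1] != hyp[j - 1])))
            dp[i][j] = min(cands, key=sum)  # fewest total edits
    return dp[len(ref)][len(hyp)]

def per(ref, hyp):
    ins, dele, sub = edit_counts(ref, hyp)
    return (ins + dele + sub) / len(ref)
```

Running the same alignment on class-mapped sequences gives the phoneme-class error rate directly.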

2.2. Phoneme Confidence Score (PCS)

The PER does not measure the discriminability of phonemes, i.e., it does not measure the distance between the winning phoneme likelihood score and the scores of other competing phoneme models. A higher distance would indicate better phoneme discriminability, and hence a distance measure could be useful to assess an NM algorithm's effectiveness. One approach to measuring the improvement in phoneme discriminability is to measure the PCS during recognition. The PCS proposed in this paper is computed with reference to a known phonemic transcript. Let n be the index (0, ..., N-1) of the phones within an utterance (or corpus of utterances). Let d_n denote the number of frames in the n-th segment. Let Y_n be the feature sequence corresponding to the n-th segment. Then the confidence score for the segment n with respect to phoneme j is

conf_j(n) = (1/d_n) log [ φ_j(Y_n) / max_{k: ω_k ≠ ω_j} φ_k(Y_n) ],    (1)

where

φ_j(Y_n) = max_{X_n} P(Y_n | X_n, λ_j)    (2)

is the Viterbi likelihood score produced by the HMM model λ_j of the phoneme ω_j. A measure similar to

Equation (1) has been proposed in [2] and [3] in the context of using acoustic confidence to reject out-of-vocabulary words. However, Equation (1) differs from these measures in that it considers only the most confusable phone likelihood in the confidence computation.

Given the segmental PCS, the average phoneme confidence can be obtained by

PCS = (1/N_c) Σ_{n=1}^{N_c} conf_j(n),  ω_j correctly chosen.    (3)

The PCS provides a single number that indicates the overall discriminability of the acoustic model when the correct phoneme is chosen. Here, N_c is the number of speech segments that were correctly classified.
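Under the definitions above, a minimal sketch of Eqs. (1) and (3) might look as follows; the dictionary layout and field names are hypothetical stand-ins for the decoder's per-segment likelihood output.

```python
import math

def segment_confidence(phi, j, d_n):
    """Eq. (1): conf_j(n) = (1/d_n) * log(phi_j / max over competitors k != j).
    `phi` maps each candidate phoneme to its Viterbi likelihood phi_k(Y_n)."""
    best_competitor = max(p for k, p in phi.items() if k != j)
    return math.log(phi[j] / best_competitor) / d_n

def average_pcs(segments):
    """Eq. (3): average conf_j(n) over the N_c correctly classified segments."""
    correct = [s for s in segments if s["winner"] == s["reference"]]
    return sum(segment_confidence(s["phi"], s["reference"], s["frames"])
               for s in correct) / len(correct)
```

A positive score means the correct model beat its closest competitor; scores near zero flag segments where discrimination is marginal.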

3. EVALUATION FRAMEWORK

In order to compute the PER and PCS, the decoder of a large vocabulary continuous speech recognizer (LVCSR) was modified to output: 1) the best phoneme sequence with segmentation information, and 2) for a given input segment, the segmental likelihood scores of all possible phonemes. Since the LVCSR system we used, like most others, uses the triphone (or context-dependent phoneme) as the basic acoustic unit, triphones rather than monophones are used for the computation of the PER and PCS.

The NIST scoring tool [4] was adopted for the PER computation. For the PCS computation, a practical limitation upon the implementation of equation (1) is the number of computations involved when scores for all the triphones are computed. For example, there are around 10,000 triphones in the WSJ-based acoustic models (WSJ is an LDC corpus consisting mainly of read Wall Street Journal articles [5]). Therefore, it is desirable to perform the computation using only the correct triphone and a subset of the competing triphones. The triphone candidate subset for each triphone is specified in advance to the decoder. This approach is flexible enough to allow for scoring across arbitrary triphone classes. As an example, in the case of the two classes, vowel and non-vowel, the triphone candidate list could be constructed such that, for each triphone belonging to the vowel class, competing candidates are taken from the non-vowel class only (and vice versa). In this evaluation, the triphone candidate list was constructed by choosing those triphones that have the same context (right & left phones) as candidates for a given triphone.
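The same-context candidate selection described above can be sketched as follows; the "l-b+r" triphone naming is an HTK-style convention assumed here for illustration, not necessarily the one the paper's decoder uses.

```python
from collections import defaultdict

def candidate_lists(triphones):
    """For each triphone "l-b+r", the competitors are the other triphones
    sharing the same left context l and right context r."""
    by_context = defaultdict(list)
    context = {}
    for tri in triphones:
        left, rest = tri.split("-")
        base, right = rest.split("+")
        context[tri] = (left, right)
        by_context[(left, right)].append(tri)
    return {tri: [t for t in by_context[context[tri]] if t != tri]
            for tri in triphones}
```

Swapping the grouping key (e.g., vowel vs. non-vowel class of the base phone) yields the class-based candidate lists mentioned above.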

Mel-frequency cepstral coefficient (MFCC) features and the WSJ-trained acoustic models are used for the decoding. Since the purpose of these two phoneme-level measures is to assess the effects of signal quality upon acoustic model performance, language models as well as lexicon trees, normally applied during decoding to improve the general ASR accuracy, are intentionally removed. Both the train and test portions of the TIMIT corpus, comprising a total of 6,300 utterances, are used for evaluation. The noisy version of TIMIT was obtained by playing TIMIT utterances into a noisy room and then re-recording the speech. The re-recorded version of the TIMIT data included the ambient noise in the room, mainly from computer cooling fans and overhead air-conditioning ducts. The phoneme-level segmentation of the original version was preserved for the noisy version. The Ephraim-Malah noise suppression algorithm [1] was employed to remove noise from the noisy TIMIT data.

Our initial experiments showed that many triphones are mis-recognized because of the mismatch of phone context (i.e., the left and right phones) and the confusion with



phones that are acoustically similar to the base phone (mid phone). This observation suggested that phone-class-based PER and PCS could be more useful. Hence, in this evaluation, we divided all triphones into 8 basic phone classes based on their base phones. The phone classes considered were vowels, fricatives & affricatives, diphthongs, nasals, whispers, stops, semivowels, and others, as shown in Table 1.

Class Name       Base Phones
DI (DIPHTHONG)   ay, oy, ey, ow, aw, iy
FR (FRICATIVE)   s, th, f, sh, z, zh, dh, v, jh, ch
NA (NASAL)       m, n, ng, em, en
SE (SEMIVOWEL)   w, l, el, r, y
ST (STOP)        p, t, k, b, d, g
VO (VOWEL)       ih, eh, ae, aa, er, ah, ao, ux, uw, uh, ix, ax, axr
WH (WHISPER)     hv, hh
OT (OTHER)       sil, sp, pau

Table 1. Phone Classes Used in the Evaluation

4. RESULTS

The PER scores averaged over the entire test corpus are shown in Table 2. Although the triphone-based average PERs are very high, especially for the noisy and processed speech, they are still much lower than a random guess, which would be on the order of 99.99% (= 1 - 10^-4, since there are a total of about 10^4 triphones). As mentioned in Section 3, triphones are very confusable with triphones having the same context and also with triphones having acoustically similar base phones. Hence, the PER based solely on triphones may not be a good indicator for NM algorithms. Therefore, we further computed the phone-class-based PERs, which, as expected, are much lower than the triphone-based PERs. The 5.4% relative phone-class-based PER improvement of the processed speech over the noisy speech indicates that the NM algorithm under test does improve acoustic model performance.
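For reference, the 5.4% relative improvement quoted above follows directly from the phone-class PERs reported in Table 2:

```python
# Relative phone-class-based PER improvement, from the Table 2 values.
noisy_per, processed_per = 46.34, 43.83
relative_improvement = (noisy_per - processed_per) / noisy_per
print(f"{100 * relative_improvement:.1f}%")  # prints 5.4%
```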

Table 3 shows the average PCSs for phonemes and classes in general and for each individual class. It is not a surprise to see that noise brings down the average PCS significantly across the board. For phonemes and classes in general (PCSs are averaged across all phonemes and classes), the NM algorithm helps increase the average PCS with respect to the PCS of noisy speech. The NM algorithm helps improve the average PCS for vowels and semivowels, but degrades the average PCS for diphthongs, fricatives & affricatives, nasals, stops, and whispers. Except for diphthongs, all the phoneme classes that have a lower average PCS after the NM algorithm is applied are consonants that have lower energy and are more vulnerable to noise. It is likely that the NM algorithm may increase the degree of confusion among them. As for the diphthongs, they consist

of two vowels with a long transition between them. Usually the first vowel has more energy than the second. This characteristic may cause a diphthong to be recognized as two separate phones (e.g., two vowels, or one vowel/semivowel plus one consonant), and this affects their average PCS.

PER                  Clean     Noisy     Processed
Triphone-based       76.62%    94.50%    94.50%
Phone-class-based    17.56%    46.34%    43.83%

Table 2. Triphone- and phone-class-based PERs

Table 3. Average PCSs [table values not recoverable from the scan]

The average PERs and PCSs can provide a general indication of the impact of the NM algorithm on ASR performance. A more detailed diagnostic analysis is still needed in order to find the "hot spots" where the NM algorithm needs to improve its performance.

5. DIAGNOSTIC ANALYSIS

The PER can be further broken down by phoneme class, as shown in Table 4. For clean speech, the PER does not vary much across the different phoneme classes except for whispers and "others" (the high PER for these two classes is expected because they have very low energy). The noise, however, increases the PER by a different amount for each class, with the most degradation for some consonants (e.g., fricatives, stops, and others) but relatively less degradation for vowels and nasals. This is probably due to the high energy in vowel-like phones. We further analyzed the confusability across different phoneme classes by calculating the mis-recognition across classes based on the substitution errors in the phoneme-class-based PER computation. Table 5 shows the most confusable phoneme classes under different acoustic conditions. It can be seen that the noise significantly changes the confusion distribution for each phoneme class, especially for consonants such as fricatives. The NM algorithm is able to bring the distribution patterns back closer to those under the clean condition for vowels and semivowels, but is not as successful for consonants such as fricatives and stops.
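The cross-class confusion distribution described above can be computed from the substitution pairs produced by the class-based alignment; this is an illustrative sketch in which the input is assumed to be a list of (reference_class, hypothesis_class) pairs.

```python
from collections import Counter

def confusion_distribution(substitutions):
    """Per-class confusion percentages, as in Table 5.
    `substitutions` is a list of (reference_class, hypothesis_class) pairs
    taken from the substitution errors of the class-based PER alignment."""
    totals = Counter(ref for ref, _ in substitutions)
    pairs = Counter(substitutions)
    dist = {}
    for (ref, hyp), count in pairs.items():
        # percentage of ref-class substitutions that went to hyp-class
        dist.setdefault(ref, {})[hyp] = round(100.0 * count / totals[ref])
    return dist
```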



Condition   DI   FR   NA   SE   ST   VO   WH   OT
Clean       –    –    –    –    –    –    –    –
Noisy       41   58   32   39   50   30   50   93
Processed   45   –    41   31   43   –    70   97

Table 4. PER of each phone class (%); cells marked "–" are illegible in the source scan

Phone  Most Confusable Classes (sum >= 70%)
Class  Clean                           Noisy                           Processed
DI     VO(61), SE(17)                  VO(49), SE(15), NA(9)           VO(50), SE(20)
FR     ST(46), SE(19), OT(9)           ST(22), WH(21), NA(20), SE(14)  ST(26), SE(25), NA(19)
NA     ST(31), VO(29), SE(19)          ST(25), VO(21), SE(16), FR(13)  ST(32), SE(25), VO(22)
SE     FR(26), DI(25), ST(18), VO(15)  DI(25), NA(21), WH(17), FR(14)  DI(22), NA(21), FR(19), ST(17)
ST     FR(38), NA(17), SE(13), OT(12)  NA(37), VO(16), FR(12), WH(12)  NA(31), VO(22), SE(19)
VO     DI(55), SE(20), ST(14)          DI(33), NA(28), ST(16)          DI(39), NA(22)
WH     ST(38), FR(20), VO(13)          NA(37), ST(18), SE(13), DI(12)  NA(30), ST(20), SE(18), DI(12)
OT     ST(45), VO(16), SE(11)          VO(43), NA(23), ST(12)          VO(41), NA(20), ST(18)

Table 5. Confusion Distribution between Classes (%)

Analysis of the class-based PER and cross-class confusion distribution shows that the NM algorithm does improve the recognition of vowels or vowel-like phones, and suggests that further improvements are needed to more effectively remove noise from consonants without changing their acoustic characteristics. The analysis of cross-class confusion distributions, however, does not explain why the average PCS of diphthongs degrades after the NM algorithm is applied. To examine this case, a phrase ("stake-out anyway") with several diphthongs was selected for PCS analysis. The PCS for each phoneme in this phrase is plotted in Fig. 1, from which it can be seen that the PCS of every phoneme except the diphthong /ey/ is increased after the NM algorithm. Note that the PCS increase for consonants after the NM algorithm is not typical. A further look at the spectrogram of the phrase under the clean, noisy, and processed conditions suggests that the NM algorithm is over-aggressive for the diphthong /ey/. Although harsh attenuation is also evident for other phonemes, especially /aw/ and /iy/, their spectrograms are closer to the clean ones than the noise-degraded spectrograms.

[Figure 1: bar chart of the per-phoneme PCS (vertical axis roughly -10 to 25) under the Clean, Noisy, and Processed conditions, over the phonemes /s/ /t/ /ey/ /k/ /aw/ /t/ /ih/ /n/ /iy/ /w/ /ey/ (classes FR ST DI ST DI ST VO NA DI SE DI).]

Figure 1. PCS Analysis for Phrase "stake-out anyway"

[Figure 2: three spectrograms (0-7 kHz) of the phrase over the phonemes /s/ /t/ /ey/ /k/ /aw/ /t/ /ih/ /n/ /iy/ /w/ /ey/.]

Figure 2. Spectrograms of the phrase "stake-out anyway" ((a) processed, (b) noisy, (c) clean)

As a point of reference, it should be noted that this NM algorithm does help the LVCSR reduce the WER on the WSJ test set by about 50% for the noise condition evaluated in this paper.

6. CONCLUSIONS

Two phoneme-level measures, the phoneme error rate and the phoneme confidence score, were used to assess the impact of a noise mitigation algorithm on speech recognition performance. The evaluation results show that the two measures can provide a helpful indication of NM algorithm performance in terms of improving speech recognition in noisy environments. The diagnostic analysis based on these two measures can further provide insight into which phoneme classes need to be improved for an NM algorithm under consideration.

7. ACKNOWLEDGEMENT

The authors would like to thank Baosheng Yuan, Jinyu Li, and Jian Li of Intel China Research Center for helpful discussions and modifications to the LVCSR decoder. The authors are also grateful to David Graumann for reviewing this paper.

8. REFERENCES

[1] Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator", IEEE Trans. ASSP, Vol. 32, No. 6, Dec. 1984, pp. 1109-1121.

[2] G. Bouwman, L. Boves, and J. Koolwaaij, "Weighting phone confidence measures for automatic speech recognition", COST249 Workshop on Voice Operated Telecom Services, Ghent, Belgium, pp. 59-62.

[3] T. Jitsuhiro, S. Takahashi, and K. Aikawa, "Rejection of out-of-vocabulary words using phoneme confidence likelihoods", IEEE Conference on Acoustics, Speech and Signal Processing, 1998, Vol. 1, pp. 217-220.

[4] NIST SCORE 3.6.2, http://www.nist.gov/speech/tools/index.htm.

[5] Linguistic Data Consortium, http://www.ldc.upenn.edu/.
