Automatic Detection of Poor Speech Recognition at the Dialogue Level




Diane J. Litman, Marilyn A. Walker and Michael S. Kearns
AT&T Labs Research

180 Park Ave, Bldg 103
Florham Park, N.J. 07932

    diane,walker,[email protected]

Abstract

The dialogue strategies used by a spoken dialogue system strongly influence performance and user satisfaction. An ideal system would not use a single fixed strategy, but would adapt to the circumstances at hand. To do so, a system must be able to identify dialogue properties that suggest adaptation. This paper focuses on identifying situations where the speech recognizer is performing poorly. We adopt a machine-learning approach to learn rules from a dialogue corpus for identifying these situations. Our results show a significant improvement over the baseline and illustrate that quantitative, qualitative and acoustic features all affect the learner's performance.

1 Introduction

Builders of spoken dialogue systems face a number of fundamental design choices that strongly influence both performance and user satisfaction. Examples include choices between user, system, or mixed initiative, and between explicit and implicit confirmation of user commands. An ideal system wouldn't make such choices a priori, but rather would adapt to the circumstances at hand. For instance, a system detecting that a user is repeatedly uncertain about what to say might move from user to system initiative, and a system detecting that speech recognition performance is poor might switch to a dialogue strategy with more explicit prompting, an explicit confirmation mode, or keyboard input mode. Any of these adaptations might have been appropriate in dialogue D1 from the Annie system (Kamm et al., 1998), in Figure 1.

In order to improve performance through such adaptation, a system must first be able to identify, in real time, salient properties of an ongoing dialogue that call for some useful change in system strategy. In other words, adaptive systems should try to automatically identify actionable properties of ongoing dialogues.

Previous work has shown that speech recognition performance is an important predictor of user satisfaction, and that changes in dialogue behavior impact speech recognition performance (Walker et al., 1998; Litman et al., 1998; Kamm et al., 1998). Therefore, in this work, we focus on the task of automatically detecting poor speech recognition performance in several spoken dialogue systems at AT&T Labs. Rather than hand-crafting rules that classify speech recognition performance in an ongoing dialogue, we take a machine-learning approach. We begin with a collection of system logs from actual dialogues that were labeled by humans as having had “good” or “bad” speech recognition (the training set). We then apply standard machine-learning algorithms to this training set in the hope of discovering, in a principled manner, classifiers that can automatically detect poor speech recognition during novel dialogues.

In order to train such classifiers, we must provide them with a number of salient properties or “features” of dialogues derived from the system logs that might allow the system to automatically identify poor recognition performance. In addition to identifying features that provide the best quantitative solutions, we are also interested in comparing the performance of classifiers derived solely from “acoustic” features or from “high-level” dialogue features.

Actual Dialogue                                                        ASR Interpretation   ASR Confidence
A1: Annie here.
U1: Hello Annie, could you give me the information on a meeting
    I have tomorrow?                                                   REJECT               -4.84
A2: What was that?
U2: I need to find out uh, a meeting time and place. Kim sent me
    a message on it.                                                   REJECT               -4.20
A3: For all options say help me out.
U3: Help me out.                                                       REJECT               -6.35
A4: Annie here.
U4: Help.                                                              help                 -1.74
A5: Help. Say call or say send a message, followed by a label or
    a number. Or say get my messages...
U5: Get messages. (user interrupts A5)                                 get me sips          -3.33
A6: Speech and Image Processing Lab.
U6: Help                                                               help                 -1.78
A7: Help. Say the first and last name ...
U7: Kim.                                                               annie                -3.81
. . .

Figure 1: Example dialogue excerpt D1 with Annie.

Since the dialogue systems we examine use automatic speech recognition (ASR), one obvious feature available in the system log is a per-utterance score from the speech recognizer representing its confidence in its interpretation of the user's utterance (Zeljkovic, 1996). For dialogue D1, the recognizer's output and the associated confidence scores are in the last two columns of Figure 1. These confidence measures are based solely on acoustic information and are typically used by the dialogue manager to decide whether it believes it has correctly understood the user's utterance. Note that since our classification problem is defined by speech recognition performance, it might be argued that this confidence feature (or features derived from it) suffices for accurate classification.

However, an examination of the transcript in D1 suggests that other useful features might be derived from global or high-level properties of the dialogue history, such as features representing the system's repeated use of diagnostic error messages (utterances A2 and A3), or the user's repeated requests for help (utterances U4 and U6).

Although the work presented here focuses exclusively on the problem of automatically detecting poor speech recognition, a solution to this problem clearly suggests system reaction, such as the strategy changes mentioned above. In this paper, we report on our initial experiments, with particular attention paid to the problem definition and methodology, the best performance we obtain via a machine-learning approach, and the performance differences between classifiers based on acoustic and higher-level dialogue features.

2 Systems, Data, Methods

This section describes experiments that use the machine learning program RIPPER (Cohen, 1996) to automatically induce a poor speech recognition performance classification model from a corpus of spoken dialogues. Our corpus consists of a set of 544 dialogues (over 40 hours of speech) between humans and one of three dialogue systems: ANNIE (Kamm et al., 1998), an agent for voice dialing and messaging; ELVIS (Walker et al., 1998), an agent for accessing email; and TOOT (Litman et al., 1998), an agent for accessing online train schedules. Each agent was implemented using a general-purpose platform for phone-based spoken dialogue systems (Kamm et al., 1997). The dialogues were obtained in controlled experiments designed to evaluate dialogue strategies for each agent. The experiments required users to complete a set of application tasks in conversations with a particular version of the agent. The experiments resulted in both a digitized recording and an automatically produced system log for each dialogue. In addition, each user utterance was manually labeled as to whether it had been semantically misrecognized, by listening to the recordings while examining the system log. If the recognizer's output did not correctly capture the task-related information in the utterance, it was labeled as a misrecognition.

The dialogue recordings, system logs, and utterance labelings were produced independently of the machine-learning experiments described here.

RIPPER (like other learning programs, e.g., C5.0 and CART) takes as input the names of a set of classes to be learned, the names and possible values of a fixed set of features, and training data specifying the class and feature values for each example in a training set; it outputs a classification model for predicting the class of future examples. In RIPPER, the classification model is learned using greedy search guided by an information gain metric, and is expressed as an ordered set of if-then rules.
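For concreteness, the sketch below (our own illustration, not RIPPER's actual input format) shows what this interface amounts to in code: class names, a typed feature set, and training examples pairing feature values with a class. The feature names follow Figure 2, and the example values are those reported for dialogue D1 in Figure 3; everything else is an assumption made for the sketch.

```python
# Sketch of the learner's input, not RIPPER's actual file format.
classes = ["good", "bad"]

# A few feature names from Figure 2, with the value types described in the text.
features = {
    "elapsed time": "continuous",   # seconds
    "rejection%":   "continuous",
    "system":       "symbolic",     # annie, elvis, or toot
    "ASR text":     "set-valued",   # tokens output by the recognizer
    # ... the remaining features of Figure 2 ...
}

# Training data: one (feature values, class) pair per labeled dialogue.
# The values shown are those of dialogue D1 (see Figure 3).
training_data = [
    ({"elapsed time": 300,
      "rejection%": 43,
      "system": "annie",
      "ASR text": {"REJECT", "help", "get", "me", "sips", "annie"}},
     "bad"),
    # ... 543 more dialogues ...
]
```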

Our corpus is used to construct the machine-learning inputs as follows. Each dialogue is assigned a class of either good or bad, by thresholding on the percentage of user utterances that are labeled as ASR misrecognitions. We use a threshold of 11% to balance the classes in our corpus.

Our classes thus reflect relative goodness with respect to a corpus. Our threshold yields 283 good and 261 bad dialogues. Dialogue D1 in Figure 1 would be classified as bad, because U5 and U7 (29% of the user utterances) are misrecognized.
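As a concrete illustration, here is a minimal sketch of this labeling step (our own code, not the paper's); the per-utterance flags for D1 follow the hand-labeling described above, and the exact handling of a dialogue sitting exactly on the 11% boundary is our assumption.

```python
def label_dialogue(misrecognized, threshold=0.11):
    """Assign a class by thresholding the fraction of misrecognized user utterances.

    misrecognized: one boolean per user utterance (True = semantic misrecognition).
    """
    rate = sum(misrecognized) / len(misrecognized)
    return "bad" if rate > threshold else "good"

# Dialogue D1: seven user utterances, of which U5 and U7 are misrecognized (29% > 11%).
d1_flags = [False, False, False, False, True, False, True]
print(label_dialogue(d1_flags))   # -> bad
```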

Each dialogue is represented in terms of the 23 features in Figure 2. In RIPPER, feature values are continuous (numeric), set-valued, or symbolic. Feature values are automatically computed from system logs, based on five knowledge sources: acoustic, dialogue efficiency, dialogue quality, experimental variables, and lexical. Previous work correlating misrecognition rate with acoustic information, as well as our own hypotheses about the relevance of other types of knowledge, contributed to our features.

The acoustic, dialogue efficiency, and dialogue quality features are all numeric-valued. The acoustic features are computed from each utterance's confidence (log-likelihood) scores (Zeljkovic, 1996). Mean confidence represents the average log-likelihood score for utterances not rejected during ASR. The four pmisrecs% (predicted percentage of misrecognitions) features represent different (coarse) approximations to the distribution of log-likelihood scores in the dialogue.

Acoustic Features
mean confidence, pmisrecs%1, pmisrecs%2, pmisrecs%3, pmisrecs%4

Dialogue Efficiency Features
elapsed time, system turns, user turns

Dialogue Quality Features
rejections, timeouts, helps, cancels, bargeins (raw)
rejection%, timeout%, help%, cancel%, bargein% (normalized)

Experimental Variable Features
system, user, task, condition

Lexical Features
ASR text

Figure 2: Features for spoken dialogues.

Each pmisrecs% feature uses a fixed threshold value to predict whether a non-rejected utterance is actually a misrecognition, then computes the percentage of user utterances that correspond to these predicted misrecognitions. (Recall that our dialogue classifications were determined by thresholding on the percentage of actual misrecognitions.) For instance, pmisrecs%1 predicts that if a non-rejected utterance has a confidence score below -2 then it is a misrecognition. The four thresholds used for the four pmisrecs% features are -2, -3, -4, and -5, and were chosen by hand from the entire dataset to be informative.
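These features can be reproduced directly from the per-utterance scores in Figure 1. The following sketch (ours; the strict "below threshold" comparison is an assumption) recovers the acoustic feature values that Figure 3 later reports for D1:

```python
def acoustic_features(utterances, thresholds=(-2, -3, -4, -5)):
    """utterances: one (rejected, confidence) pair per user utterance in a dialogue."""
    accepted = [conf for rejected, conf in utterances if not rejected]
    feats = {"mean confidence": sum(accepted) / len(accepted)}
    for i, t in enumerate(thresholds, start=1):
        # Predict a misrecognition when a non-rejected utterance scores below the
        # threshold, then express the count as a percentage of all user utterances.
        predicted = sum(1 for rejected, conf in utterances
                        if not rejected and conf < t)
        feats[f"pmisrecs%{i}"] = round(100 * predicted / len(utterances))
    return feats

# Dialogue D1 (Figure 1): U1-U3 were rejected; U4-U7 were accepted.
d1 = [(True, -4.84), (True, -4.20), (True, -6.35),
      (False, -1.74), (False, -3.33), (False, -1.78), (False, -3.81)]
print(acoustic_features(d1))
# mean confidence ~ -2.7, pmisrecs%1 = 29, pmisrecs%2 = 29, pmisrecs%3 = 0, pmisrecs%4 = 0
```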

The dialogue efficiency features include elapsed time (the dialogue length in seconds), and system turns and user turns (the number of turns for each dialogue participant).

The dialogue quality features assess the naturalness of the dialogue. Rejections represents the number of times that the system plays special rejection prompts, e.g., utterances A2 and A3 in dialogue D1. This occurs whenever the ASR confidence score falls below a threshold associated with the ASR grammar for each system state (where the threshold was chosen by the system designer). The rejections feature differs from the pmisrecs% features in several ways.

First, the pmisrecs% thresholds are used to determine misrecognitions rather than rejections. Second, the pmisrecs% thresholds are fixed across all dialogues and are not dependent on system state. Third, a system rejection event directly influences the dialogue via the rejection prompt, while the pmisrecs% thresholds have no corresponding behavior.

Timeouts represents the number of times that the system plays special timeout prompts because the user hasn't responded. Helps represents the number of times that the system responds to a user request with a (context-sensitive) help message. Cancels represents the number of user requests to undo the system's previous action. Bargeins represents the number of user attempts to interrupt the system while it is speaking.[1] In addition to the raw counts, each feature is represented in normalized form by expressing the feature as a percentage. For example, rejection% represents the number of rejected user utterances divided by the total number of user utterances.
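A minimal sketch (our own) of this normalization; with D1's raw counts and its 7 user utterances it reproduces the percentages that appear later in Figure 3 (rounding to whole percentages is our assumption):

```python
def normalize_quality(raw_counts, user_utterances):
    """Express each raw dialogue quality count as a percentage of the user's utterances."""
    return {name + "%": round(100 * count / user_utterances)
            for name, count in raw_counts.items()}

# Raw counts for D1 (see Figure 3): 3 rejections, 0 timeouts, 2 helps, 0 cancels, 1 bargein.
d1_counts = {"rejection": 3, "timeout": 0, "help": 2, "cancel": 0, "bargein": 1}
print(normalize_quality(d1_counts, user_utterances=7))
# -> {'rejection%': 43, 'timeout%': 0, 'help%': 29, 'cancel%': 0, 'bargein%': 14}
```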

The experimental variable features each have a different set of user-defined symbolic values. For example, the value of the feature system is either annie, elvis, or toot. These features capture the conditions under which the dialogue was collected.

The lexical feature ASR text is set-valued, and represents the transcript of the user's utterances as output by the ASR component.
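A small sketch (ours) of how such a set-valued feature might be built and queried; the tokens shown are those of D1, and the membership test corresponds to the "contains" predicate used by the lexical rules shown later in Figure 6:

```python
def asr_text_feature(asr_outputs):
    """Collect a dialogue's ASR outputs into a set of tokens (a set-valued feature)."""
    return {token for output in asr_outputs for token in output.split()}

# ASR output for each user utterance of D1 (middle column of Figure 1).
d1_tokens = asr_text_feature(
    ["REJECT", "REJECT", "REJECT", "help", "get me sips", "help", "annie"])
# d1_tokens == {'REJECT', 'help', 'get', 'me', 'sips', 'annie'}

# Rules such as those in Figure 6 test membership, e.g. (ASR text contains cancel):
print("cancel" in d1_tokens)   # -> False for D1
```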

The final input for learning is training data, i.e., a representation of a set of dialogues in terms of feature and class values. In order to induce classification rules from a variety of feature representations, our training data is represented differently in different experiments. Our learning experiments can be roughly categorized as follows. First, examples are represented using all of the features in Figure 2 (to evaluate the optimal level of performance). Figure 3 shows how Dialogue D1 from Figure 1 is represented using all 23 features. Next, examples are represented using only the features in a single knowledge source (to comparatively evaluate the utility of each knowledge source for classification), as well as using features from two or more knowledge sources (to gain insight into the interactions between knowledge sources). Finally, examples are represented using feature sets corresponding to hypotheses in the literature (to empirically test theoretically motivated proposals).

[1] This feature was hand-labeled.


The output of each machine-learning experiment is a classification model learned from the training data. To evaluate these results, the error rates of the learned classification models are estimated using the resampling method of cross-validation (Weiss and Kulikowski, 1991). In 25-fold cross-validation, the total set of examples is randomly divided into 25 disjoint test sets, and 25 runs of the learning program are performed. Each run uses the examples not in its test set for training and the held-out examples for testing. An estimated error rate is obtained by averaging the error rate on the testing portion of the data from each of the 25 runs.
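A compact sketch (ours, with a generic train function standing in for RIPPER, and a fixed random seed added for reproducibility) of the estimate just described:

```python
import random

def cross_validation_error(examples, train, k=25, seed=0):
    """Estimate error by averaging over k disjoint test sets (Weiss and Kulikowski, 1991).

    examples: list of (features, label) pairs.
    train: function mapping a training list to a classifier, i.e. a function
           from feature values to a predicted label (e.g. a RIPPER-like learner).
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]      # k disjoint test sets
    error_rates = []
    for i, test in enumerate(folds):
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        classifier = train(training)
        errors = sum(1 for feats, label in test if classifier(feats) != label)
        error_rates.append(errors / len(test))
    return sum(error_rates) / k
```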

3 Results

Figure 4 summarizes our experimental results. For each feature set, we report accuracy rates and standard errors resulting from cross-validation.[2] It is clear that performance depends on the features that the classifier has available. The BASELINE accuracy rate results from simply choosing the majority class, which in this case means predicting that the dialogue is always good. This leads to a 52% BASELINE accuracy.

The REJECTION% accuracy rate arises from a classifier that has access to the percentage of dialogue utterances in which the system played a rejection message to the user. Previous research suggests that this acoustic feature predicts misrecognitions, because users modify their pronunciation in response to system rejection messages in such a way as to lead to further misunderstandings (Shriberg et al., 1992; Levow, 1998). However, despite our expectations, the REJECTION% accuracy rate is not better than the BASELINE.

Using the EFFICIENCY features does improve the performance of the classifier significantly above the BASELINE (61%). These features, however, tend to reflect the particular experimental tasks that the users were doing.

[2] Accuracy rates are statistically significantly different when the accuracies plus or minus twice the standard error do not overlap (Cohen, 1995, p. 134).
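The criterion in footnote 2 amounts to a simple interval check; a small sketch (ours) applied to two pairs of values from Figure 4:

```python
def significantly_different(acc1, se1, acc2, se2):
    """True when the acc +/- 2*SE intervals of two classifiers do not overlap."""
    lo1, hi1 = acc1 - 2 * se1, acc1 + 2 * se1
    lo2, hi2 = acc2 - 2 * se2, acc2 + 2 * se2
    return hi1 < lo2 or hi2 < lo1

# LEXICAL (72.0, SE 1.7) vs. BEST ACOUSTIC (72.6, SE 2.0): intervals overlap.
print(significantly_different(72.0, 1.7, 72.6, 2.0))   # -> False (not significant)
# EFFICIENCY (61.0, SE 2.2) vs. ALL FEATURES (77.4, SE 2.2): intervals do not overlap.
print(significantly_different(61.0, 2.2, 77.4, 2.2))   # -> True (significant)
```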

mean confidence = -2.7, pmisrecs%1 = 29, pmisrecs%2 = 29, pmisrecs%3 = 0, pmisrecs%4 = 0,
elapsed time = 300, system turns = 7, user turns = 7,
rejections = 3, timeouts = 0, helps = 2, cancels = 0, bargeins = 1,
rejection% = 43, timeout% = 0, help% = 29, cancel% = 0, bargein% = 14,
system = annie, user = mike, task = day1, condition = novices without tutorial,
ASR text = REJECT REJECT REJECT help get me sips help annie

Figure 3: Feature representation of dialogue D1.

Features Used                               Accuracy (Standard Error)
BASELINE                                    52%
REJECTION%                                  54.5% (2.0)
EFFICIENCY                                  61.0% (2.2)
EXPERIMENT VARS                             65.5% (2.2)
DIALOGUE QUALITY (NORMALIZED)               65.9% (1.9)
MEAN CONFIDENCE                             68.4% (2.0)
EFFICIENCY + NORMALIZED QUALITY             69.7% (1.9)
LEXICAL                                     72.0% (1.7)
BEST ACOUSTIC                               72.6% (2.0)
EFFICIENCY + QUALITY + EXPERIMENT VARS      73.4% (1.9)
ALL FEATURES                                77.4% (2.2)

Figure 4: Accuracy rates for dialogue classifiers using different feature sets, 25-fold cross-validation on 544 dialogues.


The EXPERIMENT VARS (experimental variables) features are even more specific to this dialogue corpus than the efficiency features: these features consist of the name of the system, the experimental subject, the experimental task, and the experimental condition (dialogue strategy or user expertise). This information alone allows the classifier to substantially improve over the BASELINE classifier, by associating bad dialogues with particular experimental conditions (the mixed-initiative dialogue strategy, or novice users without a tutorial) or with systems that were run with particularly hard tasks (TOOT). Since these features are specific to this corpus, we wouldn't expect them to generalize.

The normalized DIALOGUE QUALITY features result in a similar improvement in performance (65.9%).[3] However, unlike the efficiency and experimental variables features, the normalization of the dialogue quality features by dialogue length means that rules learned on the basis of these features are more likely to generalize to other systems.

[3] The normalized versions of the quality features did better than the raw versions.


if (cancel% > 6) then bad
if (elapsed time > 282 secs) ∧ (rejection% > 6) then bad
if (elapsed time < 90 secs) then bad
default is good

Figure 5: EFFICIENCY + NORMALIZED QUALITY rules.

Adding the efficiency and normalized quality feature sets together (EFFICIENCY + NORMALIZED QUALITY) results in a significant performance improvement (69.7%) over EFFICIENCY alone. Figure 5 shows that this results in a classifier with three rules: one based on quality alone (percentage of cancellations), one based on efficiency alone (elapsed time), and one that consists of a boolean combination of efficiency and quality features (elapsed time and percentage of rejections).

The learned ruleset says that if the percentage of cancellations is greater than 6%, classify the dialogue as bad; if the elapsed time is greater than 282 seconds and the percentage of rejections is greater than 6%, classify it as bad; if the elapsed time is less than 90 seconds, classify it as bad;[4] otherwise classify it as good. When multiple rules are applicable, RIPPER applies a conflict resolution strategy; when no rules are applicable, the default is used.
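Read as code, the learned model is just this ordered rule list plus the default. The sketch below (ours; it treats the rules as first-match-wins, which is harmless here since every rule predicts the same class) applies Figure 5's ruleset to D1's feature values from Figure 3:

```python
# The EFFICIENCY + NORMALIZED QUALITY ruleset of Figure 5.
rules = [
    (lambda d: d["cancel%"] > 6,                                 "bad"),
    (lambda d: d["elapsed time"] > 282 and d["rejection%"] > 6,  "bad"),
    (lambda d: d["elapsed time"] < 90,                           "bad"),
]

def classify(dialogue, rules=rules, default="good"):
    """Return the class of the first rule whose condition holds, else the default."""
    for condition, label in rules:
        if condition(dialogue):
            return label
    return default

# D1 (Figure 3): elapsed time 300 secs, rejection% 43, cancel% 0.
print(classify({"elapsed time": 300, "rejection%": 43, "cancel%": 0}))   # -> bad
```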

We discussed our acoustic REJECTION% results above, based on using the rejection thresholds that each system was actually run with. However, a post-hoc analysis of our experimental data showed that our systems could have rejected substantially more misrecognitions with a rejection threshold that was lower than the thresholds picked by the system designers. (Of course, changing the thresholds in this way would have also increased the number of rejections of correct ASR outputs.) Recall that the PMISRECS% experiments explored the use of different thresholds to predict misrecognitions. The best of these results is given in Figure 4 as BEST ACOUSTIC accuracy (72.6%). This classifier learned that if the predicted percentage of misrecognitions using the threshold for that feature was greater than 8%, then the dialogue was predicted to be bad; otherwise it was good. This classifier performs significantly better than the BASELINE, REJECTION% and EFFICIENCY classifiers.

Similarly, MEAN CONFIDENCE is another acoustic feature, which averages confidence scores over all the non-rejected utterances in a dialogue. Since this feature is not tuned to the applications, we did not expect it to perform as well as the best PMISRECS% feature. However, the accuracy rate for the MEAN CONFIDENCE classifier (68.4%) is not statistically different from that for the BEST ACOUSTIC classifier. Furthermore, since the feature does not rely on picking an optimal threshold, it could be expected to generalize better to new dialogue situations.

[4] This rule indicates dialogues too short for the user to have completed the task. Note that this rule could not be applied to adapting the system's behavior during the course of the dialogue.

The classifier trained on (noisy) ASR lexical output (LEXICAL) has access only to the speech recognizer's interpretation of the user's utterances. The LEXICAL classifier achieves 72% accuracy, which is significantly better than the BASELINE, REJECTION% and EFFICIENCY classifiers. Figure 6 shows the rules learned from the lexical features alone. The rules include lexical items that clearly indicate that a user is having trouble, e.g. help and cancel. They also include lexical items that identify particular tasks for particular systems, e.g. the lexical item p-m identifies a task in TOOT.

if (ASR text contains cancel) then bad
if (ASR text contains the) ∧ (ASR text contains get) ∧ (ASR text contains TIMEOUT) then bad
if (ASR text contains today) ∧ (ASR text contains on) then bad
if (ASR text contains the) ∧ (ASR text contains p-m) then bad
if (ASR text contains to) then bad
if (ASR text contains help) ∧ (ASR text contains the) ∧ (ASR text contains read) then bad
if (ASR text contains help) ∧ (ASR text contains previous) then bad
if (ASR text contains about) then bad
if (ASR text contains change-strategy) then bad
default is good

Figure 6: LEXICAL rules.

if (cancel% > 4) ∧ (system = toot) then bad
if (system turns > 26) ∧ (rejection% > 5) then bad
if (condition = mixed) ∧ (user turns > 12) then bad
if (system = toot) ∧ (user turns > 14) then bad
if (cancels > 1) ∧ (timeout% > 11) then bad
if (elapsed time < 87 secs) then bad
default is good

Figure 7: EFFICIENCY + QUALITY + EXPERIMENT VARS rules.

Note that the performance of many of the classifiers is statistically indistinguishable; e.g., the performance of the LEXICAL classifier is virtually identical to that of the BEST ACOUSTIC classifier and the EFFICIENCY + QUALITY + EXPERIMENT VARS classifier. The similarity between the accuracies for a range of classifiers suggests that the information provided by different feature sets is redundant. As discussed above, each system and experimental condition resulted in dialogues that contained lexical items that were unique to it, making it possible to identify experimental conditions from the lexical items alone.

Figure 7 shows the rules that RIPPER learned when it had access to all the features except for the lexical and acoustic features. In this case, RIPPER learns some rules that are specific to the TOOT system.

Finally, the last row of Figure 4 suggests that a classifier that has access to ALL FEATURES may do better (77.4% accuracy) than those classifiers that have access to acoustic features only (72.6%) or to lexical features only (72%). Although these differences are not statistically significant, they show a trend (p