
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 471–481, MIT, Massachusetts, USA, 9-11 October 2010. ©2010 Association for Computational Linguistics

Fusing Eye Gaze with Speech Recognition Hypotheses to Resolve Exophoric References in Situated Dialogue

Zahar Prasov and Joyce Y. Chai
Department of Computer Science and Engineering

Michigan State University
East Lansing, MI 48824, USA

{prasovza,jchai}@cse.msu.edu

Abstract

In situated dialogue humans often utter linguistic expressions that refer to extralinguistic entities in the environment. Correctly resolving these references is critical yet challenging for artificial agents partly due to their limited speech recognition and language understanding capabilities. Motivated by psycholinguistic studies demonstrating a tight link between language production and human eye gaze, we have developed approaches that integrate naturally occurring human eye gaze with speech recognition hypotheses to resolve exophoric references in situated dialogue in a virtual world. In addition to incorporating eye gaze with the best recognized spoken hypothesis, we developed an algorithm to also handle multiple hypotheses modeled as word confusion networks. Our empirical results demonstrate that incorporating eye gaze with recognition hypotheses consistently outperforms the results obtained from processing recognition hypotheses alone. Incorporating eye gaze with word confusion networks further improves performance.

1 Introduction

Given a rapid growth in virtual world applications for tutoring and training, video games and simulations, and assistive technology, enabling situated dialogue in virtual worlds has become increasingly important. Situated dialogue allows human users to navigate in a spatially rich environment and carry a conversation with artificial agents to achieve specific tasks pertinent to the environment. Different from traditional telephony-based spoken dialogue systems and multimodal conversational interfaces, situated dialogue supports immersion and mobility in a visually rich environment and encourages social and collaborative language use (Byron et al., 2005; Gorniak et al., 2006). In situated dialogue, human users often need to make linguistic references, known as exophoric referring expressions (e.g., the book to the right), to extralinguistic entities in the environment. Reliably resolving these references is critical for dialogue success. However, reference resolution remains a challenging problem, partly due to limited speech and language processing capabilities caused by poor speech recognition (ASR), ambiguous language, and insufficient pragmatic knowledge.

To address this problem, motivated by psycholinguistic studies demonstrating a close relationship between language production and eye gaze, our previous work has incorporated naturally occurring eye gaze in reference resolution (Prasov and Chai, 2008). Our findings have shown that eye gaze can partially compensate for limited language processing and domain modeling. However, this work was conducted in a setting where users only spoke to a static visual interface. In situated dialogue, human speech and eye gaze patterns are much more complex. The dynamic nature of the environment and the complexity of spatially rich tasks have a massive influence on what the user will look at and say. It is not clear to what degree prior findings can generalize to situated dialogue. Therefore, this paper explores new studies on incorporating eye gaze for exophoric reference resolution in a fully situated virtual environment, a more realistic approximation of real-world interaction. In addition to incorporating eye gaze with the best recognized spoken hypothesis, we developed an algorithm to also handle multiple hypotheses modeled as word confusion networks.

Our empirical results have demonstrated the utility of eye gaze for reference resolution in situated dialogue. Although eye gaze is much noisier given the mobility of the user, our results have shown that incorporating eye gaze with recognition hypotheses consistently outperforms the results obtained from processing recognition hypotheses alone. In addition, incorporating eye gaze with word confusion networks further improves performance. Our analysis also indicates that, although a word confusion network appears to be more complicated, the time complexity of its integration with eye gaze is well within the acceptable range for real-time applications.

2 Related Work

Prior work in reference resolution within situated dialogue has focused on using visual context to assist reference resolution during interaction. In (Kelleher and van Genabith, 2004) and (Byron et al., 2005), visual features of objects are used to model the focus of attention. This attention modeling is subsequently used to resolve references. In contrast to this line of research, here we explore the use of human eye gaze during real-time interaction to model attention and facilitate reference resolution. Eye gaze provides a richer medium for attentional information, but requires processing of a potentially noisy signal.

Eye gaze has been used to facilitate human-machine conversation and automated language processing. For example, eye gaze has been studied in embodied conversational discourse as a mechanism to gather visual information, aid in thinking, or facilitate turn taking and engagement (Nakano et al., 2003; Bickmore and Cassell, 2004; Sidner et al., 2004; Morency et al., 2006; Bee et al., 2009). Recent work has explored incorporating eye gaze into automated language understanding such as automated speech recognition (Qu and Chai, 2007; Cooke and Russell, 2008), automated vocabulary acquisition (Liu et al., 2007; Qu and Chai, 2010), and attention prediction (Qvarfordt and Zhai, 2005; Fang et al., 2009).

Motivated by previous psycholinguistic findings that eye gaze is tightly linked with language processing (Just and Carpenter, 1976; Tanenhaus et al., 1995; Meyer and Levelt, 1998; Griffin and Bock, 2000), our prior work incorporates eye gaze into reference resolution. Our results demonstrate that such use of eye gaze can potentially compensate for a conversational system's limited language processing and domain modeling capability (Prasov and Chai, 2008). However, this work was conducted in a static visual environment and evaluated only on transcribed spoken utterances. In situated dialogue, eye gaze behavior is much more complex. Here, gaze fixations may be made for the purpose of navigation or scanning the environment rather than referring to a particular object. Referring expressions can be made to objects that are not in the user's field of view, but were previously visible on the interface. Additionally, users may make egocentric spatial references (e.g. "the chair on the left") which require contextual knowledge (e.g. the user's position in the environment) in order to resolve. Therefore, the focus of our work here is on exploring these complex user behaviors in situated dialogue and examining how to combine eye gaze with ASR hypotheses for improved reference resolution.

Alternative ASR hypotheses have been used in many different ways in speech-driven systems. Particularly, in (Mangu et al., 2000) multiple lattice alignment is used for construction of word confusion networks and in (Hakkani-Tur et al., 2006) word confusion networks are used for named entity detection. In the study presented here, we apply word confusion networks (to represent ASR hypotheses) along with eye gaze to the problem of reference resolution.

3 Data Collection

In this investigation, we created a 3D virtual world (using the Irrlicht game engine [1]) to support situated dialogue. We conducted a Wizard of Oz study in which the user must collaborate with a remote artificial agent cohort (controlled by a human) to solve a treasure hunting task. The cohort is an "expert" in treasure hunting and has some knowledge regarding the locations of the treasure items, but cannot see the virtual environment. The user, immersed in the virtual world, must navigate the environment and conduct a mixed-initiative dialogue with the agent to find the hidden treasures. During the experiments, a noise-canceling microphone was used to record user speech and the Tobii 1750 display-mounted eye tracker was used to record user eye movements.

[1] http://irrlicht.sourceforge.net/

A snapshot of user interaction with the treasure hunting environment is shown in Figure 1. Here, the user's eye fixation is represented by the white dot and saccades (eye movements) are represented by white lines. The virtual world contains 10 rooms with a total of 155 unique objects that encompass 74 different object types (e.g. chair or plant).

Figure 1: Snapshot of the situated treasure hunting environment

Table 1 shows a portion of a sample dialogue between a user and the expert. Each Si represents a system utterance and each Ui represents a user utterance. We focus on resolving exophoric referring expressions, which are enclosed in brackets here. In our dataset, an exophoric referring expression is a non-pronominal noun phrase that refers to an entity in the extralinguistic environment. It may be an evoking reference that initially refers to a new object in the virtual world (e.g. an axe in utterance U2) or a subsequent reference to an entity in the virtual world which has previously been mentioned in the dialogue (e.g. an axe in utterance U3). In our study we focus on resolving exophoric referring expressions because they are tightly coupled with a user's eye gaze behavior.

From this study, we constructed a parallel spoken utterance and eye gaze corpus.

S1  Describe what you're doing.
U1  I just came out from the room that I started and i see [one long sword]
U2  [one short sword] and [an axe]
S2  Compare these objects.
U3  one of them is long and one of them is really short, and i see [an axe]

Table 1: A conversational fragment demonstrating interaction with exophoric referring expressions.

Utterance: i just came out from the room that i started and i see [one long sword]

Ht:   ... i_5210 see_5410 [one_5630 long_6080 sword_6460]
H1:   ... icy_5210 winds_5630 along_6080 so_6460 words_6800
H2:   ... icy_5210 [wine_5630] along_6080 so_6460 words_6800
...
H25:  ... icy_5210 winds_5630 [long_6080 sword_6460]
...

Table 2: Sample n-best list of recognition hypotheses

Utterances, which are separated by a long pause (500 ms) in speech, are automatically recognized using the Microsoft Speech SDK. Gaze fixations are characterized by objects in the virtual world that are fixated via a user's eye gaze. When a fixation points to multiple spatially overlapping objects, only the one in the forefront is deemed to be fixated. The data corpus was transcribed and annotated with 2204 exophoric referring expressions amongst 2052 utterances from 15 users.

4 Word Confusion Networks

For each user utterance in our dataset, an n-best list (with n = 100) of recognition hypotheses ranked in order of likelihood is produced by the Microsoft Speech SDK. One way to use the speech recognition results (as in most speech applications) is to use the top ranked recognition hypothesis. This may not be the best solution because a large amount of information is being ignored. Table 2 demonstrates this problem. Here, the number after the underscore denotes a timestamp associated with each recognized spoken word. The strings enclosed in brackets denote recognized referring expressions. In this example, the manual transcription of the original utterance is shown by Ht. In this case, the system must first identify one long sword as a referring expression and then resolve it to the correct set of entities in the virtual world. However, not until the twenty-fifth ranked recognition hypothesis H25 do we see a referring expression closest to the actual uttered referring expression. Moreover, in utterances with multiple referring expressions, there may not be a single recognition hypothesis that contains all referring expressions, but each referring expression may be contained in some recognition hypothesis. Thus, it is desirable to consider the entire n-best list of hypotheses.

To address this issue, we adopted the word confusion network (WCN): a compact representation of a word lattice or n-best list (Mangu et al., 2000). A WCN captures alternative word hypotheses and their corresponding posterior probabilities in time-ordered sets. In addition to being compact, an important feature for efficient post-processing of recognition hypotheses for real-time systems, WCNs are capable of representing more competing hypotheses than either n-best lists or word lattices. Figure 2 shows an example of a WCN for the utterance "... I see one long sword" along with a timeline (in milliseconds) depicting the eye gaze fixations to potential referent objects that correspond to the utterance. The confusion network shows competing word hypotheses along with corresponding probabilities in log scale.

Using our data set, we can show that word confusion networks contain significantly more words that can compose a referring expression than the top recognition hypothesis. The confusion network keyword error rate (KWER) is 0.192 compared to a 1-best list KWER of 0.318, where a keyword is a word that can be contained in a referring expression. The overall WERs for word confusion networks and 1-best lists are 0.315 and 0.460, respectively. The reported WCN word error rates are all oracle word error rates, reflecting the best WER that can be attained using any path in the confusion network. One more important feature of word confusion networks is that they provide time alignment for words that occur at approximately the same time interval in competing hypotheses. This is not only useful for efficient syntactic parsing, which is necessary for identifying referring expressions, but also critical for integration with time-aligned gaze streams.

5 Reference Resolution Algorithm

We have developed an algorithm that combines an n-best list of speech recognition hypotheses with dialogue, domain, and eye-gaze information to resolve exophoric referring expressions. There are three inputs to the multimodal reference resolution algorithm for each utterance: (1) an n-best list of alternative speech recognition hypotheses (n = 100 for a WCN and n = 1 for the top recognized hypothesis), (2) a list of fixated objects (by eye gaze) that temporally correspond to the spoken utterance, and (3) a set of potential referent objects. Since during the treasure hunting task people typically only speak about objects that are visible or have recently been visible on the screen, an object is considered to be a potential referent if it is present within a close proximity (in the same room) of the user while an utterance is spoken.

The multimodal reference resolution algorithm proceeds with the following four steps:

Step 1: construct word confusion network. A word confusion network is constructed out of the input n-best list of alternative recognition hypotheses with the SRI Language Modeling (SRILM) toolkit (Stolcke, 2002) using the procedure described in (Mangu et al., 2000). This procedure aligns words from the n-best list into equivalence classes. First, instances of the same word containing approximately the same starting and ending timestamps are clustered. Then, equivalence classes with common time ranges are merged. For each competing word hypothesis, its probability is computed by summing the posteriors of all utterance hypotheses containing this word. In our work, instead of using the actual posterior probability of each utterance hypothesis (which was not available), we assigned each utterance hypothesis a probability based on its position in the ranked list. Figure 2 depicts a portion of the resulting word confusion network (showing competing word hypotheses and their probabilities in log scale) constructed from the n-best list in Table 2.
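As an illustration of this step (not the SRILM or Mangu et al. implementation), the following is a minimal sketch of how a time-stamped n-best list can be collapsed into confusion slots when true posteriors are unavailable. The slot tolerance, the 1/rank pseudo-posterior weighting, and the data layout are assumptions of the sketch.

```python
from collections import defaultdict

def build_wcn(nbest, slot_tolerance_ms=100):
    """Collapse a time-stamped n-best list into word confusion network slots.

    `nbest` is a list of hypotheses, best first; each hypothesis is a list of
    (word, start_ms) pairs.  Because true posteriors are unavailable, each
    hypothesis gets a rank-based pseudo-posterior (1/rank, renormalized), and
    a word's slot probability is the sum of the pseudo-posteriors of all
    hypotheses containing it.  Words whose start times fall within
    `slot_tolerance_ms` of an existing slot join that slot, a simplification
    of the Mangu et al. (2000) alignment.
    """
    weights = [1.0 / (rank + 1) for rank in range(len(nbest))]
    total = sum(weights)
    weights = [w / total for w in weights]

    slots = []  # each slot: {"start": ms, "words": {word: prob}}
    for hyp, w in zip(nbest, weights):
        for word, start in hyp:
            slot = next((s for s in slots
                         if abs(s["start"] - start) <= slot_tolerance_ms), None)
            if slot is None:
                slot = {"start": start, "words": defaultdict(float)}
                slots.append(slot)
            slot["words"][word] += w

    slots.sort(key=lambda s: s["start"])
    return slots

# Toy two-hypothesis list (timestamps in ms, as in Table 2):
nbest = [
    [("i", 5210), ("see", 5410), ("one", 5630), ("long", 6080), ("sword", 6460)],
    [("icy", 5210), ("winds", 5630), ("along", 6080), ("so", 6460)],
]
for slot in build_wcn(nbest):
    print(slot["start"], dict(slot["words"]))
```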


Figure 2: Sample parallel speech and eye gaze data streams, including a portion of the sample WCN

Step 2: extract referring expressions from WCN. The word confusion network is syntactically parsed using a modified version of the CYK (Cocke and Schwartz, 1970; Kasami, 1965; Younger, 1967) parsing algorithm that is capable of taking a word confusion network as input rather than a single string. We call this the CYK-WCN algorithm. To do the parsing, we applied a set of grammar rules largely derived from a different domain in our previous work (Prasov and Chai, 2008). A parse chart of the sample word confusion network is shown in Table 3. Here, just as in the CYK algorithm, the chart is filled in from left to right, then bottom to top. The difference is that the chart has an added dimension for competing word hypotheses. This is demonstrated in position 15 of the WCN, where one and wine are two nouns that constitute competing words. Note that some words from the confusion network are not in the chart (e.g. winds) because they are out of vocabulary. The result of the syntactic parsing is that the parts of speech of all sub-phrases in the confusion network are identified. Next, a set of all exophoric referring expressions (i.e. non-pronominal noun phrases) found in the word confusion network is extracted. Each referring expression has a corresponding confidence score, which can be computed in many different ways. Currently, we simply take the mean of the probability scores of the expression's constituent words. The sample WCN has four such phrases (shown in bold in Table 3): wine at position 15 with length 1, one long sword at position 15 with length 3, long sword at position 16 with length 2, and sword at position 17 with length 1.

length 3, position 15:  NP → NUM Adj-NP
length 2, position 16:  Adj-NP → ADJ N, NP → Adj-NP
length 1, position 15:  (1) N → wine, NP → N; (2) NUM → one
length 1, position 16:  ADJ → long
length 1, position 17:  N → sword, NP → N

Table 3: Syntactic parsing of the word confusion network (chart cells listed by span length and WCN position)
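To make the CYK-WCN idea concrete, here is a minimal sketch of bottom-up chart parsing over a confusion network. The toy grammar, the probabilities, and the data layout are illustrative assumptions rather than the grammar rules used in the paper; the scoring by mean constituent-word probability follows the description above.

```python
# Toy grammar in CNF-like form; the paper's actual grammar rules differ.
LEXICAL = {"one": {"NUM"}, "wine": {"N"}, "long": {"ADJ"}, "sword": {"N"}}
UNARY = {"N": {"NP"}, "AdjNP": {"NP"}}          # NP -> N, NP -> Adj-NP
BINARY = {("NUM", "AdjNP"): "NP",               # NP -> NUM Adj-NP
          ("ADJ", "N"): "AdjNP"}                # Adj-NP -> ADJ N

def parse_wcn(wcn):
    """CYK-style chart parsing over a word confusion network.

    `wcn` is a list of slots, each a list of (word, prob) competing
    hypotheses.  chart[(i, l)] maps a nonterminal to the best
    (mean word probability, word list) analysis covering slots i..i+l-1.
    Returns all NP analyses, i.e. candidate referring expressions.
    """
    n = len(wcn)
    chart = {}

    def add(cell, sym, score, words):
        if sym not in cell or score > cell[sym][0]:
            cell[sym] = (score, words)
            for parent in UNARY.get(sym, ()):        # unary promotions
                add(cell, parent, score, words)

    for i, slot in enumerate(wcn):                   # length-1 spans
        cell = chart.setdefault((i, 1), {})
        for word, prob in slot:
            for pos in LEXICAL.get(word, ()):        # skip out-of-vocabulary words
                add(cell, pos, prob, [word])

    for length in range(2, n + 1):                   # longer spans
        for i in range(n - length + 1):
            cell = chart.setdefault((i, length), {})
            for split in range(1, length):
                left = chart.get((i, split), {})
                right = chart.get((i + split, length - split), {})
                for (ls, rs), parent in BINARY.items():
                    if ls in left and rs in right:
                        lscore, lwords = left[ls]
                        rscore, rwords = right[rs]
                        score = (lscore * split + rscore * (length - split)) / length
                        add(cell, parent, score, lwords + rwords)

    return [(i, l, words, score) for (i, l), cell in chart.items()
            if "NP" in cell for score, words in [cell["NP"]]]

# Three slots of the running example with illustrative probabilities:
wcn = [[("one", 0.4), ("wine", 0.3), ("winds", 0.3)],
       [("long", 0.6), ("along", 0.4)],
       [("sword", 0.5), ("words", 0.5)]]
for start, length, words, score in sorted(parse_wcn(wcn)):
    print(start, length, " ".join(words), round(score, 3))
```

On this toy input the chart yields the same four noun phrases discussed above: wine, sword, long sword, and one long sword.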

Step 3: resolve referring expressions. Each referring expression r_j is resolved to the top k potential referent objects according to the probability P(o_i|r_j), where k is determined by information from the linguistic expressions. P(o_i|r_j) is determined using the following expression:

P(o_i \mid r_j) = \frac{AS(o_i)^{\alpha} \times Compat(o_i, r_j)^{1-\alpha}}{\sum_i AS(o_i)^{\alpha} \times Compat(o_i, r_j)^{1-\alpha}}    (1)

In this equation,

• AS: Attentional salience score of a particular object o_i, which is determined based on the gaze fixation intensity of an object at the start time of referring expression r_j. The fixation intensity of an object is defined as the amount of time that the object is fixated during a predefined time window W (Prasov and Chai, 2008). As in (Prasov and Chai, 2008), we set W = [−1500..0] ms relative to the beginning of referring expression r_j.

• Compat: Compatibility score, which specifies whether the object o_i is compatible with the information specified by the referring expression r_j. Currently, the compatibility score is set to 1 if referring expression r_j and object o_i have the same object type (e.g. chair), and 0 otherwise.

• α: Importance weight, in the range [0..1], of attentional salience relative to compatibility. A high α value indicates that the attentional salience score based on eye gaze carries more weight in deciding referents, while a low α value indicates that compatibility carries more weight. In this work, we set α = 0.5 to indicate equal weighting between attentional salience and compatibility. If we do not want to integrate eye gaze in reference resolution, we can set α = 0.0. In this case, reference resolution will be purely based on compatibility between visual objects and information specified via linguistic expressions. (A sketch of this scoring computation follows this list.)
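Below is a minimal sketch of this scoring step. The window W = [−1500, 0] ms, α = 0.5, the type-match compatibility, and Equation 1 follow the text; the fixation-record format, the object naming, and the helper functions are assumptions of the sketch.

```python
ALPHA = 0.5                     # equal weighting of salience and compatibility
WINDOW_MS = (-1500, 0)          # gaze window W relative to expression onset

def attentional_salience(obj, fixations, ref_onset_ms):
    """Fixation intensity: total time `obj` is fixated within the window
    W = [-1500, 0] ms relative to the onset of the referring expression.
    Each fixation record is (object_id, start_ms, end_ms)."""
    lo = ref_onset_ms + WINDOW_MS[0]
    hi = ref_onset_ms + WINDOW_MS[1]
    total = 0.0
    for fixated_obj, start, end in fixations:
        if fixated_obj == obj:
            total += max(0.0, min(end, hi) - max(start, lo))  # overlap with W
    return total

def compatibility(obj_type, ref_type):
    """1 if the expression's object type matches the object's type, else 0."""
    return 1.0 if obj_type == ref_type else 0.0

def resolve(ref_type, ref_onset_ms, candidates, fixations, alpha=ALPHA):
    """Score every candidate object with Equation 1 and return the
    normalized distribution P(o_i | r_j); `candidates` maps id -> type."""
    scores = {}
    for obj, obj_type in candidates.items():
        sal = attentional_salience(obj, fixations, ref_onset_ms)
        comp = compatibility(obj_type, ref_type)
        scores[obj] = (sal ** alpha) * (comp ** (1 - alpha))
    z = sum(scores.values())
    return {obj: (s / z if z > 0 else 0.0) for obj, s in scores.items()}

# Two swords in the room; one was fixated shortly before "one long sword".
candidates = {"sword_1": "sword", "sword_2": "sword", "chair_3": "chair"}
fixations = [("sword_1", 4500, 5400), ("chair_3", 3000, 4000)]
print(resolve("sword", 5630, candidates, fixations))
```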

Once all probabilities are calculated, each referring expression is resolved to a set of referent objects. Finally, this results in a set of (referring expression, referent object set) pairs with confidence scores, which are determined by two components. The first component is the confidence score of the referring expression, which is explained in Step 2 of the algorithm. The second component is the probability that the referent object set is indeed the referent of this expression (which is determined by Equation 1). There are various ways to combine these two components together to form an overall confidence score for the pair. Here we simply multiply the two components. The confidence score for the pair is used in the following step to prune unlikely referring expressions.

Step 4: post-prune. The resulting set of (referring expression, referent object set) pairs is pruned to remove pairs that fall under one of the following two conditions: (1) the pair has a confidence score equal to or below a predefined threshold ε (currently, the threshold is set to 0 and thus keeps all resolved pairs) and (2) the pair temporally overlaps with a higher confidence pair. For example, in Table 3, the referring expressions one long sword and wine overlap in position 15. Finally, the resulting (referring expression, referent object set) pairs are sorted in ascending order according to their constituent referring expression timestamps.
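The pruning conditions can be sketched as follows. The pair representation and the interval-overlap test are assumptions of the sketch; the threshold ε = 0 and the two conditions follow the step description above.

```python
def post_prune(pairs, epsilon=0.0):
    """Prune (referring expression, referent set) pairs.

    Each pair is a dict with 'start', 'end' (ms), 'confidence', 'referents'.
    A pair is dropped if (1) its confidence is at or below epsilon or
    (2) it temporally overlaps a higher-confidence pair.  Survivors are
    returned in ascending order of start time.
    """
    kept = []
    # Visit pairs from most to least confident so that overlap pruning always
    # keeps the higher-confidence member of an overlapping group.
    for pair in sorted(pairs, key=lambda p: p["confidence"], reverse=True):
        if pair["confidence"] <= epsilon:
            continue
        overlaps = any(pair["start"] < k["end"] and k["start"] < pair["end"]
                       for k in kept)
        if not overlaps:
            kept.append(pair)
    return sorted(kept, key=lambda p: p["start"])

# "one long sword" and "wine" overlap at WCN position 15 (confidences illustrative).
pairs = [
    {"start": 5630, "end": 6460, "confidence": 0.50, "referents": {"sword_1"}},
    {"start": 5630, "end": 6080, "confidence": 0.30, "referents": {"wine_1"}},
]
print(post_prune(pairs))
```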

6 Experimental Results

Using our data, described in Section 3, we applied the multimodal reference resolution algorithm described in Section 5. All of the data is used to report the experimental results. Reference resolution model parameters are set based on our prior work in a different domain (Prasov and Chai, 2008). For each utterance we compare the reference resolution performance with and without the integration of eye gaze information. We also evaluate using a word confusion network compared to a 1-best list to model speech recognition hypotheses. For perspective, reference resolution with recognized speech input is compared with transcribed speech.

6.1 Evaluation Metrics

The reference resolution algorithm outputs a list of (referring expression, referent object set) pairs for each utterance. We evaluate the algorithm by comparing the generated pairs to the annotated "gold standard" pairs using F-measure. We perform the following two types of evaluation:

• Lenient Evaluation: Due to speech recognition errors, there are many cases in which the algorithm may not return a referring expression that exactly matches the gold standard referring expression. It may only match based on the object type. For example, the expressions one long sword and sword are different, but they match in terms of the intended object type. For applications in which it is critical to identify the objects referred to by the user, precisely identifying uttered referring expressions may be unnecessary. Thus, we evaluate the reference resolution algorithm with a lenient comparison of (referring expression, referent object set) pairs. In this case, two pairs are considered a match if at least the object types specified via the referring expressions match each other and the referent object sets are identical.

• Strict Evaluation: For some applications it may be important to identify exact referring expressions in addition to the objects they refer to. This is important for applications that attempt to learn a relationship between referring expressions and referenced objects. For example, in automated vocabulary acquisition, words other than object types must be identified so the system can learn to associate these words with referenced objects. Similarly, in systems that apply priming for language generation, identification of the exact referring expressions from human users could be important. Thus, we also evaluate the reference resolution algorithm with a strict comparison of (referring expression, referent object set) pairs. In this case, a referring expression from the system output needs to exactly match the corresponding expression from the gold standard. (A sketch of both matching criteria follows this list.)
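The two matching criteria can be sketched as follows, assuming each pair is represented by the expression string, the object type it evokes, and the resolved referent set; this representation and the one-to-one matching of system and gold pairs are assumptions of the sketch.

```python
def pairs_match(system_pair, gold_pair, strict):
    """Compare a (referring expression, referent set) pair against a gold pair.

    Strict: expression strings match exactly and referent sets are identical.
    Lenient: only the evoked object type and the referent sets must match.
    """
    expr_sys, type_sys, refs_sys = system_pair
    expr_gold, type_gold, refs_gold = gold_pair
    if refs_sys != refs_gold:
        return False
    return expr_sys == expr_gold if strict else type_sys == type_gold

def f_measure(system_pairs, gold_pairs, strict=False):
    """F-measure over pairs; each gold pair may be matched by one system pair."""
    matched, used = 0, set()
    for sp in system_pairs:
        for i, gp in enumerate(gold_pairs):
            if i not in used and pairs_match(sp, gp, strict):
                matched += 1
                used.add(i)
                break
    precision = matched / len(system_pairs) if system_pairs else 0.0
    recall = matched / len(gold_pairs) if gold_pairs else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# "sword" vs. gold "one long sword": lenient match, strict mismatch.
system = [("sword", "sword", frozenset({"sword_1"}))]
gold = [("one long sword", "sword", frozenset({"sword_1"}))]
print(f_measure(system, gold, strict=False), f_measure(system, gold, strict=True))
```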

6.2 Role of Eye Gaze

We evaluate the effect of incorporating eye gaze information into the reference resolution algorithm using the top best recognition hypothesis (1-best), the word confusion network (WCN), and the manual speech transcription (Transcription). Speech transcription, which contains no recognition errors, demonstrates the upper bound performance of our approach. When no gaze information is used, reference resolution solely depends on linguistic and semantic processing of referring expressions. Table 4 shows the lenient reference resolution evaluation using F-measure. This table demonstrates that lenient reference resolution is improved by incorporating eye gaze information. This effect is statistically significant in the case of transcription and 1-best (p < 0.0001 and p < 0.009, respectively) and marginal (p < 0.07) in the case of WCN.

Configuration    Without Gaze    With Gaze
Transcription    0.619           0.676
WCN              0.524           0.552
1-best           0.471           0.514

Table 4: Lenient F-measure Evaluation

Configuration    Without Gaze    With Gaze
Transcription    0.584           0.627
WCN              0.309           0.333
1-best           0.039           0.035

Table 5: Strict F-measure Evaluation

Table 5 shows the strict reference resolution evaluation using F-measure. As can be seen in the table, incorporating eye gaze information significantly (p < 0.0024) improves reference resolution performance when using transcription and marginally (p < 0.113) in the case of WCN optimized for strict evaluation. However, there is no difference for the 1-best hypotheses, which result in extremely low performance. This observation is not surprising since 1-best hypotheses are quite error prone and less likely to produce the exact expressions.


Since eye gaze can be used to direct navigation in a mobile environment as in situated dialogue, there could be situations where eye gaze does not reflect the content of the corresponding speech. In such situations, integrating eye gaze in reference resolution could be detrimental. To further understand the role of eye gaze in reference resolution, we applied our reference resolution algorithm only to utterances where speech and eye gaze are considered closely coupled (i.e., eye gaze reflects the content of speech). More specifically, following the previous work (Qu and Chai, 2010), we define a closely coupled utterance as one in which at least one noun or adjective describes an object that has been fixated by the corresponding gaze stream.
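Under that definition, a closely coupled utterance can be detected with a check like the sketch below; the part-of-speech tags and the domain-model lookup that maps an object to the words describing it are hypothetical.

```python
def is_closely_coupled(tagged_words, fixated_objects, names_for):
    """True if at least one noun or adjective in the utterance describes an
    object fixated by the accompanying gaze stream.

    `tagged_words` is a list of (word, pos) pairs, `fixated_objects` the set
    of object ids fixated during the utterance, and `names_for(obj)` a
    hypothetical domain-model lookup returning the words (type names,
    attributes) that describe an object.
    """
    content_words = {w.lower() for w, pos in tagged_words if pos in ("NOUN", "ADJ")}
    return any(content_words & {n.lower() for n in names_for(obj)}
               for obj in fixated_objects)

# Usage with a toy domain model:
def names_for(obj):
    return {"sword_1": {"sword", "long"}, "chair_3": {"chair"}}.get(obj, set())

tagged = [("i", "PRON"), ("see", "VERB"), ("one", "NUM"),
          ("long", "ADJ"), ("sword", "NOUN")]
print(is_closely_coupled(tagged, {"sword_1"}, names_for))   # True
print(is_closely_coupled(tagged, {"chair_3"}, names_for))   # False
```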

Table 6 and Table 7 show the performance based on closely coupled utterances using lenient and strict evaluation, respectively. In the lenient evaluation, reference resolution performance is significantly improved for all input configurations when eye gaze information is incorporated (p < 0.0001 for transcription, p < 0.015 for WCN, and p < 0.0022 for 1-best). In each case the closely coupled utterances achieve higher performance than the entire set of utterances evaluated in Table 4. Aside from the 1-best case, the same is true when using strict evaluation (p < 0.0006 for transcription and p < 0.046 for WCN optimized for strict evaluation). This observation indicates that in situated dialogue, some mechanism to predict whether a gaze stream is closely coupled with the corresponding speech content can be beneficial in further improving reference resolution performance.

Configuration    Without Gaze    With Gaze
Transcription    0.616           0.700
WCN              0.523           0.570
1-best           0.473           0.537

Table 6: Lenient F-measure Evaluation for Closely Coupled Utterances

6.3 Role of Word Confusion Network

The effect of incorporating eye gaze with WCNs rather than 1-best recognition hypotheses into reference resolution can also be seen in Tables 4 and 5.

Configuration    Without Gaze    With Gaze
Transcription    0.579           0.644
WCN              0.307           0.345
1-best           0.045           0.038

Table 7: Strict F-measure Evaluation for Closely Coupled Utterances

Table 4 shows a significant improvement when using WCNs rather than 1-best hypotheses for both with (p < 0.015) and without (p < 0.0012) eye gaze configurations. Similarly, Table 5 shows a significant improvement in strict evaluation when using WCNs rather than 1-best hypotheses for both with (p < 0.0001) and without (p < 0.0001) eye gaze configurations. These results indicate that using word confusion networks improves both lenient and strict reference resolution. This observation is not surprising since identifying correct linguistic expressions will enable better search for semantically matching referent objects.

Although WCNs lead to better performance, utilizing WCNs is more computationally expensive compared to 1-best recognition hypotheses. Nevertheless, in practice, WCN depth, which specifies the maximum number of competing word hypotheses in any position of the word confusion network, can be limited to a certain value |d|. For example, in Figure 2 the depth of the shown WCN is 8 (there are 8 competing word hypotheses in position 17 of the WCN). The WCN depth can be limited by pruning word alternatives with low probabilities until, at most, the top |d| words remain in each position of the WCN. It is interesting to observe how limiting WCN depth can affect reference resolution performance. Figure 3 demonstrates this observation. In this figure the resolution performance (in terms of lenient evaluation) for WCNs of varying depth is shown as dashed lines for with and without eye gaze configurations. As a reference point, the performance when utilizing 1-best recognition hypotheses is shown as solid lines. It can be seen that as the depth increases, the performance also increases until the depth reaches 8. After that, there is no performance improvement.

Figure 3: Lenient F-measure at each WCN Depth
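Depth limiting itself amounts to keeping the top |d| word alternatives in each position; a minimal sketch, assuming the slot-list representation used in the earlier sketches:

```python
def limit_depth(wcn, d=8):
    """Keep at most the d most probable competing word hypotheses per slot.
    Each slot is a list of (word, prob) pairs."""
    return [sorted(slot, key=lambda wp: wp[1], reverse=True)[:d] for slot in wcn]

# One slot with four alternatives, pruned to depth 2:
wcn = [[("sword", 0.4), ("words", 0.3), ("swords", 0.2), ("so", 0.1)]]
print(limit_depth(wcn, d=2))
```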

7 Discussion

In Section 6.2 we have shown that incorporating eye gaze information improves reference resolution performance. Eye-gaze information is particularly helpful for resolving referring expressions that are ambiguous from the perspective of the artificial agent. Consider a scenario where the user utters a referring expression that has an equivalent semantic compatibility with multiple potential referent objects. For example, in a room with multiple books, the user utters "the open book to the right", but only the phrase "the book" is recognized by the ASR. If a particular book is fixated during interaction, there is a high probability that it is indeed being referred to by the user. Without eye gaze information, the semantic compatibility alone could be insufficient to resolve this referring expression. Thus, when eye gaze information is incorporated, the main source of performance improvement comes from better identification of potential referent objects.

In Section 6.3 we have shown that incorporating multiple speech recognition hypotheses in the form of a word confusion network further improves reference resolution performance. This is especially true when exact referring expression identification is required (F-measure of 0.309 from WCNs compared to F-measure of 0.039 from 1-best hypotheses). Using a WCN improves identification of low-probability referring expressions. Consider a scenario where the top recognition hypothesis of an utterance contains no referring expressions or an incorrect referring expression that has no semantically compatible potential referent objects. If a referring expression with a high compatibility value to some potential referent object is present in a lower probability hypothesis, this referring expression can only be identified when a WCN rather than a 1-best hypothesis is utilized. Thus, when word confusion networks are incorporated, the main source of performance improvement comes from better referring expression identification.

7.1 Computational Complexity

One potential concern of using word confusion networks rather than 1-best hypotheses is that they are more computationally expensive to process. The asymptotic computational complexity for resolving the referring expressions using the algorithm presented in this work with a WCN is the summation of three components: (1) O(|G| · |d|^2 · |w|^3) for confusion network construction and parsing, (2) O(|r| · |O| · log(|O|)) for reference resolution, and (3) O(|r|^2) for selection of (referring expression, referent object set) pairs. Here, |w| is the number of words in the input speech signal (or, more precisely, the number of words in the longest ASR hypothesis for a given utterance); |G| is the size of the parsing grammar; |d| is the depth of the constructed word confusion network; |O| is the number of potential referent objects for each utterance; and |r| is the number of referring expressions that are extracted from the word confusion network.

The complexity is dominated by the word confusion network construction and parsing. Also, both the number of words in an input utterance ASR hypothesis |w| and the number of referring expressions in a word confusion network |r| are dependent on utterance length. In our study, interactive dialogue is encouraged and, thus, utterances are typically short, with a mean length of 6.41 words and standard deviation of 4.35 words. The longest utterance in our data set has 31 words. WCN depth |d| has a mean of 10.1, a standard deviation of 8.1, and a maximum of 89 words. In practice, as shown in Section 6.3, limiting |d| to 8 words achieves comparable reference resolution results as using a full word confusion network.

To demonstrate the applicability of our reference resolution algorithm for real-time processing, we applied it on the data corpus presented in Section 3. This corpus contains utterances with a mean input time of 2927.5 ms and standard deviation of 1903.8 ms. On a 2.4 GHz AMD Athlon(tm) 64 X2 Dual Core Processor, the runtimes resulted in a real time factor of 0.0153 on average. Thus, on average, an utterance from this corpus can be processed in just under 45 ms, which is well within the range of acceptable real-time performance.
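As a quick check of that figure, the reported real-time factor and mean utterance duration combine as follows:

```latex
% real-time factor times mean utterance duration gives mean processing time
0.0153 \times 2927.5\,\text{ms} \approx 44.8\,\text{ms}
```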

7.2 Error Analysis

As can be seen in Section 6, even when using transcribed data, reference resolution performance still has room for improvement (achieving the highest lenient F-measure of 0.700 when eye gaze is utilized for resolving closely coupled utterances). In this section, we elaborate on the potential error sources. Specifically, we discuss two types of error: (1) a referring expression is incorrectly recognized or (2) a recognized referring expression is not resolved to a correct referent object set.

Given transcribed data, which simulates perfectly recognized utterances, all referring expression recognition errors arise due to incorrect language processing. Most of these errors occur because an incorrect part of speech (POS) tag is assigned to a word, or an out-of-vocabulary (OOV) word is encountered, or the parsing grammar has insufficient coverage. A particularly interesting parsing problem occurs due to the nature of spoken language. Since punctuation is sometimes unavailable, given an utterance with several consecutive nouns, it is unclear which of these nouns should be treated as head nouns and which should be treated as noun modifiers. For example, in the utterance "there is a desk lamp table and two chairs" it is unclear if the expression a desk lamp should be parsed as a single phrase or as a list of (two) phrases, a desk and lamp. Thus, some timing information should be used for disambiguation.

Object set identification errors are more prevalent than referring expression recognition errors. The majority of these errors occur because a referring expression is ambiguous from the perspective of the conversational system and there is not enough information to choose amongst multiple potential referent objects due to limited speech recognition and domain modeling. One reason for this is that a referring expression may be resolved to an incorrect number of referent objects. Another reason is that a pertinent object attribute or a distinguishing spatial relationship between objects specified by the user cannot be established by the system. For example, during the utterance "I see a vase left of the table" there are two vases visible on the screen, creating an ambiguity if the phrase "left of" is not processed correctly. This is caused by an inadequate representation of spatial relationships and processing of spatial language. One more reason for potential ambiguity is the lack of pragmatic knowledge that can support adequate inference. For example, when the user refers to two sofa objects using the phrase "an armchair and a sofa", the system lacks pragmatic knowledge to indicate that "armchair" refers to the smaller of the two objects. Some of these errors can be avoided when eye gaze information is available to the system. However, due to the noisy nature of eye gaze data, many such referring expressions remain ambiguous even when eye gaze information is considered.

8 Conclusion

In this work, we have examined the utility of eye gaze and word confusion networks for reference resolution in situated dialogue within a virtual world. Our empirical results indicate that incorporating eye gaze information with recognition hypotheses is beneficial for the reference resolution task compared to only using recognition hypotheses. Furthermore, using a word confusion network rather than the top best recognition hypothesis further improves reference resolution performance. Our findings also demonstrate that the processing speed necessary to integrate word confusion networks with eye gaze information is well within the acceptable range for real-time applications.

Acknowledgments

This work was supported by IIS-0347548 and IIS-0535112 from the National Science Foundation. We would like to thank anonymous reviewers for their valuable comments and suggestions.

References

N. Bee, E. Andre, and S. Tober. 2009. Breaking the ice in human-agent communication: Eye-gaze based initiation of contact with an embodied conversational agent. In Proceedings of the 9th International Conference on Intelligent Virtual Agents (IVA'09), pages 229–242. Springer.

T. Bickmore and J. Cassell. 2004. Social Dialogue with Embodied Conversational Agents, chapter Natural, Intelligent and Effective Interaction with Multimodal Dialogue Systems. Kluwer Academic.

D. K. Byron, T. Mampilly, V. Sharma, and T. Xu. 2005. Utilizing visual attention for cross-modal coreference interpretation. In Springer Lecture Notes in Computer Science: Proceedings of CONTEXT-05, pages 83–96.

N. J. Cooke and M. Russell. 2008. Gaze-contingent automatic speech recognition. IET Signal Processing, 2(4):369–380, December.

J. Cocke and J. T. Schwartz. 1970. Programming languages and their compilers: Preliminary notes. Technical report, Courant Institute of Mathematical Sciences.

R. Fang, J. Y. Chai, and F. Ferreira. 2009. Between linguistic attention and gaze fixations in multimodal conversational interfaces. In The 11th International Conference on Multimodal Interfaces (ICMI).

P. Gorniak, J. Orkin, and D. Roy. 2006. Speech, space and purpose: Situated language understanding in computer games. In Twenty-eighth Annual Meeting of the Cognitive Science Society Workshop on Computer Games.

Z. M. Griffin and K. Bock. 2000. What the eyes say about speaking. In Psychological Science, volume 11, pages 274–279.

D. Hakkani-Tur, F. Bechet, G. Riccardi, and G. Tur. 2006. Beyond ASR 1-best: Using word confusion networks in spoken language understanding. Computer Speech and Language, 20(4):495–514.

M. A. Just and P. A. Carpenter. 1976. Eye fixations and cognitive processes. In Cognitive Psychology, volume 8, pages 441–480.

T. Kasami. 1965. An efficient recognition and syntax-analysis algorithm for context-free languages. Scientific report AFCRL-65-758, Air Force Cambridge Research Laboratory, Bedford, Massachusetts.

J. Kelleher and J. van Genabith. 2004. Visual salience and reference resolution in simulated 3-d environments. Artificial Intelligence Review, 21(3).

Y. Liu, J. Y. Chai, and R. Jin. 2007. Automated vocabulary acquisition and interpretation in multimodal conversational systems. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL).

L. Mangu, E. Brill, and A. Stolcke. 2000. Finding consensus in speech recognition: word error minimization and other applications of confusion networks. Computer Speech and Language, 14(4):373–400.

A. S. Meyer and W. J. M. Levelt. 1998. Viewing and naming objects: Eye movements during noun phrase production. In Cognition, volume 66, pages B25–B33.

L.-P. Morency, C. M. Christoudias, and T. Darrell. 2006. Recognizing gaze aversion gestures in embodied conversational discourse. In International Conference on Multimodal Interfaces (ICMI).

Y. I. Nakano, G. Reinstein, T. Stocky, and J. Cassell. 2003. Towards a model of face-to-face grounding. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL'03), pages 553–561.

Z. Prasov and J. Y. Chai. 2008. What's in a gaze? The role of eye-gaze in reference resolution in multimodal conversational interfaces. In Proceedings of the 13th International Conference on Intelligent User Interfaces (IUI), pages 20–29.

S. Qu and J. Y. Chai. 2007. An exploration of eye gaze in spoken language processing for multimodal conversational interfaces. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

S. Qu and J. Y. Chai. 2010. Context-based word acquisition for situated dialogue in a virtual world. Journal of Artificial Intelligence Research, 37:347–377, March.

P. Qvarfordt and S. Zhai. 2005. Conversing with the user based on eye-gaze patterns. In Proceedings of the Conference on Human Factors in Computing Systems. ACM.

C. L. Sidner, C. D. Kidd, C. Lee, and N. Lesh. 2004. Where to look: A study of human-robot engagement. In Proceedings of the 9th International Conference on Intelligent User Interfaces (IUI'04), pages 78–84. ACM Press.

A. Stolcke. 2002. SRILM - an extensible language modeling toolkit. In International Conference on Spoken Language Processing.

M. K. Tanenhaus, M. Spivey-Knowlton, E. Eberhard, and J. Sedivy. 1995. Integration of visual and linguistic information during spoken language comprehension. In Science, volume 268, pages 1632–1634.

D. H. Younger. 1967. Recognition and parsing of context-free languages in time n^3. Information and Control, 10(2):189–208.
