Dialogue Context-Based Speech Recognition using User Simulation

Ioannis Konstas

Master of Science

Artificial Intelligence

School of Informatics

University of Edinburgh

August 2008


©Copyright 2008

by

Ioannis Konstas


Declaration

I hereby declare that this thesis is of my own composition, and that it contains no material previously submitted for the award of any other degree. The work reported in this thesis has been executed by myself, except where due acknowledgement is made in the text.

Ioannis Konstas


Abstract

Speech recognisers do not usually perform well in spoken dialogue systems due to their lack of linguistic knowledge and thus their inability to cope with the context of a dialogue in the way that humans do. This study, in the spirit of several previous efforts, builds a post-processing system that acts as an intermediate filter between the speech recogniser and the dialogue system in an attempt to improve the accuracy of the former. To achieve this, it trains a Memory-Based Classifier using features extracted from recognition hypotheses, acoustic information, n-best list distributional properties and, most importantly, a User Simulation model trained on dialogue data that simulates the way people predict the next dialogue move based on the discourse history. The system was trained on dialogue logs extracted using the TownInfo dialogue system and consists of a two-tier architecture: a classifier that ascribes a confidence label to each hypothesis of the speech recogniser, and a re-ranker that extracts the hypothesis with the highest confidence label out of the n-best list. Overall the system exhibited a relative reduction in Word Error Rate (WER) of 5.13% and a relative increase in Dialogue Move Accuracy (DMA) of 4.22% compared to always selecting the topmost hypothesis (the baseline), thus capturing 44.06% of the possible WER improvement on this data and 61.55% of the possible DMA improvement. This validates the main hypothesis of this thesis, namely that User Simulation can effectively boost the speech recogniser's performance. Future work involves using a more elaborate semantic parser for the labelling of each hypothesis and the evaluation of the system, as well as the integration of the system into a real dialogue system such as the TownInfo system.


Acknowledgements

I wish to warmly thank my supervisor Oliver Lemon for his constant guidance, support and the time he spent on the completion of this project. I also wish to thank Kalliroi Georgila, Xingun Liu, Helen Hastie and Sofia Morfopoulou for their co-operation and helpful advice.


Contents

Declaration
Abstract
Acknowledgements
Contents
Chapter 1 - Introduction
1.1 Overview
1.2 Related Work
1.2.1 Topmost hypothesis classification
1.2.2 Re-ranking of n-best lists
1.3 User Simulation
1.4 TiMBL – Memory Based Learning
1.5 Evaluation Metrics
1.5.1 Classifier Metrics (Precision, Recall, F-Measure and Accuracy)
1.5.2 Re-ranker Metrics (Word Error Rate, Dialogue Move Accuracy, Sentence Accuracy)
1.6 The TownInfo Dialogue System
Chapter 2 - Methodology
2.1 The Edinburgh TownInfo Corpus
2.2 Automatic Labelling
2.3 Features
2.4 System Architecture
2.5 Experiments
2.6 Baseline and Oracle
Chapter 3 - Results
3.1 First Layer – Classifier Experiments
3.2 Second Layer – Re-ranker Experiments
3.3 Significance tests: McNemar's test & Wilcoxon test
Chapter 4 - Experiment with high-level Features
Chapter 5 - Discussion and Conclusions
5.1 Automatic Labelling
5.2 Features
5.3 Classifier
5.4 Results
5.5 Future work
5.6 Conclusions
References


CHAPTER 1

Introduction

Speech recognisers, being essential modules in spoken dialogue systems, do not alone provide adequate performance for a particular system to be robust and intuitive enough to use. The most common reasons that account for erroneous recognitions of the user's utterances are the ASR module's lack of linguistic knowledge and its inability to perform well in noisy environments. In comparison, there is evidence that the human speech recognition subsystem is usually able to predict upcoming words if it is placed in a sufficiently constraining linguistic context (so-called 'high-Cloze' contexts), as in the case of domain-specific dialogues, even in situations where the level of surrounding distracting noise is high (Pickering et al., 2007).

This very interesting behaviour of the human brain leads us to believe that we can simulate, in a way, its ability to correctly disambiguate possible misrecognitions of what the user of a dialogue system intended to say. The most obvious way would be to supply such linguistic context, such as the history of the dialogue so far between the user and the system, to a post-processing system in an effort to boost the speech recogniser module's accuracy.

Let us consider the following psycholinguistic theory of Pickering et al. (2007), who argue that people go a step further and use their language production subsystem in order to make predictions during comprehension of their co-speaker in a dialogue: “if B overtly imitates A, then A's comprehension of B's utterance is facilitated by A's memory for A's previous utterance.” With this in mind let us make the following analogy:

Let us consider that A is the speaker-system in a dialogue session and B the human user of this system. The user is said to imitate the system both in terms of word choice and in terms of the semantics of the messages to be conveyed, since he or she is asked to fulfil a certain scenario in a rather limited domain of interest, for example book a train ticket, find a hotel/bar/restaurant, etc. Then the system (A) can understand better what the user (B) said, because it “remembers”, i.e. it has stored, its previous actions and turns.

Taking into account this interpretation, it follows that it would be rather useful if we could model the dialogues between the user and the system and thus have the system “remember” what it had said and done before, in order to better understand what the user is really saying. The idea behind this theory can be approached computationally by what is called User Simulation, which is the main theme around which this study revolves.

In an attempt to combine this theory with the speech recognition module of a dialogue system, we consider that the latter produces several hypotheses for a given user utterance as its output, namely an n-best list. The reasoning above leads to the conclusion that the topmost hypothesis in this list might not be correct, either in terms of word recognition or in terms of semantic interpretation. Instead, the correct hypothesis might exist somewhere lower in the n-best list.

Note that in dialogue systems we are usually interested in the semantic representation of an utterance, since it is usually sufficient for the system merely to understand what the user wants to say, rather than the exact way he or she said it. Word alignment, of course, may account for the level of confidence that the semantic interpretation is truly the one the user meant to convey.

This study attempts to build a post-processing system that takes as its input the speech recogniser's n-best lists of the user's utterances in dialogues and re-ranks them in an effort to extract the correct hypotheses, both in terms of semantic representation and in terms of word alignment. In order to achieve this, it trains a Memory-Based Classifier using features extracted from recognition hypotheses, acoustic information, n-best list distributional properties and a User Simulation model trained on dialogue data.


1.1 Overview

The chapters of this thesis are organised as follows:

Chapter 1: Introduction to the problem of context-sensitive speech recognition and previous work in this area.

Chapter 2: Detailed description of the methodology adopted to implement and train the system discussed in the study.

Chapter 3: Results of the experiments conducted in order to train the system.

Chapter 4: Additional experiment with a minimal number of high-level features.

Chapter 5: Discussion of the methodology and the results, future work and conclusions.

1.2 Related Work

The notion of incorporating explicit knowledge to evaluate and refine the ASR hypotheses, in the context of enhancing the dialogue strategy of a system as assumed above, is not new. Several studies have attempted to boost the performance of the speech recogniser following either of two approaches to essentially the same problem: either make decisions on the topmost hypothesis of the ASR's output, or classify and then re-rank the n-best lists. All the experiments in these studies were conducted on similar input data, i.e. transcripts and wave files of user utterances and logs of dialogues. However, the systems they were extracted from differ in their target domain, and the size of the corpora ranged from a few hundred utterances (Gabsdil and Lemon, 2004) to several thousand utterances (Litman et al., 2000).

1.2.1 Topmost hypothesis classification

Litman et al. (2000) use prosodic cues extracted directly from speech waveforms, rather than confidence scores of the acoustic model incorporated in the speech recogniser, in order to predict misrecognised user utterances in their corpus. In their experiments they show that utterances containing word errors have certain prosodic features. What is more, even simple acoustic features, such as the energy of the captured waveforms and their duration, provide good separation between correctly recognised and misrecognised utterances. On this basis they maintain that these features can account for more accuracy than standard confidence scores. In order to distinguish between correct and incorrect recognitions they train a classifier using RIPPER (Cohen, 1996), a rule-learning model that builds a set of binary rules based on the entropy of each feature on a given training set. Their corpus consists of 544 dialogues between humans and three different dialogue systems: voice dialling and messaging (Kamm et al., 1998), accessing email (Walker et al., 1998) and accessing online train schedules (Litman and Pan, 2000). Their best configuration scores 77.4% accuracy, a 48.8% relative increase compared to their baseline.

Walker et al. (2000) use a combination of features from the speech recogniser, natural language understanding and dialogue history to attribute different classes to the topmost hypothesis, namely: correct, partially correct and misrecognised. Like Litman et al. (2000), they also use RIPPER as their classifier. They train their system on 11,787 spoken utterances collected in AT&T's How May I Help You corpus (Gorin et al., 1997; Boyce and Gorin, 1996), consisting of dialogues over the telephone concerning subscription-related scenarios. Their system achieves 86% accuracy, an improvement of 23% over the baseline.

1.2.2 Re-ranking of n-best lists

On the other hand, Chotimongkol and Rudnicky (2001), Gabsdil and Lemon (2004), Jonson (2006) and Andersson (2006) move a step further than simply classifying the topmost hypothesis and perform re-ranking of the n-best lists using prosodic and speech recognition features as well as dialogue context and task-related attributes.

Chotimongkol and Rudnicky (2001) train a linear regression model on acoustic, syntactic and semantic features in order to reorder the n-best hypotheses for a single utterance. Each hypothesis in the list is ascribed a correctness score, namely its relative word error rate within the list, and the one that scores lowest is chosen instead of the top-1 result. The corpus used is extracted from the Communicator system (Rudnicky et al., 2000), regarding travel planning, and consists of 35,766 utterances for which the 25-best lists are taken into consideration. The re-ranker achieves 11.97% WER, a 3.96% relative reduction compared to the baseline.

Gabsdil and Lemon (2004) similarly reorder n-best lists by combining acoustic and pragmatic features. Their study shows that the dialogue features, such as the previous system question and whether a hypothesis is the correct answer to a particular question, contributed more than the other, more common attributes. Each hypothesis in the n-best list is automatically labelled as either in-grammar, out-of-grammar (oog) (WER ≤ 50), out-of-grammar (oog) (WER > 50) or crosstalk. This labelling is based on a combination of the semantic parse of each hypothesis and its alignment with the true transcript. Their approach to the problem is in two steps: first they use TiMBL (Daelemans et al., 2007), a memory-based classifier, to predict the correct label of each hypothesis in the n-best list, and then they perform a simple re-ranking by choosing the hypothesis with the most significant label (if it exists in the list) according to the order: in-grammar < oog (WER ≤ 50) < oog (WER > 50) < crosstalk. The corpus used was extracted with the WITAS dialogue system (Lemon, 2004) and consisted of interactions with a simulated aerial vehicle: a total of 30 dialogues with 303 utterances, the 10-best lists of which were taken into consideration. Their system performed 25% relatively better than the baseline, with a weighted f-score of 86.38%.

Jonson (2006) classifies recognition hypotheses with quite similar labels denoting acceptance, clarification, confirmation and rejection. These labels have been automatically crafted in a manner equivalent to the Gabsdil and Lemon (2004) study and correspond to varying levels of confidence, being essentially potential directives to the dialogue manager. Apart from the common features, Jonson includes close-context features, e.g. previous dialogue moves and slot fulfilment, as well as the dialogue history. She also includes attributes that account for the whole n-best list, e.g. the standard deviation of confidence scores. Jonson (2006) also uses TiMBL in order to classify each hypothesis of the n-best list into one of the five labels incorporated, and uses the same re-ranking algorithm as Gabsdil and Lemon (2004) to choose the top-1 hypothesis. Her system was trained on the GoDiS corpus, comprising dialogues with a virtual jukebox and consisting of 486 user utterances, the 40-best lists of which were taken into account. Her optimal set-up scored 83% DMA and 58% SA (see section 1.5.2 for an explanation of these measures), a relative increase of 56.60% in DMA and 20.83% in SA compared to the baseline.

Andersson (2006) uses similar acoustic, list and dialogue features but adheres to a simpler binary annotation characterising whether each hypothesis of the ASR n-best list is close enough ('B') or not ('N') to the original transcript. For the classification he trains maximum entropy models and performs a simple re-ranking by choosing the first hypothesis, if one exists, that belongs to the 'B' category. His corpus is taken from the Edinburgh TownInfo system, containing dialogues about booking hotels/bars/restaurants (see section 2.1), and consists of 191 dialogues or 2904 utterances, taking into consideration on average the 7-best lists. He scores an absolute error reduction of 4.1%, which translates to a relative improvement of 44.5% compared to the baseline.

Gruenstein (2008) follows a somewhat different approach to the re-ranking problem by considering the prediction of the system's response, rather than the user's utterance, in the context of a multi-modal dialogue system. Along with the common recognition and distributional features of the hypotheses in the n-best lists, he takes into account features that deal with the response of the system to the n-best list produced by the speech recogniser. Similarly to Andersson (2006), he labels each hypothesis as 'acceptable' or 'unacceptable' depending on the semantic match with the true transcript. He then trains an SVM to predict either of the two classes and fits a linear regression model to the classification output of the SVM in order to produce a confidence score between -1 and +1, with -1 being totally 'unacceptable' and +1 totally 'acceptable'. Re-ranking is then performed by setting a threshold in the interval [-1,+1] and choosing the hypothesis that exceeds this threshold. Unlike the previous studies, this method is able to output a numerical confidence score rather than a discrete label. His system is trained on data taken from the City Browser multi-modal system (Gruenstein et al., 2006; Gruenstein and Seneff, 2007), resulting in 1912 utterances. His system scored an absolute F1-measure of 72%, a 16% improvement compared to the baseline.

1.3 User Simulation

What makes this study different from the previous work in the area of post-processing of ASR hypotheses is the incorporation of a User Simulation output as an additional feature in my own system. Undeniably, the history of the discourse between a user and a dialogue system plays an important role as to what might be expected from the user next. As a result, most of the studies mentioned in the previous section make various efforts to capture it by including relevant features directly in their classifiers. Although this may make for simplicity and runtime performance, they still fail to some extent to adopt a more systematic way of coping with user behaviour over the course of the dialogue.

A User Simulation model fills this gap: it is trained on small corpora of dialogue data in order to simulate real user behaviour. In my system I have used a User Simulation model created by Georgila et al. (2006) based on n-grams of dialogue moves. Essentially, it treats a dialogue as a sequence of lists of consecutive user and system turns in a certain high-level semantic representation, i.e. {<Speech Act>, <Task>} pairs (see section 2.1 for a complete explanation of this semantic representation). It takes as input the n - 1 most recent lists of {<Speech Act>, <Task>} pairs in the dialogue history, and uses the statistics of n-grams in the training set to decide on the next user action. If no n-grams match the current history, the model backs off to n-grams of lower order.
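To make this concrete, the following is a minimal sketch of such an n-gram look-up with back-off over dialogue moves, assuming the n-gram counts have already been loaded from flat files into maps. The class and method names are my own illustrations, not those of the actual model; the real model also smooths its probabilities (with absolute discounting, see section 2.4) rather than using raw relative frequencies as here.

import java.util.*;

/** Minimal sketch of an n-gram user-simulation look-up with back-off.
 *  Dialogue moves are assumed to be strings such as "provide_info:food_type".
 *  All names here are illustrative, not from the actual implementation. */
public class NgramUserSimulation {

    // counts.get(order) maps a key (history of order-1 moves plus the candidate move) to its count
    private final Map<Integer, Map<List<String>, Integer>> counts;
    private final int maxOrder;

    public NgramUserSimulation(Map<Integer, Map<List<String>, Integer>> counts, int maxOrder) {
        this.counts = counts;
        this.maxOrder = maxOrder;
    }

    /** P(nextMove | history), backing off to lower-order n-grams when the
     *  current history has never been seen in the training data. */
    public double score(List<String> history, String nextMove) {
        for (int order = maxOrder; order >= 1; order--) {
            List<String> ctx = history.subList(Math.max(0, history.size() - (order - 1)), history.size());
            Map<List<String>, Integer> table = counts.get(order);
            if (table == null) continue;

            int ctxCount = 0, ngramCount = 0;
            for (Map.Entry<List<String>, Integer> e : table.entrySet()) {
                List<String> key = e.getKey();                        // history + candidate move
                if (key.subList(0, key.size() - 1).equals(ctx)) {
                    ctxCount += e.getValue();
                    if (key.get(key.size() - 1).equals(nextMove)) ngramCount += e.getValue();
                }
            }
            if (ctxCount > 0) return (double) ngramCount / ctxCount;  // matching context found
            // otherwise back off to a shorter history
        }
        return 0.0;  // the move was never observed at all
    }
}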

The benefit of using n-gram models to simulate user actions is that they are fully probabilistic and fast to train, even on large corpora. A common drawback of n-grams is that they are quite local in their predictions. In other words, given a history of n - 1 dialogue turns, the prediction of the n-th user turn may be too dependent on the immediately preceding ones and thus might not make much sense in the more global context of the dialogue.


The main hypothesis of this study is that by using the User Simulation model's prediction of the next user dialogue move as a feature in my system, I shall effectively increase the performance of the speech recogniser module.

1.4 TiMBL – Memory Based Learning

In this study I chose TiMBL 6.1 (Daelemans et al., 2007) as the main model for the classification of the hypotheses in the n-best list. TiMBL is well-established, open-source and efficient C++ software, already used with considerable success by Gabsdil and Lemon (2004) and Jonson (2006).

Memory-Based Learning (MBL) is an elegantly simple and robust machine learning method which has been applied to a multitude of tasks in Natural Language Processing (NLP). MBL descends directly from the plain k-Nearest Neighbour (k-NN) method of classification, which is still considered a quick yet powerful pattern classification algorithm.

Though plain k-NN performs well in various applications, it is notoriously inefficient at runtime, since each test vector needs to be compared to all the training data. As a result, since classification speed is a critical issue in any realistic application of MBL, non-trivial data structures and speed-up optimisations have been employed in TiMBL. Typically, training data are compressed and represented in a decision-tree structure.

In general, MBL is founded on the hypothesis that “performance in cognitive tasks is based on reasoning on the basis of similarity of new situations to stored representations of earlier experiences, rather than on the application of mental rules abstracted from earlier experiences” (Daelemans et al., 2007). Like every common machine learning method, TiMBL is divided into two parts, namely the learning component, which is memory-based, and the performance component, which is similarity-based. The learning component of MBL merely involves adding training instances to memory. This process is sometimes referred to as “lazy”, since storing into memory takes place without any form of abstraction or intermediate representation. An instance consists of a fixed-length vector of n feature-value pairs, plus the class that this particular vector belongs to.

In the performance component of an MBL system, the training vectors are used to classify a previously unseen test datum. The similarity between the new instance X and all training vectors Y in memory is computed using some distance metric Δ(X, Y). The extrapolation is done by assigning the most frequent class within the subset of most similar examples found, the k nearest neighbours, as the class of the new test datum. If there is a tie among classes, a tie-breaking algorithm is applied.

In order to compute the similarity between a test datum and each training vector, I chose to use IB1 with information-theoretic feature weighting (among the other MBL implementations found in the TiMBL library). The IB1 algorithm calculates this similarity in terms of weighted overlap: the total difference between two patterns is the sum of the relevance weights of those features which are not equal. The class for the test datum is decided on the basis of the least distant item(s) in memory.
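As a minimal sketch of this weighted-overlap distance and the 1-nearest-neighbour decision (assuming symbolic feature values and pre-computed relevance weights; the names are illustrative, not TiMBL's API):

import java.util.*;

/** Sketch of IB1 with weighted overlap: the distance is the sum of the
 *  relevance weights of the features on which two instances disagree. */
public class WeightedOverlapKnn {

    /** Distance between two instances given per-feature relevance weights. */
    static double distance(String[] x, String[] y, double[] weights) {
        double d = 0.0;
        for (int i = 0; i < x.length; i++) {
            if (!x[i].equals(y[i])) d += weights[i];   // a mismatch costs the feature's weight
        }
        return d;
    }

    /** 1-NN: return the class of the least distant training instance. */
    static String classify(String[] test, List<String[]> train, List<String> labels, double[] weights) {
        double best = Double.POSITIVE_INFINITY;
        String bestLabel = null;
        for (int i = 0; i < train.size(); i++) {
            double d = distance(test, train.get(i), weights);
            if (d < best) { best = d; bestLabel = labels.get(i); }
        }
        return bestLabel;
    }
}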

To compute relevance, Gain Ratio is used, which is essentially the Information Gain of each feature of the training vectors divided by the entropy of that feature's values. Gain Ratio is a normalised version of the Information Gain measure (according to which each feature is considered independent of the rest, and which measures how much information each feature contributes to our knowledge of the correct class label).
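Spelled out from the description above (standard definitions, with C the class variable, f a feature with value set V_f, and H denoting entropy):

\[ w_f = \frac{IG(f)}{H(f)}, \qquad IG(f) = H(C) - \sum_{v \in V_f} P(v)\, H(C \mid f = v), \qquad H(f) = -\sum_{v \in V_f} P(v)\, \log_2 P(v) \]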

1.5 Evaluation Metrics

In this section I shall introduce the six metrics that were used in the evaluation of the two core components of the system, namely the classifier and the re-ranker.

1.5.1 Classifier Metrics (Precision, Recall, F-Measure and Accuracy)

Precision of a given class X is the ratio of the vectors that were correctly classified as class X (True Positives) to the total number of vectors that were classified as class X, whether correctly or not (True Positives + False Positives):

\[ \text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \qquad (1.1) \]

Recall of a given class X is the ratio of the vectors that were correctly classified as class X (True Positives) to the total number of vectors that actually belong to class X, in other words the number of vectors correctly classified as class X plus the number of vectors that were incorrectly not classified as class X (True Positives + False Negatives):

\[ \text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \qquad (1.2) \]

F-measure is a combination of precision and recall. The general formula of this metric is (b is a non-negative real-valued constant):

\[ \text{F-measure} = \frac{(b^2 + 1) \cdot \text{Precision} \cdot \text{Recall}}{b^2 \cdot \text{Precision} + \text{Recall}} \qquad (1.3) \]

In this study we use F1 (b = 1), which gives equal weight to precision and recall and is also called the weighted harmonic mean of precision and recall:

\[ F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (1.4) \]

Accuracy of the classifier as a whole is the ratio of the vectors that were correctly classified to the total number of vectors in the test set.
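A minimal sketch of these classifier metrics, computed per class from raw true-positive, false-positive and false-negative counts (names are my own):

/** Sketch of the classifier metrics of section 1.5.1. */
public class ClassifierMetrics {

    static double precision(int tp, int fp) { return tp / (double) (tp + fp); }

    static double recall(int tp, int fn)    { return tp / (double) (tp + fn); }

    /** F-measure with weight b; b = 1 gives the F1 used in this study. */
    static double fMeasure(double precision, double recall, double b) {
        return (b * b + 1) * precision * recall / (b * b * precision + recall);
    }

    /** Overall accuracy: correctly classified vectors over all test vectors. */
    static double accuracy(int correct, int total) { return correct / (double) total; }
}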

1.5.2 Re-ranker Metrics (Word Error Rate, Dialogue Move Accuracy, Sentence Accuracy)

Word Error Rate (WER) is the ratio of the number of deletions, insertions and substitutions in the transcription of a hypothesis, as compared to the true transcript, to the total number of words in the true transcript. In my system I compute it by measuring the Levenshtein distance between the hypothesis and the true transcript.

\[ \text{WER} = \frac{\text{Deletions} + \text{Insertions} + \text{Substitutions}}{\text{Length of Transcript}} \qquad (1.5) \]
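A minimal word-level Levenshtein sketch of equation 1.5 (the thesis does not show its own implementation; this is an assumed, standard dynamic-programming version):

/** Sketch of WER via word-level Levenshtein distance (equation 1.5). */
public class Wer {

    /** Minimum number of insertions, deletions and substitutions turning hyp into ref. */
    static int levenshtein(String[] ref, String[] hyp) {
        int[][] d = new int[ref.length + 1][hyp.length + 1];
        for (int i = 0; i <= ref.length; i++) d[i][0] = i;
        for (int j = 0; j <= hyp.length; j++) d[0][j] = j;
        for (int i = 1; i <= ref.length; i++) {
            for (int j = 1; j <= hyp.length; j++) {
                int sub = ref[i - 1].equals(hyp[j - 1]) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,      // deletion
                                            d[i][j - 1] + 1),     // insertion
                                   d[i - 1][j - 1] + sub);        // substitution / match
            }
        }
        return d[ref.length][hyp.length];
    }

    /** WER as a percentage of the reference transcript length. */
    static double wer(String transcript, String hypothesis) {
        String[] ref = transcript.trim().split("\\s+");
        String[] hyp = hypothesis.trim().split("\\s+");
        return 100.0 * levenshtein(ref, hyp) / ref.length;
    }
}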

Dialogue Move Accuracy (DMA) is a variant of the Concept Error Rate (CER) as defined by Boros et al. (1996), which takes into account the semantic aspects of the difference between the classified utterance and the true transcription. CER is similar in a sense to WER, since it takes into account deletions, insertions and substitutions, but at the semantic rather than the word level of the utterance. In our case DMA is stricter than CER, in the sense that it does not allow for partial matches in the semantic representation. In other words, if the classified utterance corresponds to the same semantic representation as the transcription then we have 100% DMA, otherwise 0%.

Sentence Accuracy (SA) measures the alignment of a single hypothesis in the n-best list with the true transcription. Similarly to DMA, it only accounts for perfect alignment between the hypothesis and the transcription, i.e. if they match perfectly we have 100% SA, otherwise 0%.

1.6 The TownInfo Dialogue System

The training datasets used in this study were collected from user (both native and non-native) interactions with the TownInfo dialogue system (Lemon, Georgila and Henderson, 2006) developed within the TALK project (http://www.talk-project.org). The TALK TownInfo system is an experimental, domain-specific system with which prospective users interact via natural speech in order to book a room in a hotel, a table in a restaurant, or to find a bar. Each user was given a specific scenario to fulfil, involving subtasks of preferred choice regarding price range, location and type of facility (Lemon, Georgila, Henderson and Stuttle, 2006). The dialogue system is implemented in the Open Agent Architecture (OAA) (Cheyer and Martin, 2001), with the main components being a dialogue manager, a dialogue policy reinforcement learner, a speech recogniser and a speech synthesiser. The inputs to my system are given by the dialogue manager and the speech recogniser.

The dialogue manager, DIPPER (Bos et al., 2003), follows the Information State Update (ISU) approach to dialogue management; it was specifically developed to handle spoken input/output and integrates several communicating software agents. These include agents that monitor dialogue progress and help other agents decide what action should be taken based on the previous and current state of the dialogue. The output of DIPPER which is of particular interest to us is the logging of the flow of the dialogues, with information such as user utterances, system output (both the transcript and the semantic representation), the current user task, etc. The goal of the system is to guide the dialogue manager's clarification and confirmation strategies, whose output is then given to the speech synthesiser to realise (Lemon, Georgila, Henderson and Stuttle, 2006).

The speech recogniser was built with the ATK toolkit (Young, 2004). The recogniser models natural speech using Hidden Markov Models and utilises n-grams as its language model, in this way integrating domain-specific data with wide-coverage data instead of using a domain-dependent recognition grammar network. What is more, it operates in n-best mode, which means that it produces the top n hypotheses recognised against the recorded speech of the user, ordered by the model's overall confidence score.


CHAPTER 2

Methodology

The outcome of this study is standalone software written in JAVA and C/C++ that performs re-ranking of the n-best lists produced by the speech recogniser in the context of a dialogue system. The input to the system is the n-best list that corresponds to a user utterance, along with the confidence score per utterance and per word of each utterance, and the dialogue log which contains the turns of both the system and the user. The output of the system is a single hypothesis that has been chosen from the n-best list, along with a label that corresponds to a degree of “certainty” as to the correctness of the chosen hypothesis.

In order to extract the correct hypothesis, the system includes a Memory-Based Classifier (TiMBL - Daelemans et al., 2007) that has been trained on the Edinburgh TownInfo corpus, consisting of 126 dialogues containing 1554 user utterances. For each utterance I used the 60-best lists produced by the recogniser, resulting in a total of 93240 hypotheses. Each hypothesis was automatically labelled with one of the 'opt', 'pos', 'neg' and 'ign' labels, denoting a decreasing order of confidence as to the match with the true transcript and semantic representation of the user's utterance, with 'opt' representing the most confident estimate.

The final testing of the classifier was performed on a separate part of the Edinburgh TownInfo corpus consisting of 58 dialogues containing 510 utterances. For each utterance I used the full 60-best lists, resulting in 30200 hypotheses. The testing of the re-ranker was performed with 10-fold cross-validation on the whole training set. The reason for choosing cross-validation is that it lends more significance to the results and demonstrates the overall robustness of the system in its ability to boost the recogniser's performance.

The classifier was trained with four different sets of an increasing number of features: the first set included only list features, such as the standard deviation of confidence scores; the second included the first plus current-hypothesis features, such as hypothesis length, confidence score, etc.; the third included the previous two plus acoustic features; and the fourth included all the previous plus the User Simulation score.

2.1 The Edinburgh TownInfo Corpus

The input to the system is extracted from the Edinburgh TownInfo corpus, which consists of a total of 184 dialogues, an example of which is given in Figure 2.1. The corpus had already been captured and transcribed by K. Georgila, since it is part of the training set used in the User Simulation model also incorporated in my system. In more detail, the corpus contains a collection of interactions with the TownInfo dialogue system by native and non-native users (see section 1.6).

Figure 2.1: A dialogue log from the Edinburgh TownInfo corpus (Andersson 2006)

System: How can I help you?
User: I need to find a hotel room
System: Did you say hotel?
User: Yes
System: Do you want something central?
User: Yes please
System: Do you need a single or double room?
User: [empty]
System: Do you need a single or double room?
User: [empty]
System: Do you need a single or double room?
User: Single
System: Would you like something cheap, expensive or reasonably priced?
User: I don't mind
System: Now I am going to search for a single central room. Please wait a moment while I search for suitable hotel options. (system presents options) Are you interested in any hotel option number or would you like me to keep going?
User: Number three
(system repeats option number three)
System: Are you happy with that choice?
User: Yes
System: Would you like to search some more?
User: No
System: Goodbye...and thanks for using the Edinburgh Cambridge dialogue system.


Each utterance is contained in various formats depending on the context we are focusing our attention on. At the highest level we have a collection of dialogue logs which are structured in accordance with the Information State Update (ISU) paradigm, as shown in Figure 2.3.

Apart from the transcript of the user's or system's utterance (shown in bold), the logs also contain a semantic representation for the limited knowledge domain of hotels, bars and restaurants that denotes the current Dialogue Move. More specifically, each utterance is transcribed in the following format: <Speech Act>, <Task>, <Slot Value> (shown in red in the example of Figure 2.3 with the equivalent values filled). The Speech Act field is a high-level representation of the type of the sentence that was uttered by the user/system and takes values such as provide_info or yes_answer, which mean that the user/system tries to convey some domain-specific information or answers affirmatively, respectively.

Figure 2.2: Part of an n-best list for the transcription 'Something cheap'. The second column denotes the acoustic score of the speech recogniser

Something cheap

<s> SOMETHING CHEAP </s> -15268.4
<s> SOMETHING A CHEAP </s> -15283.8
<s> I THINK CHEAP </s> -15294.8
<s> UH SOMETHING CHEAP </s> -15287.4
<s> SOMETHING CHEAPER </s> -15287.3
<s> SOMETHING CHEAP I </s> -15276.6
<s> I DON'T CHEAP </s> -15307.4
<s> I SOMETHING CHEAP </s> -15310.5
<s> I WANT A CHEAP </s> -15383.5
<s> A SOMETHING CHEAP </s> -15287.4
<s> SOMETHING CHEAP A </s> -15259.5
<s> SOMETHING CHEAP UH </s> -15259.5
<s> I HAVE A CHEAP </s> -15396.8
<s> I WANT CHEAP </s> -15327.8
<s> THE SOMETHING CHEAP </s> -15311.4
<s> UH THANK CHEAP </s> -15270.5
<s> AH SOMETHING CHEAP </s> -15291.3
<s> ER SOMETHING CHEAP </s> -15300
<s> SOMETHING CHEAP AT </s> -15261.9
<s> I THINK A CHEAP </s> -15336.4

Figure 2.3: Excerpt from a dialogue log containing the most useful fields, showing the Information State fields for the user utterance 'chinese'

TypeOfPolicy: 1
STATE 7
DIALOGUE LEVEL
Turn: user
TurnNumber: 3
Speaker: user
DialogueActType: user
ConvDomain: about_task
SpeechAct: [provide_info]
AsrInput: chinese
TransInput:
Output:
TASK LEVEL
Task: [food_type]
FilledSlot: [food_type]
FilledSlotValue: [chinese]
LOW LEVEL
AudioFileName: kirsten-003--2006-11-06_12-30-13.wav
ConfidenceScore: 0.44
HISTORY LEVEL
PreviouslyFilledSlots: [null],[top_level_trip],[null],[food_type]
PreviouslyFilledSlotsValues: [null],[restaurant],[],[chinese]
PreviouslyGroundedSlots: [null],[null],[top_level_trip],[]
SpeechActsHist: opening_closing,request_info,[provide_info,provide_info],explicit_confirm,[yes_answer,yes_answer],request_info,[provide_info]
TasksHist: meta_greeting_goodbye,top_level_trip,[top_level_trip,food_type],top_level_trip,[top_level_trip,food_type],food_type,[food_type]
FilledSlotsHist: [top_level_trip,food_type],[],[food_type]
FilledSlotsValuesHist: [restaurant,chinese],[],[chinese]

The Task field is a lower-level representation of the contents of what was uttered by the user/system and takes values such as top_level_trip or food_type, which summarise the fact that the user/system has made a general statement about a hotel, bar or restaurant, or a statement about the type of food, respectively. Finally, the Slot Value field is the lowest-level representation of the message conveyed and usually corresponds to specific information, such as chinese if the Task field has the value food_type, or cheap if the Task field is filled with hotel_price, etc.

For the purposes of the experiments I have used a different, cut-down version of the ISU logs that contains just the semantic parses of the dialogue moves of the system's and user's turns and the file names of the wave files that correspond to the user's utterances.

For each utterance we have a series of files of 60-best lists produced by the speech recogniser, namely the transcription hypotheses at the sentence level along with the acoustic model score (Figure 2.2), and the equivalent transcriptions at the word level, with information such as the duration of each recognised frame and the confidence score of the acoustic and language model for each word (Figure 2.4). Finally, there are the wave files of each utterance, which were used to compute various acoustic features.

Figure 2.4: Speech recogniser's output at a word level for the transcript 'Something cheap'. The columns correspond to: start of frame, end of frame, label, language modelling, acoustic, total and confidence score.

Something cheap

0 6000000 <s> 15.000000 -3306.653320 -3291.653320 0.903141
6000000 10900000 SOMETHING -69.225189 -3832.953613 -3902.178711 0.774873
10900000 14000000 CHEAP -16.965895 -2162.810547 -2179.776367 0.950973
14000000 25900000 </s> 6.578006 -5965.947266 -5959.369141 0.935400
///
0 6000000 <s> 15.000000 -3306.653320 12041.324219 0.903141
6000000 10300000 SOMETHING -69.225189 -3324.001465 -3393.226562 0.785827
10300000 11000000 A -42.978447 -608.698303 -651.676758 0.514222
11000000 14000000 CHEAP -17.142681 -2078.526367 -2095.668945 0.954854
14000000 25900000 </s> 6.578006 -5965.947266 -5959.369141 0.935400
///
0 6700000 <s> 15.000000 -3828.631348 11577.962891 0.890681
6700000 8800000 I -20.461586 -1653.962280 -1674.423828 0.720694
8800000 11200000 THINK -33.112690 -1921.326782 -1954.439453 0.784222
11200000 14000000 CHEAP -73.240974 -1924.966553 -1998.207520 0.957299
14000000 25900000 </s> 6.578006 -5965.947266 -5959.369141 0.935400


2.2 Automatic Labelling

In order to perform the re-ranking of the n-best lists we have to rely on some measure of correctness of each hypothesis. In other words, we need to distinguish those that are close enough to the true transcript from those that are not. Instead of adopting the industry-standard measure of closeness for speech recognisers, namely WER, I adhered to a less strict hybrid method that combines primarily the DMA and then the WER of each hypothesis. What is more, in order to induce some kind of discrete confidence scoring that can guide, or at least facilitate, the dialogue manager in choosing a particular strategy move, I have devised four labels with decreasing order of confidence: 'opt', 'pos', 'neg', 'ign'. These are automatically generated using two different modules: a keyword parser that computes the {<Speech Act><Task>} pair as described in the previous section, and a Levenshtein distance calculator, for the computation of the DMA and WER of each hypothesis respectively.

The reason for opting for a more abstract level, namely the semantics of the hypotheses, rather than delving into the lower level of individual word recognition, is that in dialogue systems it is usually sufficient to rely on the message that is being conveyed by the user rather than the exact words that he or she used.

Similarly to Gabsdil and Lemon (2004) and Jonson (2006), I ascribed to each hypothesis one of the 'opt', 'pos', 'neg', 'ign' labels according to the following schema (a code sketch of this rule follows the list):

• opt: The hypothesis is perfectly aligned and semantically identical to the transcription

• pos: The hypothesis is not entirely aligned (WER ≤ 50) but is semantically identical to the transcription

• neg: The hypothesis is semantically identical to the transcription but does not align well (WER > 50), or is semantically different compared to the transcription

• ign: The hypothesis was not addressed to the system (crosstalk), e.g. the user laughed, coughed, etc.
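A minimal sketch of this labelling rule (the method signature is my own; in the real system the semantic comparison comes from the keyword parser described in section 2.4 and the WER from the Levenshtein calculator of section 1.5.2):

/** Sketch of the automatic labelling schema of section 2.2 ('opt'/'pos'/'neg'/'ign').
 *  sameDialogueMove: does the hypothesis map to the same {<Speech Act><Task>} pair
 *  as the transcript?  wer: the hypothesis's word error rate (%) against the transcript. */
public class AutoLabeller {

    static String label(boolean sameDialogueMove, double wer, boolean crosstalk) {
        if (crosstalk) return "ign";                          // not addressed to the system
        if (sameDialogueMove && wer == 0.0) return "opt";     // perfect alignment, same semantics
        if (sameDialogueMove && wer <= 50.0) return "pos";    // same semantics, partial alignment
        return "neg";                                         // poor alignment or different semantics
    }
}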


The 50% WER threshold for the distinction between the 'pos' and 'neg' categories is adopted from Gabsdil (2003), based on the fact that WER is related to concept accuracy (Boros et al., 1996). In other words, if a hypothesis is erroneous as far as its transcript is concerned, then it is highly likely that it does not convey the correct message from a semantic point of view either.

It can be clearly seen that I always label hypotheses that are conceptually equivalent to a particular transcription as potential candidate dialogue strategy moves, and total misrecognitions as rejections. In Figure 2.5 we can see some examples of the four labels. Notice that in the case of silence, we give an 'opt' to the empty hypothesis.

Figure 2.5: Examples of the four labels: opt, pos, neg and ign

Transcript: I'd like to find a bar please
  I WOULD LIKE TO FIND A BAR PLEASE    pos
  I LIKE TO FIND A FOUR PLEASE         neg
  I'D LIKE TO FIND A BAR PLEASE        opt
  WOULD LIKE TO FIND THE OR PLEASE     ign

Transcript: silence
  -     opt
  MM    ign
  HM    ign
  UM    ign

2.3 Features

All the features used by the system are extracted from the dialogue logs, the n-best lists per utterance and per word, and the audio files. The majority of the features were chosen based on their success in previous systems as described in the literature. The novel feature, of course, is the User Simulation score, which may make redundant most of the equivalent dialogue features encountered in other studies.

In order to measure the usefulness of each candidate feature, and thus choose the most important ones, I used the common metrics of Information Gain and Gain Ratio (see section 1.4 for a very brief explanation) on the whole training set, i.e. 93240 hypotheses.

In total I extracted 13 attributes that can be grouped into 4 main categories: those that concern the current hypothesis to be classified, those that concern low-level statistics of the audio files, those that concern the whole n-best list, and finally the user simulation feature (a compact sketch of the resulting feature vector follows the list):

1. Current Hypothesis Features (6): acoustic score, overall model confidence score, minimum word confidence score, grammar parsability, hypothesis length and hypothesis duration.

2. Acoustic Features (3): minimum, maximum and RMS amplitude.

3. List Features (3): n-best rank, deviation of confidence scores in the list, match with most frequent Dialogue Move.

4. User Simulation (1): User Simulation confidence score.
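As a compact sketch, the 13 attributes above can be pictured as the following record; the field names are my own shorthand, not the identifiers used by the actual Feature Extractor:

/** Sketch of the 13-attribute feature vector built for one hypothesis. */
public class HypothesisFeatures {
    // 1. Current hypothesis features (6)
    double acousticScore;          // negative log likelihood of the whole hypothesis
    double overallConfidence;      // average of the per-word confidence scores
    double minWordConfidence;      // confidence of the least certain word
    double grammarParsability;     // Stanford Parser (PCFG) score of the hypothesis
    int    hypothesisLength;       // number of words
    double hypothesisDurationMs;   // duration in milliseconds

    // 2. Acoustic features (3), from the wave file
    double minAmplitude;
    double maxAmplitude;
    double rmsAmplitude;

    // 3. List features (3)
    int     nBestRank;                   // position in the n-best list
    double  confidenceDeviation;         // deviation from the list's mean confidence
    boolean matchesMostFrequentMove;     // same {<Speech Act><Task>} as the list majority

    // 4. User Simulation feature (1)
    double userSimulationScore;          // P(dialogue move | 4 previous moves)

    String label;                        // 'opt', 'pos', 'neg' or 'ign' (training only)
}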

The current hypothesis features were extracted from the n-best list files that contained the hypotheses' transcription along with the overall acoustic score (Figure 2.4) per utterance, and from the equivalent files that contained the transcription of each word along with the start of frame, end of frame and confidence score:

Acoustic score is the negative log likelihood ascribed by the speech recogniser to the whole hypothesis, being the sum of the individual word acoustic scores. Intuitively this is considered helpful, since it depicts the confidence of the statistical model alone for each word, and it has also been adopted in previous studies. Incorrect alignments tend to fit the model less well and thus have a low log likelihood.

Overall model confidence score is the average of the individual word confidence scores. In the absence of real overall model confidence scores in the given corpus files, I adhered to the average of the word confidence scores as the next best approximation to the model's overall confidence, taking into account both the language and the acoustic model.

Minimum word confidence score is also computed from the individual word transcriptions and corresponds to the confidence score of the word of which the speech recogniser is least certain. It is expected to help our classifier identify poor overall hypothesis recognitions, since a high overall confidence score can sometimes prove misleading.


Grammar Parsability is the negative log likelihood of the transcript of the current hypothesis as produced by the Stanford Parser, a wide-coverage Probabilistic Context-Free Grammar (PCFG) parser (Klein et al., 2003, http://nlp.stanford.edu/software/lex-parser.shtml). This feature seems helpful, since we expect that a highly ungrammatical hypothesis is unlikely to match the true transcription semantically.

Hypothesis duration is the length of the hypothesis in milliseconds, as extracted from the n-best list files with per-word transcriptions that include the start and end time of each recognised frame. The reason for the inclusion of this feature is that it can help distinguish between short utterances such as yes/no answers, medium-sized utterances of normal answers, and long utterances caused by crosstalk.

Hypothesis length is the number of words in a hypothesis and is expected to help in a similar way to the above feature.

The acoustic features were extracted directly from the wave files using SoX, an industry-standard open-source audio editing and conversion utility on *NIX environments:

Minimum, maximum and RMS amplitude are straightforward features, rather common in all the previous studies mentioned in section 1.2.

The list features were calculated from the n-best list files with transcriptions per utterance and per word, and take into account the whole list:

N-best rank is the position of the hypothesis in the list and could be useful in the sense that 'opt' and 'pos' hypotheses are usually found in the upper part of the list rather than at the bottom.

Deviation of confidence scores in the list is the deviation of the hypothesis's overall model confidence score from the mean confidence score in the list. This feature is extracted in the hope that it will indicate potential clusters of confidence scores at particular positions in the list, i.e. groups of hypotheses that deviate in a specific fashion from the mean and would thus be classified with the same label.


Match with most frequent Dialogue Move is the only boolean feature crafted and indicates whether the Dialogue Move of the current hypothesis, i.e. its {<Speech Act><Task>} pair, coincides with the most frequent one in the list. The trend in n-best lists is to have a majority of hypotheses that belong to one or two labels, with only one hypothesis belonging to 'opt' and/or a few to 'pos'. As a result, the idea behind this feature is to pick out such potential outliers, which are the desired goal of the re-ranker.

Finally, the user simulation score is given as an output of the User Simulation model created by K. Georgila and adapted for the purposes of this study (see the next section for more details). The model operates with 5-grams. Its input comes from two different sources: the history of the dialogue, namely the 4 previous Dialogue Moves, is taken from the dialogue logs, and the current hypothesis's semantic parse is generated on the fly by the same keyword parser used in the automatic labelling.

User Simulation score is the probability that the current hypothesis's Dialogue Move was really said by the user, given the 4 previous Dialogue Moves. The advantage of this feature has been discussed in section 1.3.

2.4 System Architecture

The system developed in the context of this study is implemented mainly in JAVA, with the exception of the parts that interact with the User Simulation model of K. Georgila and the TiMBL classifier (Daelemans et al., 2007), which were written in C/C++ and accessed via the Java Native Interface (JNI). In Figure 2.6 we can see an overview of the system's architecture. Currently the system works in off-line mode, i.e. it gets its input from the flat files that comprise the Edinburgh TownInfo corpus and performs re-ranking of an n-best list, i.e. it outputs the hypothesis that has the label with the highest degree of confidence, along with this very label. For evaluation purposes it currently computes the DMA, SA and WER of the training set with 10-fold cross-validation as its output.

However, an OAA wrapper has also been included in order to enable it to work in a real-time environment, where the input would be given directly by the speech recogniser and the dialogue logger, and its output would be given as input to the dialogue manager.

Figure 2.6: The system's architecture
[Diagram labels: Edinburgh TownInfo Dialogue Corpus; Feature Extractor; Keyword Parser; N-best transcriptions; n-1 hypotheses' Dialogue Moves; n-th hypothesis Dialogue Move; User Simulation; User Simulation score; Feature Vectors; TiMBL; Labels; Re-Ranker; Top Hypothesis and Label]

A brief description of the individual components follows:

• The keyword parser was originally written in C by K. Georgila and has been adapted to Java by Neil Mayo; it is his version that I included in my system. The keyword parser reads a vocabulary file which contains a simple mapping from the various domain-specific words of interest that are met in the transcripts to an intermediate reduced vocabulary that will be used by a pattern matcher. It then reads a file that contains all the patterns of the reduced vocabulary and maps these to {<Speech Act><Task>} pairs. Note that the original pattern files included with the original version of the parser by N. Mayo mapped the vocabulary to a different semantic representation. However, these were not considered helpful, since I wanted to keep the same formalism adopted in the ISU logs that already existed.

• The User Simulation is written in C by K. Georgila and is ported to my system via JNI. Originally, K. Georgila had written the User Simulation as an OAA agent, but the first experiments conducted using this version were rather inefficient in terms of runtime. The reason was that the OAA itself was inducing unwanted overhead, possibly due to the large size of the messages transferred between my system and the agent. As a result, I wrote a JNI wrapper around the original C code that interfaces its three main functions: load the model from n-grams stored in flat files into memory, simulate the user action given the n-1 history, and kill the model.

It should be noted that originally the User Simulation was trained on both the Cambridge and Edinburgh TownInfo corpora, resulting in a total of 461 dialogues with 4505 utterances. These were stored, as mentioned above, as n-grams in flat files produced by the CMU-Cambridge Statistical Language Model Toolkit v.2, using absolute discounting for smoothing the n-gram probabilities. Since I am also using the Edinburgh TownInfo corpus for training and testing TiMBL (Daelemans et al., 2007), I had to reduce the training dataset given to the User Simulation, to avoid having it trained on test data of my system as well. As a result, I subtracted the separate part of the Edinburgh TownInfo corpus, consisting of 58 dialogues containing 510 utterances, that was used to test the TiMBL classifier, and re-calculated the n-grams.

• The feature extractor is the core module of my system, written entirely in Java. It reads the Edinburgh TownInfo corpus from the various flat files that make it up and extracts the features that were described in detail in the previous section. The output of this module is the training and testing dataset in ARFF format, which was considered convenient for visualising and measuring the Information Gain in WEKA (http://www.cs.waikato.ac.nz/ml/weka/). This format can also be read by TiMBL (Daelemans et al., 2007).
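As an illustration of what the feature extractor emits, the sketch below writes an ARFF header for the 13 attributes of Figure 2.7 plus the class label; the file name and the attribute types are my assumptions.

import java.io.PrintWriter;

/** Sketch: write an ARFF header for the 13 features of Figure 2.7 plus the class label.
 *  Attribute names follow Figure 2.7; the file name and attribute types are assumptions. */
public class ArffHeaderSketch {
    public static void main(String[] args) throws Exception {
        String[] numeric = {
            "userSimulationScore", "rmsAmp", "minAmp", "maxAmp", "parsability",
            "acousScore", "hypothesisDuration", "hypothesisLength",
            "avgConfScore", "minWordConfidence", "nBestRank", "standardDeviation"
        };
        try (PrintWriter out = new PrintWriter("towninfo-train.arff")) {
            out.println("@relation towninfo-hypotheses");
            for (String name : numeric) {
                out.println("@attribute " + name + " numeric");
            }
            out.println("@attribute matchesFrequentDM {true,false}"); // the single boolean attribute
            out.println("@attribute class {opt,pos,neg,ign}");        // the four confidence labels
            out.println("@data");
            // ... one comma-separated feature vector per hypothesis follows here.
        }
    }
}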

• TiMBL (Daelemans et al., 2007) is written purely in C++ and usually runs in standalone mode. However, it provides a rather convenient API that enables other software to integrate it quite easily into their workflow. Since my system is written in Java, I wrote a JNI wrapper for it as well, porting its main API calls: load the model from a flat file in a tree-based format, train the model, test a flat file against a trained model, predict the class of a single vector given a trained model, and kill the model.

The input to TiMBL is a set of feature vectors combining real-valued numbers, integers and a single boolean attribute. The classifier itself internally converts the numeric attributes to discrete ones, using a default of 20 classes. The output is a set of labels attributed to the input vectors. Note that TiMBL completely ignores the fact that the input vectors actually correspond to hypotheses in an n-best list; in other words, each vector is fully independent of the others. It is the responsibility of the Feature Extractor and the Re-ranker to keep track of each vector's position in a dialogue and an n-best list and of its mapping to a single hypothesis.

TiMBL was trained using different parameter combinations, mainly varying the number of nearest neighbours k (1 to 5) and the distance metric (Weighted Overlap and Modified Value Difference Metric). Quite surprisingly, there was no significant gain from using parameter combinations other than the default, namely Weighted Overlap with k = 1 neighbours.

• The Re-ranker is written in Java. It takes as input the labels that have been assigned to the hypotheses of the n-best list under investigation and returns one hypothesis, selected according to the following algorithm, along with the corresponding label, in the hope that the label will assist the dialogue manager's strategies (adapted from Gabsdil and Lemon, 2004):

1. Scan the list of classified n-best recognition hypotheses top-down. Return the first result that is classified as 'opt'.

2. If 1. fails, scan the list of classified n-best recognition hypotheses top-down. Return the first result that is classified as 'pos'.

3. If 2. fails, count the number of neg's and ign's in the classified recognition hypotheses. If the number of neg's is larger than or equal to the number of ign's, then return the first 'neg'.

4. Else return the first 'ign' utterance.
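A compact Java sketch of this selection rule is given below; it assumes the classifier's labels are supplied in n-best order, one per hypothesis, and returns the index of the hypothesis to hand to the dialogue manager.

import java.util.List;

/** Sketch of the re-ranking rule adapted from Gabsdil and Lemon (2004). */
public class ReRankerSketch {

    public static int select(List<String> labels) {
        // 1. Return the first hypothesis classified as 'opt', scanning top-down.
        for (int i = 0; i < labels.size(); i++) {
            if (labels.get(i).equals("opt")) return i;
        }
        // 2. Otherwise return the first 'pos'.
        for (int i = 0; i < labels.size(); i++) {
            if (labels.get(i).equals("pos")) return i;
        }
        // 3. Otherwise compare the counts of 'neg' and 'ign' labels.
        int neg = 0, ign = 0;
        for (String label : labels) {
            if (label.equals("neg")) neg++;
            else if (label.equals("ign")) ign++;
        }
        // If 'neg' is at least as frequent, return the first 'neg'; else (4.) the first 'ign'.
        String wanted = (neg >= ign) ? "neg" : "ign";
        for (int i = 0; i < labels.size(); i++) {
            if (labels.get(i).equals(wanted)) return i;
        }
        return 0; // fallback: should not occur if every hypothesis carries one of the four labels
    }
}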

2.5. Experiments

In this study the experiments were conducted in two layers: the first layer concerns only the classifier, i.e. the ability of the system to correctly classify each hypothesis into one of the four labels 'opt', 'pos', 'neg', 'ign', and the second layer concerns the re-ranker, i.e. the ability of the system to boost the speech recogniser's accuracy.

For the first layer, I trained the TiMBL classifier using the Weighted Overlap metric

and k = 1 nearest neighbours (as discussed in the previous section) on 75 % of the

Edinburgh TownInfo corpus consisting of 126 dialogues containing 1554 user utter-

ances. Each utterance corresponds to a 60-best list produced by the recogniser, resulting in a total of 93240 hypotheses.

Using this corpus, I performed a series of experiments using different sets of features

in order to both determine and illustrate the increasing performance of the classifier.

These sets were determined not only by the literature but also by the Information

Gain measures that were calculated on the training set using WEKA, as shown in

Figure 2.7.


Figure 2.7: Information Gain for all 13 attributes (measured using WEKA)

    InfoGain   Attribute
    1.0324     userSimulationScore
    0.9038     rmsAmp
    0.8280     minAmp
    0.8087     maxAmp
    0.4861     parsability
    0.3975     acousScore
    0.3773     hypothesisDuration
    0.2545     hypothesisLength
    0.1627     avgConfScore
    0.1085     minWordConfidence
    0.0511     nBestRank
    0.0447     standardDeviation
    0.0408     matchesFrequentDM

Quite surprisingly, we can notice that the ranking given by the Information Gain measure coincides perfectly with the logical grouping of the attributes that was initially performed (see section 2.3). As a result, I chose to keep this grouping as the final 4 feature sets on which the experiments on the classifier were performed, in the following order:

1. List Features

2. List Features + Current Hypothesis Features

3. List Features + Current Hypothesis Features + Acoustic Features

4. List Features + Current Hypothesis Features + Acoustic Features + User Sim-

ulation

Note that the User Simulation score seems to be a really strong feature, scoring first

in the Information Gain rank, validating our main hypothesis.

The testing of the classifier using each of the above feature sets was performed on

the remaining 25 % of the Edinburgh TownInfo corpus, comprising 58 dialogues with 510 utterances; taking the 60-best lists results in a total of 30600


vectors. In each experiment I measured Precision, Recall, F-measure per class and

total Accuracy of the classifier.

For the second layer, I used a trained instance of the TiMBL classifier on the 4th fea-

ture set (List Features + Current Hypothesis Features + Acoustic Features + User

Simulation) and performed re-ranking using the algorithm illustrated in the previous

section on the same training set used in the first layer using 10-fold cross validation.

2.6. Baseline and Oracle

For the first layer I chose as a baseline the scenario in which the most frequent label, 'neg', is chosen in every case in the four-way classification.

For the second layer I chose as a baseline the normal speech recogniser's behaviour, i.e. giving as output the topmost hypothesis. As an oracle for the system I defined the choice of the first 'opt' in the n-best list to be classified or, if no 'opt' exists, the first 'pos' in the list. In this way it is guaranteed that we shall always get as output a perfect match to the true transcript as far as its Dialogue Move is concerned, provided such a match exists somewhere in the list.


CHAPTER 3

Results

As explained in chapter 2, I performed two series of experiments in two layers: the

first corresponds to the training of the classifier alone and the second to the system as

a whole measuring the re-ranker's output. A brief summary of the method follows:

• First Layer – Classifier Experiments

• Baseline

• List Features (LF)

• List Features + Current Hypothesis Features (LF + CHF)

• List Features + Current Hypothesis Features + Acoustic Features (LF

+ CHF + AF)

• List Features + Current Hypothesis Features + Acoustic Features +

User Simulation (LF + CHF + AF + US)

• Second Layer – Re-ranker Experiments

• Baseline

• 10-fold cross-validation

• Oracle

All results reported in this chapter are drawn from the TiMBL classifier trained with the Weighted Overlap metric and k = 1 nearest neighbours.

Both layers are trained on the same Edinburgh TownInfo Corpus of 126 dialogues

containing 1554 user utterances or a total of 93240 hypotheses. The first layer was

tested against a separate Edinburgh TownInfo Corpus of 58 dialogues containing 510

user utterances or a total of 30600 hypotheses, while the second was tested on the


whole training set with 10-fold cross-validation.

3.1 First Layer – Classifier Experiments

In this series of experiments I measure precision, recall and F1-measure for each of the four labels, and the overall F1-measure and accuracy of the classifier. In order to give a better view of the classifier's performance I have also included the confusion matrix for the final experiment with all 13 attributes, which scores better than the rest.

Tables 3.1-3.4 show per-class and per-attribute-set measures, while Table 3.5 gives a collective view of the results for the four sets of attributes and the baseline, i.e. the majority class label 'neg'. Table 3.6 shows the confusion matrix for the final experiment.

Feature set (opt)    Precision   Recall    F1-Measure
LF                   42.50%      58.41%    49.20%
LF+CHF               62.35%      65.71%    63.99%
LF+CHF+AF            55.59%      61.59%    58.43%
LF+CHF+AF+US         70.51%      73.66%    72.05%

Table 3.1: Precision, Recall and F1-Measure for the 'opt' category

Feature set (pos)    Precision   Recall    F1-Measure
LF                   25.18%      1.72%     3.22%
LF+CHF               51.22%      57.37%    54.11%
LF+CHF+AF            51.52%      54.60%    53.01%
LF+CHF+AF+US         64.79%      61.80%    63.26%

Table 3.2: Precision, Recall and F1-Measure for the 'pos' category


Feature set (neg)    Precision   Recall    F1-Measure
LF                   54.20%      96.36%    69.38%
LF+CHF               70.70%      74.95%    72.77%
LF+CHF+AF            69.50%      73.37%    71.38%
LF+CHF+AF+US         85.61%      87.03%    86.32%

Table 3.3: Precision, Recall and F1-Measure for the 'neg' category

Feature set (ign)    Precision   Recall    F1-Measure
LF                   19.64%      1.31%     2.46%
LF+CHF               63.52%      48.72%    55.15%
LF+CHF+AF            59.30%      48.90%    53.60%
LF+CHF+AF+US         99.89%      99.93%    99.91%

Table 3.4: Precision, Recall and F1-Measure for the 'ign' category

Feature set      F1-Measure   Accuracy
Baseline         -            51.08%
LF               37.31%       53.07%
LF+CHF           64.06%       64.77%
LF+CHF+AF        62.63%       63.35%
LF+CHF+AF+US     86.03%       84.90%

Table 3.5: F1-Measure and Accuracy for the four attribute sets.

In tables 3.1 – 3.5 we generally notice an increase in precision, recall and F1-meas-

ure as we progressively add more attributes to the system with the exception of the

addition of the Acoustic Features which seem to impair the classifier's performance.

We also make note of the fact that in the case of the 4th attribute set the classifier can

distinguish very well the 'neg' and 'ign' categories with 86.32% and 99.91% F1-meas-

ure respectively.

Most importantly, we obtain a remarkable boost in F1-measure and accuracy with the addition of the User Simulation score. We observe a 37.36% relative increase in F1-measure and a 34.02% relative increase in accuracy compared to the 3rd experiment, which contains all attributes except the User Simulation score, and a 66.20% relative increase in accuracy compared to the Baseline.

In table 3.4 we make note of a considerably low recall measure for the 'ign' category

in the case of the LF experiment, suggesting that the list features do not add extra

value to the classifier, partially validating the Information Gain measure (Figure 2.7).

        opt     pos     neg     ign
opt     232     37      46      0
pos     47      4405    2682    8
neg     45      2045    13498   0
ign     5       0       0       7550

Table 3.6: Confusion matrix for the LF + CHF + AF + US set

Taking a closer look at the 4th experiment with all 13 features, we notice in table 3.6 that most errors occur between the 'pos' and 'neg' categories. In fact, the False Positive Rate (FPR) is 18.17% for the 'neg' category and 8.9% for the 'pos' category, in both cases considerably larger than for the other categories.

3.2 Second Layer – Re-ranker Experiments

In this series of experiments I measure WER, DMA and SA for the system as a whole. In order to make sure that the improvement noted was really attributable to the classifier, I computed p-values for each of these measures, using the Wilcoxon signed rank test for WER and the McNemar chi-square test for the DMA and SA measures.

             WER          DMA          SA
Baseline     47.72%       75.05%       40.48%
Classifier   45.27% **    78.22% *     42.26%
Oracle       42.16% ***   80.20% ***   45.27% ***

Table 3.7: WER, DMA and SA measures for the Baseline, Classifier and Oracle
(*** indicates p < 0.001, ** indicates p < 0.01, * indicates p < 0.05)


In table 3.7 we note that the classifier scores 45.27% WER, a notable relative reduction of 5.13% compared to the baseline, and 78.22% DMA, a relative improvement of 4.22%. The classifier scored 42.26% on SA, but this was not significant compared to the baseline (0.05 < p < 0.10). Comparing the classifier's performance with the Oracle, it achieves 44.06% of the possible WER improvement on this data, 61.55% for the DMA measure and 37.16% for the SA measure.
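For clarity, these relative figures are computed as follows; the sketch below checks them against the WER numbers in Table 3.7.

/** Sketch: how the relative reduction and the share of the possible improvement
 *  are computed, checked against the WER figures of Table 3.7. */
public class RelativeGainsSketch {
    public static void main(String[] args) {
        double baseline = 47.72, classifier = 45.27, oracle = 42.16;
        double relativeReduction = (baseline - classifier) / baseline;           // ~0.0513 -> 5.13%
        double shareOfPossible  = (baseline - classifier) / (baseline - oracle); // ~0.4406 -> 44.06%
        System.out.printf("relative reduction: %.2f%%, share of possible improvement: %.2f%%%n",
                          100 * relativeReduction, 100 * shareOfPossible);
    }
}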

Finally, we also notice that the Oracle scores 80.20% DMA, which means that 19.80% of the n-best lists did not include any hypothesis that matched the true transcript semantically.

3.3. Significance tests: McNemar's test & Wilcoxon test

McNemar's test (Tan et al., 2001) is a statistical procedure that can validate the significance of differences between two classifiers on boolean data. Let fA be the baseline and fB be our system. Given paired binary data (in our case, whether the true transcript matches the topmost hypothesis for the baseline, or the output of the re-ranker for our system, semantically for the DMA measure and on the word level for the SA measure), we record the outcome on each utterance for both fA and fB simultaneously to construct the following contingency table:

                     Correct by fA    Incorrect by fA
Correct by fB            n00               n01
Incorrect by fB          n10               n11

McNemar’s test is based on the idea that there is little information about the distribu-

tion with which both the baseline and the classifier get the correct results or for

which both get incorrect results; it is based entirely on the values of n01 and n10. Un-

der the null hypothesis (H0), the two algorithms should have the same error rate,

meaning n01 = n10. It is essentially a χ² test with one degree of freedom and uses the following statistic (the McNemar statistic in its usual continuity-corrected form):

    χ² = ( |n01 − n10| − 1 )² / ( n01 + n10 )

If H0 is correct, then the probability that this statistic is greater than χ² = 3.84 is

less than 0.05. So we may reject the H0 in favour of the hypothesis that the two al-

gorithms have different performance.
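As an illustration, the decision at the 0.05 level can be computed directly from the two off-diagonal counts; the sketch below uses the continuity-corrected statistic given above.

/** Sketch: McNemar's test on the off-diagonal counts n01 and n10 of the contingency
 *  table above, using the continuity-corrected chi-square statistic and the 0.05
 *  critical value (3.84) for one degree of freedom. */
public class McNemarSketch {
    public static boolean significantlyDifferent(int n01, int n10) {
        if (n01 + n10 == 0) return false;          // the two systems never disagree
        double diff = Math.abs(n01 - n10) - 1.0;   // continuity correction
        double chiSquare = (diff * diff) / (n01 + n10);
        return chiSquare > 3.84;                   // reject H0 at p < 0.05
    }
}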

The Wilcoxon signed rank test is a statistical test for real-valued paired data that do not follow a normal distribution, as is the case with the WER distribution (see Figure 3.1 below). I used MATLAB's implementation of the test, signrank.

Figure 3.1: WER distribution for the re-ranker output (x axis: WER; y axis: number of utterances)


CHAPTER 4

Experiment with high-level Features

The four sets of attributes already described in the previous chapters were chosen

based both on previous studies and on intuition and were also partially justified ac-

cording to the ranking produced by running the Information Gain measure on the Ed-

inburgh TownInfo corpus training set.

Apart from this more traditional approach to feature selection, I also trained a Memory Based Classifier using only two higher-level features, the User Simulation score and the Grammar Parsability (US + GP). The idea behind this choice is to find a combination of features that ignores low-level characteristics of the user's utterances, as well as features that rely heavily on the speech recogniser and are thus by default not considered very trustworthy.

Quite surprisingly, the results taken from an experiment with just the User Simula-

tion score and the Grammar Parsability are very promising and comparable with

those acquired from the 4th experiment with all 13 attributes. Table 3.9 shows the pre-

cision, recall and F1-measure per label and table 3.10 illustrates the classifier's per-

formance in comparison with the 4th experiment.

Label   Precision   Recall    F1-Measure
opt     73.99%      64.13%    68.70%
pos     76.29%      46.21%    57.56%
neg     81.87%      94.42%    87.70%
ign     99.99%      99.95%    99.97%

Table 3.9: Precision, Recall and F1-measure for the high-level features experiment


It can be seen from table 3.9 that there is a considerable decrease in recall and a corresponding increase in precision for the 'pos' and 'opt' categories compared to the LF + CHF + AF + US attribute set, which accounts for the lower F1-measures. However, all in all the US + GP set manages to classify 207 more vectors correctly and, quite interestingly, produces far fewer ties1 and resolves a higher proportion of them compared to the full 13-attribute set.

Feature set      F1-Measure   Accuracy   Ties / Resolved / %
LF+CHF+AF+US     86.03%       84.90%     4993 / 863 / 57.13%
US+GP            85.68%       85.58%     115 / 75 / 65.22%

Table 3.10: F1-measure, Accuracy and number of ties correctly resolved by TiMBL for the LF+CHF+AF+US and US+GP feature sets

Next, I performed an experiment on the re-ranker using the aforementioned classifier

and it did not achieve much compared to the Baseline for the DMA and SA measures

(it scored 74.85% DMA, 0.2% lower than the Baseline and 40.82% SA, 0.34% high-

er than the Baseline, both results being statistically insignificant). For the WER it

scored 46.39%, a relative decrease of 2.78% compared to the Baseline, achieving

23.92% of the possible WER improvement for this dataset.

Following the success of the previous experiment on the classifier alone, I took things to the extreme and trained TiMBL with just the User Simulation score feature. Not surprisingly, the classifier scored 80.60% overall F1-measure and 81.64% accuracy, but it was unable to classify any of the 'opt' hypotheses correctly. As a result, it was not considered necessary to check the performance of the re-ranker with this rather minimal classifier.

1 In the case of k-nn algorithms we might come across situations when a particular vector is found to be equidistant from two or more neighbours that belong to different classes. In this case a particular tie-resolving scheme is adopted such as weighted voting.


CHAPTER 5

Discussion and Conclusions

In this chapter we shall discuss the methodology applied as a whole and the results

that were drawn from the experiments on the Edinburgh TownInfo corpus and

present some overall conclusions. The results, especially for the second layer of experiments, are limited for three main reasons:

1. The speech recogniser's performance. The oracle score for the DMA measure

shows that approximately 19.80% of the n-best lists do not contain a hypo-

thesis that matches semantically with the true transcript. This partly resulted in a dataset which is highly imbalanced (Figure 5.1) and impaired the classifier's ability to separate the classes. According to Andersson (2006) there are two causes for this

problem:

• Mis-timed recognition – where the microphone was not activated in due

time before the user started speaking and/or was deactivated before the

user had finished speaking.

• Bad recognition hypotheses – where the user said something clearly but

the system failed to recognise it. This can be ascribed to the decoding parameters and to the failure of the language model to cover domain-specific vocabulary.

2. The problem we are trying to solve is somewhat trivial from a semantic point

of view. In order to compute the labels and measure the DMA of each hypo-

thesis I have used a keyword parser (see section 2.4), which “translates” each

sentence to the format {<Speech Act><Task>} pair. While this high level of

representation seems to be enough for the User Simulation model, it seems as

though we let even highly ungrammatical hypotheses align semantically with


the true transcript. Although this assumption can be justified by the fact that

in dialogue systems we are interested in the messages to be conveyed rather

than the exact way they have been uttered by the user, we have artificially in-

creased in this way the baseline's DMA, in other words the erroneous topmost

hypotheses semantically align with the true transcript.

Figure 5.1 Label histogram for the Edinburgh TownInfo training set

5.1. Automatic Labelling

The labelling used in this study is closely related to that devised by Gabsdil and

Lemon (2004) and Jonson (2006). The main idea is to map each hypothesis to a class

that categorises it primarily from a semantic point of view and secondarily by taking into account the WER. This is done on the premise that Dialogue Systems are

sensitive to the meaning behind the user's utterance rather than the grammaticality of

the utterance.

However, the method adopted for ascribing a label to each hypothesis, as described in section 2.2, aggravated the problem of the imbalanced dataset mentioned

in the introduction of this chapter (Figure 5.1). The 'neg' category includes both se-

mantically aligned hypotheses with high WER (>50) and hypotheses which do not

align but are addressed to the system and thus are distinguished from the 'ign'. This


naturally boosted the 'neg' category, making it the majority class by a large numerical margin over the rest.

A way to alleviate this problem would be to split the 'neg' category into two, namely 'pessimistic' ('pess'), which would include semantically identical hypotheses with high WER (>50), and 'neg', which would cover semantically mismatched hypotheses. A few preliminary experiments in this direction were conducted but did not achieve much in terms of accuracy, and this is left for future work.
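A rough Java sketch of this proposed five-way labelling is given below; the WER threshold of 50 follows the description above, while the predicates (whether the utterance addresses the system, whether the hypothesis aligns semantically) and the 'opt'/'pos' boundary are only indicative of the scheme of section 2.2.

/** Sketch of the proposed five-way labelling that splits the current 'neg' class.
 *  The WER > 50 threshold follows the text above; the remaining criteria are only
 *  indicative of the labelling scheme described in section 2.2. */
public class LabellerSketch {
    public static String label(boolean addressedToSystem, boolean semanticallyAligned, double wer) {
        if (!addressedToSystem) return "ign";   // crosstalk: not meant for the system
        if (semanticallyAligned) {
            if (wer == 0.0) return "opt";       // perfect match (indicative criterion only)
            if (wer <= 50.0) return "pos";      // aligned with acceptable WER
            return "pess";                      // aligned but WER > 50 (the new class)
        }
        return "neg";                           // addressed to the system but semantically mismatched
    }
}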

5.2. Features

The features used in this study are divided into four groups which also reflect the

number of series of experiments that were conducted: list features, current hypothesis

features, acoustic features and User Simulation. All of the features mentioned with

the exception of the User Simulation score have been used in previous similar studies

with successful results, a strong justification for including them in my system as

well. However, as illustrated in chapter 4 not all of them contributed to the classifier's

performance.

The list features alone (1st experiment) did not make the classifier significantly better than the Baseline. It seems that, at least in the Edinburgh TownInfo corpus, we cannot account for possible clusters of labels gathered in a specific place within the list (something that features such as the n-best rank or the standard deviation of confidence scores could otherwise capture).

This phenomenon is particularly evident in the case of the 'ign' label, as shown in

table 3.4. This seems quite reasonable since in the case where an utterance is actually

crosstalk, then most of the times all the hypotheses in the n-best list are labelled as

'ign', rendering list-wise features such as n-best rank and standard deviation of con-

fidence score useless.

The current hypothesis features (2nd experiment) contributed significantly to the classifier's performance, which was quite expected, even though they include attributes such as the speech recogniser's confidence and acoustic score, which by default are the foremost suspects for the mediocre performance of the 1-best top-hypothesis

baseline system. The inclusion of the grammar parsability and the minimum word

confidence score seem to separate the hypotheses well especially between the 'opt' -

'pos' and 'neg' - 'ign' categories. In this way they validate the assumption that fairly

ungrammatical and/or marginally acceptable utterance recognitions (which might

have on average a high confidence score, but some of the words that comprise them are

not recognised with much confidence by the acoustic model and/or language model,

e.g. utterances with wrong syntax) do not carry the correct semantic information

compared to the true transcript.

On the other hand, although the addition of the acoustic features (3rd experiment) seemed promising from the Information Gain ranking (Figure 2.7) and the literature, it actually impaired the accuracy of the classifier (Table 3.5). This may be due to the fact that the minimum, maximum and RMS amplitude values correspond to a single wave file and are thus the same for all the hypotheses in an n-best list. From a dataset point of view, we are essentially multiplying the probability mass of a certain value for each of these attributes without those values being unique. As a result, we artificially boost their importance, which in turn tricks the Information Gain and Gain Ratio measures used internally by TiMBL. In other words, these attributes score high in both measures because they have many occurrences of the same values, rather than because they are unique and therefore essentially useful.

The addition of the User Simulation score (4th experiment) gives a remarkable boost

to the classifier's performance, which validates the main hypothesis of this study as

far as the classification of each hypothesis into a certain label is concerned. What is most striking is the fact that the User Simulation score helps the classifier distinguish

very clearly the 'ign' and then the 'neg' category, i.e. the categories which correspond

to hypotheses that mostly differ semantically from the true transcript or do not ad-

dress the system.

Especially in the case of the 'ign' category when the user does not address the system,

the User Simulation almost always models it very accurately. In other words, given a history of 4 dialogue moves (the User Simulation uses 5-grams) and the current move being semantically empty, {<[null]>,<[null]>}, it assigns it the highest probability it can give (as shown in Figure 5.2). This makes sense since, if the user currently does not ad-

dress the system, then the dialogue that has preceded is rather fixed and thus can be

modelled easily. An equivalent justification exists in the case when the user says

something that does not align semantically with the true transcript and/or is erro-

neous and thus has caused the system in the past to respond in a fixed way. Bear in

mind that we consider a dialogue system, the responses and vocabulary of which

(and of the user as well) is rather limited.

Figure 5.2 Histogram of User Simulation score for the Ed TownInfo training set

On the other hand, in the case of the 'opt' and 'pos' categories the User Simulation is

less certain (Figure 5.2), for exactly the opposite reason to the 'ign'

and 'neg' categories. In the case of correctly recognised hypotheses the dialogue

between the system and the user may progress rather quickly in the sense that the sys-

tem does not need to explicitly or implicitly confirm the user's utterances. This

means that the course of the dialogue can be quite different and thus more difficult to

model ( {<[provide_info]>,<[hotel_price]>} can occur in many different contexts

compared to {<[null]>,<[null]>} ). This is partially validated by the additional exper-

iment performed where I trained TiMBL with just the User Simulation score as a fea-

ture and noticed that it was not able to classify correctly any of the 'opt' hypotheses.


5.3 Classifier

The TiMBL classifier seems to be rather well-suited for modelling dialogue context-

based speech recognition using the User Simulation as an extra feature. Though

every effort was made to keep the method for feature selection and model optimization as consistent as possible (Information Gain ranking), I still believe that the classifier would benefit from a more systematic, exhaustive search through all the possible combinations of features and/or parameter settings, such as the “leave-one-out” method adopted by Gabsdil and Lemon (2004), which increased their classifier's accuracy by 9%.

The main drawback of our trained classifier was the high false positive rate for the

'pos' category. As is evident in table 3.6, the 'pos' category is easily mistaken for the 'neg' category. A possible cause is the fact that the hypotheses belonging to the 'neg' category far outnumber those belonging to the 'pos', as described in the introduction of this chapter. Another explanation is that the 'neg' category includes semantically aligned hypotheses (with high WER), as does the 'pos' category, and thus most of the features used cannot distinguish very well between the two classes in this case. For example, the hypothesis duration is the same whether or not the recogniser captures something semantically aligned with the true transcription.

5.4 Results

In the first layer of experiments I managed to train a considerably effective classifier using all 13 attributes, scoring 86.03% F1-measure and 84.90% accuracy. The User Simulation score seems to be the key attribute that accounts for most of the classifier's ability to separate the four classes. In favour of this hypothesis is the extra experiment performed towards the end of this study, where I trained TiMBL with just the User Simulation score and the Grammar Parsability and scored 85.68% F1-measure and 85.58% accuracy.


This latter experiment seems very promising in the sense that we can get acceptable

results with just two attributes, resulting in a very robust and efficient system. What

is more, the fact that these attributes are of a higher level than the rest poses an interesting question as to the approach that should be followed in the post-processing of speech recognition output. We should bear in mind, though, that neither feature can be extracted directly from the n-best lists or the wave files of the user's utterances; both involve applying models to the dialogue and to the syntax of each hypothesis. This means that we are essentially introducing time overhead into the system, which in the case of a real-time dialogue system is crucial. Dealing with 60-best lists

induces fairly acceptable overhead time for the User Simulation model which does

not have to account for too many different states, as is typical of domain-specific dialogue systems. However, this is not always true of the wide-coverage grammar parser, which sometimes has to parse long sentences and may slow down the overall response of the system. Using a more domain-specific and efficient parser than the one used in this thesis should effectively alleviate this problem.

In the second layer of experiments the performance of the re-ranker is as encouraging as that of the classifier. The system achieved a relative reduction in WER of 5.13%, a relative increase in DMA of 4.22%, and a relative increase in SA of 4.40% compared to the Baseline, with only the last not being statistically significant (0.05 < p < 0.10).

In the case of dialogue systems we are primarily interested in a gain in the DMA

measure, which would essentially mean that our re-ranker is helping the system to

better “understand” what the user really said and it seems that my system can im-

prove the performance of the speech recogniser. Even though the increase is somewhat small compared to previous studies, it still shows that my system is robust enough, gaining 61.55% of the possible DMA improvement, and the result is statistically significant. The same applies to the relative improvement in WER compared to the Baseline, which captures 44.06% of the possible boost in the overall performance of the speech recogniser.

A possible reason for not gaining a very large increase in the results for WER and DMA, and for the statistically insignificant improvement in the SA measure, was the limited size of the test data. What is more, as described in the introduction of this

chapter, the problem we are trying to solve from a semantic point of view is rather

trivial resulting both in the Baseline having an already high DMA and SA and in a

very tight margin between the Baseline and the Oracle, leaving only a small improve-

ment to be achieved.

5.5. Future work

Some ideas for future work have already been mentioned in previous sections of this chapter: the adoption of a fifth ('pess') label that would split the 'neg' category and therefore bring balance to the dataset; a more systematic search over TiMBL's feature and parameter combinations using the “leave-one-out” approach of Gabsdil and Lemon (2004); and the use of a more domain-specific and more efficient grammar parser.

Another useful improvement would be to use a more elaborate semantic parser than the keyword parser I used in my system, one which would take into account not only the presence or absence of certain keywords in each hypothesis but also some semantic function among the uttered words. In this way we would end up with a more difficult problem for the re-ranker to solve, which would essentially reduce the accuracy of the baseline, i.e. merely choosing the topmost hypothesis.

Finally, the current system is already implemented in a way that adheres to OAA practices and is thus very easy to integrate with a real dialogue system such as the DIPPER and TownInfo dialogue systems. In this way, we will be able to evaluate the system on truly unseen data and test it against the baseline system, which relies on the topmost 1-best hypothesis of the speech recogniser alone.

5.6. Conclusions

The system developed was tested in two layers, namely experiments that involved

the classifier alone and experiments that concerned the re-ranker. For the first layer I


conducted four experiments by training the classifier with an increasing number of

features:

• List Features (LF)

• List Features + Current Hypothesis Features (LF + CHF)

• List Features + Current Hypothesis Features + Acoustic Features (LF

+ CHF + AF)

• List Features + Current Hypothesis Features + Acoustic Features +

User Simulation (LF + CHF + AF + US)

Out of the four experiments the 4th gave the best results, with 86.03% F1-measure and 84.90% accuracy, yielding a 66.20% relative increase in accuracy compared to the Baseline.

I also conducted an additional experiment where the classifier was trained on a limited set of high-level features, namely the User Simulation score and the Grammar Parsability feature, and scored 85.68% F1-measure and 85.58% accuracy, a 67.54% relative increase in accuracy compared to the Baseline.

For the second layer of experiments the re-ranker achieved a relative reduction in WER of 5.13%, a relative increase in DMA of 4.22%, and a relative increase in SA of 4.40% compared to the Baseline, with only the last not being statistically significant (0.05 < p < 0.10). Comparing the re-ranker's performance with the Oracle, it achieved 44.06% of the possible WER improvement on this data, 61.55% for the DMA measure and 37.16% for the SA measure.

This study has shown that building a system that performs re-ranking of n-best lists

produced as an output from a speech recogniser module of a dialogue system can im-

prove the performance of the speech recogniser. It has also validated the main hypo-

thesis that the boost in the performance can be achieved to a considerable extent us-

ing a User Simulation model of the dialogues between the system and the user.


References

Andersson, S. (2006), “Context Dependent Speech Recognition”, MSc Dissertation,

University of Edinburgh, 2006.

Boros, M., Eckert, W., Gallwitz, F., Gorz, G., Hanrieder, G. and Niemann, H. (1996),

“Towards Understanding Spontaneous Speech: Word Accuracy vs. Concept Accur-

acy”, in Proceedings of International Symposium on Spoken Dialogue, ICSLP-96,

Philadelphia, USA, pp. 1005–1008.

Bos, J., Klein, E., Lemon, O. and Oka, T. (2003), “Dipper: Description and formal-

isation of an information-state update dialogue system architecture”, in 4th SIGdial

Workshop on Discourse and Dialogue, Sapporo, Japan, pp. 115–124.

Boyce, S., and Gorin, A. L. (1996), “User interface issues for natural spoken dia-

logue systems”, in Proceedings of International Symposium on Spoken Dialogue, pp.

65–68.

Cheyer, A. and Martin, D. (2001), “The open agent architecture”, Journal of

Autonomous Agents and Multi-Agent Systems 4(1), 143–148.

Chotimongkol, A. and Rudnicky, A. (2001), “N-best speech hypotheses reordering

using linear regression”, in Proceedings of EuroSpeech, pp. 1829–1832.

Cohen, W. (1996), “Learning trees and rules with set-valued features”, in Proceed-

ings of the Association for the Advancement of Artificial Intelligence, AAAI-96.

Daelemans, W., Zavrel, J., van der Sloot, K. and van den Bosch, A. (2007), “TiMBL:

Tilburg Memory Based Learner”, version 6.1 Reference Guide, ILK Technical Report

07-07.

Gabsdil, M. (2003), “Classifying Recognition Results for Spoken Dialogue

Systems”, in Proceedings of the Student Research Workshop at ACL–03.

Gabsdil, M. and Lemon, O. (2004), “Combining acoustic and pragmatic features to


predict recognition performance in spoken dialogue systems”, in Proceedings of

ACL, Barcelona, Spain, pp. 343–350.

Georgila, K., Henderson, J. and Lemon, O. (2006), “User Simulation for Spoken Dia-

logue Systems: Learning and Evaluation”, in Proceedings of the 9th International

Conference on Spoken Language Processing (INTERSPEECH–ICSLP-06), Pitts-

burgh, USA.

Gorin, A., L., Riccardi G. and Wright, J., H. (1997), “How may I help you?”, Journ-

al of Speech Communication, 23(1/2), pp. 113–127.

Gruenstein, A. (2008), “Response-Based Confidence Annotation for Spoken Dia-

logue Systems”, in Proceedings of the 9th SIGdial Workshop on Discourse and Dia-

logue, Columbus, Ohio, USA, pp. 11–20.

Gruenstein, A. and Seneff, S. (2007), “Releasing a multimodal dialogue system into

the wild: User support mechanisms”, in Proceedings of the 8th SIGdial Workshop on

Discourse and Dialogue, pp 111–119.

Gruenstein, A., Seneff S. and Wang C., (2006), “Scalable and portable web-based

multimodal dialogue interaction with geographical databases”, in Proceedings of

INTERSPEECH-06.

Jonson, R. (2006), “Dialogue context-based re-ranking of ASR hypotheses”, in

Spoken Language Technology Workshop, IEEE, Palm Beach, Aruba, pp. 174–177.

Lemon, O. (2004), “Context-sensitive speech recognition in ISU dialogue systems:

results for the grammar switching approach”, in Proceedings of the 8th Workshop on

the Semantics and Pragmatics of Dialogue, CATALOG-04.

Lemon, O., Georgila, K. and Henderson, J. (2006), “Evaluating effectiveness and

portability of reinforcement learned dialogue strategies with real users: The TALK

TownInfo evaluation”, in Spoken Language Technology Workshop, IEEE, pp. 178–

181.

Lemon, O., Georgila, K., Henderson, J. and Stuttle, M. (2006), “An ISU dialogue

system exhibiting reinforcement learning of dialogue policies: Generic slot-filling in

the talk in-car system”, in Proceedings of European Chapter of the ACL, EACL-06,


Trento, Italy, pp. 119–122.

Kamm, C., Litman, D. and Walker, M., A. (1998), “From novice to expert: The effect

of tutorials on user expertise with user dialogue systems”, in Proceedings of the In-

ternational Conference on Spoken Language Processing, ICSL-98.

Klein, D. and Manning, C., D. (2003), “Fast Exact Inference with a Factored Model

for Natural Language Parsing”, in Journal of Advances in Neural Information Pro-

cessing Systems 15, NIPS-02, Cambridge, MA, MIT Press, pp. 3–10.

Litman, D., Hirschberg, J. and Swerts, M. (2000), “Predicting automatic speech re-

cognition performance using prosodic cues”, in Proceedings of NAACL-00.

Litman, D. and Pan, S. (2000), “Predicting and adapting to poor speech recognition

in a spoken dialogue system”, in Proceedings of the Association for the Advancement

of Artificial Intelligence, AAAI-00, Austin, USA, pp. 722–728.

Pickering, M., J. and Garrod S. (2007), “Do people use language production to make predictions during comprehension?”, Journal of Trends in Cognitive Sciences, 11(3), pp.105–110.

Rudnicky, A., I., Bennett, C., Black, A., W., Chotimongkol, A., Lenzo, K., Oh, A. and

Singh R. (2000), “Task and Domain Specific Modeling in the Carnegie Mellon Com-

municator System”, in Proceedings of the International Conference on Spoken Lan-

guage Processing, ICSLP’00, Beijing, China.

Tan, C., M., Wang, Y., F. and Lee, C., D. (2001), “The use of bigrams to enhance text

categorization”, Journal of Information Processing & Management, 38(4), pp. 529–

546.

Walker M., Fromer, J., C. and Narayanan S. (1998), “Learning optimal dialogue

strategies: A case study of a spoken dialogue agent for email”, in Proceedings of the

36th Annual Meeting of the Association for Computational Linguistics, COLING/ACL

-98, pp. 1345–1352.

Walker, M., Wright, J. and Langkilde, I. (2000), “Using natural language processing

and discourse features to identify understanding errors in a spoken dialogue system”,

in Proceedings of International Conference on Machine Learning, ICML-00.


Young, S. (2004), “ATK An Application Toolkit for HTK”, 1.4.1 edn. Technical

Manual.