
PARADISE: A Framework for Evaluating Spoken Dialogue Agents

Marilyn A. Walker, Diane J. Litman, Candace A. Kamm and Alicia Abella
AT&T Labs—Research

180 Park Avenue, Florham Park, NJ 07932-0971 USA

walker,diane,cak,[email protected]

Abstract

This paper presents PARADISE (PARAdigm for DIalogue System Evaluation), a general framework for evaluating spoken dialogue agents. The framework decouples task requirements from an agent's dialogue behaviors, supports comparisons among dialogue strategies, enables the calculation of performance over subdialogues and whole dialogues, specifies the relative contribution of various factors to performance, and makes it possible to compare agents performing different tasks by normalizing for task complexity.

1 Introduction

Recent advances in dialogue modeling, speech recognition, and natural language processing have made it possible to build spoken dialogue agents for a wide variety of applications.1 Potential benefits of such agents include remote or hands-free access, ease of use, naturalness, and greater efficiency of interaction. However, a critical obstacle to progress in this area is the lack of a general framework for evaluating and comparing the performance of different dialogue agents.

One widely used approach to evaluation is based on the notion of a reference answer (Hirschman et al., 1990). An agent's responses to a query are compared with a predefined key of minimum and maximum reference answers; performance is the proportion of responses that match the key. This approach has many widely acknowledged limitations (Hirschman and Pao, 1993; Danieli et al., 1992; Bates and Ayuso, 1993), e.g., although there may be many potential dialogue strategies for carrying out a task, the key is tied to one particular dialogue strategy.

1 We use the term agent to emphasize the fact that we are evaluating a speaking entity that may have a personality. Readers who wish to may substitute the word "system" wherever "agent" is used.

In contrast, agents using different dialogue strategies can be compared with measures such as inappropriate utterance ratio, turn correction ratio, concept accuracy, implicit recovery and transaction success (Danieli and Gerbino, 1995; Hirschman and Pao, 1993; Polifroni et al., 1992; Simpson and Fraser, 1993; Shriberg, Wade, and Price, 1992). Consider a comparison of two train timetable information agents (Danieli and Gerbino, 1995), where Agent A in Dialogue 1 uses an explicit confirmation strategy, while Agent B in Dialogue 2 uses an implicit confirmation strategy:

(1) User: I want to go from Torino to Milano.
    Agent A: Do you want to go from Trento to Milano? Yes or No?
    User: No.

(2) User: I want to travel from Torino to Milano.
    Agent B: At which time do you want to leave from Merano to Milano?
    User: No, I want to leave from Torino in the evening.

Danieli and Gerbino found that Agent A had a higher transaction success rate and produced fewer inappropriate and repair utterances than Agent B, and thus concluded that Agent A was more robust than Agent B.

However, one limitation of both this approach and the reference answer approach is the inability to generalize results to other tasks and environments (Fraser, 1995). Such generalization requires the identification of factors that affect performance (Cohen, 1995; Sparck-Jones and Galliers, 1996). For example, while Danieli and Gerbino found that Agent A's dialogue strategy produced dialogues that were approximately twice as long as Agent B's, they had no way of determining whether Agent A's higher transaction success or Agent B's efficiency was more critical to performance. In addition to agent factors such as dialogue strategy, task factors such as database size and environmental factors such as background noise may also be relevant predictors of performance.

These approaches are also limited in that they currently do not calculate performance over subdialogues as well as whole dialogues, correlate performance with an external validation criterion, or normalize performance for task complexity.

[Figure 1: PARADISE's structure of objectives for spoken dialogue performance. The top-level objective, MAXIMIZE USER SATISFACTION, decomposes into MAXIMIZE TASK SUCCESS (measured by Kappa) and MINIMIZE COSTS; costs divide into EFFICIENCY MEASURES (dialogue time, number of utterances, etc.) and QUALITATIVE MEASURES (inappropriate utterance ratio, agent response delay, repair ratio, etc.).]

This paper describes PARADISE, a general framework for evaluating spoken dialogue agents that addresses these limitations. PARADISE supports comparisons among dialogue strategies by providing a task representation that decouples what an agent needs to achieve in terms of the task requirements from how the agent carries out the task via dialogue. PARADISE uses a decision-theoretic framework to specify the relative contribution of various factors to an agent's overall performance. Performance is modeled as a weighted function of a task-based success measure and dialogue-based cost measures, where weights are computed by correlating user satisfaction with performance. Also, performance can be calculated for subdialogues as well as whole dialogues. Since the goal of this paper is to explain and illustrate the application of the PARADISE framework, for expository purposes, the paper uses simplified domains with hypothetical data throughout. Section 2 describes PARADISE's performance model, and Section 3 discusses its generality, before concluding in Section 4.

2 A Performance Model for Dialogue

PARADISE uses methods from decision theory (Keeney and Raiffa, 1976; Doyle, 1992) to combine a disparate set of performance measures (i.e., user satisfaction, task success, and dialogue cost, all of which have been previously noted in the literature) into a single performance evaluation function. The use of decision theory requires a specification of both the objectives of the decision problem and a set of measures (known as attributes in decision theory) for operationalizing the objectives. The PARADISE model is based on the structure of objectives (rectangles) shown in Figure 1. The PARADISE model posits that performance can be correlated with a meaningful external criterion such as usability, and thus that the overall goal of a spoken dialogue agent is to maximize an objective related to usability. User satisfaction ratings (Kamm, 1995; Shriberg, Wade, and Price, 1992; Polifroni et al., 1992) have been frequently used in the literature as an external indicator of the usability of a dialogue agent. The model further posits that two types of factors are potential relevant contributors to user satisfaction (namely task success and dialogue costs), and that two types of factors are potential relevant contributors to costs (Walker, 1996).

In addition to the use of decision theory to create this objective structure, other novel aspects of PARADISE include the use of the Kappa coefficient (Carletta, 1996; Siegel and Castellan, 1988) to operationalize task success, and the use of linear regression to quantify the relative contribution of the success and cost factors to user satisfaction.

The remainder of this section explains the measures (ovals in Figure 1) used to operationalize the set of objectives, and the methodology for estimating a quantitative performance function that reflects the objective structure. Section 2.1 describes PARADISE's task representation, which is needed to calculate the task-based success measure described in Section 2.2. Section 2.3 describes the cost measures considered in PARADISE, which reflect both the efficiency and the naturalness of an agent's dialogue behaviors. Section 2.4 describes the use of linear regression and user satisfaction to estimate the relative contribution of the success and cost measures in a single performance function. Finally, Section 2.5 explains how performance can be calculated for subdialogues as well as whole dialogues, while Section 2.6 summarizes the method.

2.1 Tasks as Attribute Value Matrices

A general evaluation framework requires a task representation that decouples what an agent and user accomplish from how the task is accomplished using dialogue strategies. We propose that an attribute value matrix (AVM) can represent many dialogue tasks. This consists of the information that must be exchanged between the agent and the user during the dialogue, represented as a set of ordered pairs of attributes and their possible values.2

As a first illustrative example, consider a simplification of the train timetable domain of Dialogues 1 and 2, where the timetable only contains information about rush-hour trains between four cities, as shown in Table 1. This AVM consists of four attributes (abbreviations for each attribute name are also shown).3 In Table 1, these attribute-value pairs are annotated with the direction of information flow to represent who acquires the information, although this information is not used for evaluation. During the dialogue the agent must acquire from the user the values of DC, AC, and DR, while the user must acquire DT.

attribute           possible values               information flow
depart-city (DC)    Milano, Roma, Torino, Trento  to agent
arrival-city (AC)   Milano, Roma, Torino, Trento  to agent
depart-range (DR)   morning, evening              to agent
depart-time (DT)    6am, 8am, 6pm, 8pm            to user

Table 1: Attribute value matrix, simplified train timetable domain

A1: Hello, This is Train Enquiry Service.        DC,AC,DR,DT
    Please speak after the tone.                 DC,AC,DR,DT
    Which information do you need?               DC,AC,DR,DT
U1: I want to go from Torino to Milano.          DC,AC
A2: Do you want to go from Trento to Milano?     DC,AC
    Yes or No?                                   DC,AC
U2: No.                                          DC,AC
A3: Do you want to leave from Trento?            DC
    Yes or No?                                   DC
U3: No.                                          DC
A4: Where do you want to leave from?             DC
U4: I want to leave from Torino.                 DC
A5: Do you want to leave from Torino?            DC
    Yes or No?                                   DC
U5: Yes.                                         DC
A6: Do you want to go to Milano?                 AC
U6: Yes.                                         AC
A7: At which time do you want to leave?          DR
U7: I want to travel in the evening.             DR
A8: Do you want to leave between 6 and 9 p.m.?   DR
    Yes or No?                                   DR
U8: Yes.                                         DR
A9: There is a train leaving at 8:00 p.m.        DT
...

Figure 2: Agent A dialogue interaction (Danieli and Gerbino, 1995)

Performance evaluation for an agent requires a corpus of dialogues between users and the agent, in which users execute a set of scenarios. Each scenario execution has a corresponding AVM instantiation indicating the task information requirements for the scenario, where each attribute is paired with the attribute value obtained via the dialogue.

2 For infinite sets of values, actual values found in the experimental data constitute the required finite set.

3 The AVM serves as an evaluation mechanism only. We are not claiming that AVMs determine an agent's behavior or serve as an utterance's semantic representation.

B1: Hello, This is Train Enquiry Service.                      DC,AC,DR,DT
    Please speak after the tone.                               DC,AC,DR,DT
    Which information do you need?                             DC,AC,DR,DT
U1: I want to travel from Torino to Milano.                    DC,AC
B2: At which time do you want to leave from Merano to Milano?  DC,AC,DR
U2: No, I want to leave from Torino in the evening.            DC,DR
B3: Do you want to leave from Torino between 6 and 11 p.m?     DC,DR
    Please answer Yes or No.                                   DC,DR
U3: Yes.                                                       DC,DR
B4: A train leaves at 8 p.m.                                   DT
...

Figure 3: Agent B dialogue interaction (Danieli and Gerbino, 1995)

attribute       actual value
depart-city     Torino
arrival-city    Milano
depart-range    evening
depart-time     8pm

Table 2: Attribute value matrix instantiation, scenario key for Dialogues 1 and 2

For example, assume that a scenario requires the user to find a train from Torino to Milano that leaves in the evening, as in the longer versions of Dialogues 1 and 2 in Figures 2 and 3.4 Table 2 contains an AVM corresponding to a “key” for this scenario. All dialogues resulting from execution of this scenario in which the agent and the user correctly convey all attribute values (as in Figures 2 and 3) would have the same AVM as the scenario key in Table 2. The AVMs of the remaining dialogues would differ from the key by at least one value. Thus, even though the dialogue strategies in Figures 2 and 3 are radically different, the AVM task representation for these dialogues is identical, and the performance of the system for the same task can thus be assessed on the basis of the AVM representation.
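To make the bookkeeping concrete, the AVM and scenario key of Tables 1 and 2 can be rendered as simple mappings, with key comparison reducing to attribute-by-attribute equality. The following Python sketch is ours, not part of the framework's specification, and the helper name matches_key is hypothetical:

```python
# A minimal sketch of the AVM task representation (attributes from Table 1)
# and the scenario key of Table 2. The data structures are our own choice.
TASK_AVM = {
    "depart-city":  {"Milano", "Roma", "Torino", "Trento"},
    "arrival-city": {"Milano", "Roma", "Torino", "Trento"},
    "depart-range": {"morning", "evening"},
    "depart-time":  {"6am", "8am", "6pm", "8pm"},
}

scenario_key = {
    "depart-city": "Torino",
    "arrival-city": "Milano",
    "depart-range": "evening",
    "depart-time": "8pm",
}

def matches_key(dialogue_avm, key):
    """Return the set of attributes whose conveyed value agrees with the key."""
    return {a for a, v in key.items() if dialogue_avm.get(a) == v}

# A dialogue in which all values were conveyed correctly (as in Figures 2
# and 3) has exactly the same AVM instantiation as the scenario key.
assert matches_key(dict(scenario_key), scenario_key) == set(scenario_key)
```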

2.2 Measuring Task Success

Success at the task for a whole dialogue (or subdialogue) is measured by how well the agent and user achieve the information requirements of the task by the end of the dialogue (or subdialogue). This section explains how PARADISE uses the Kappa coefficient (Carletta, 1996; Siegel and Castellan, 1988) to operationalize the task-based success measure in Figure 1.

The Kappa coefficient, κ, is calculated from a confusion matrix that summarizes how well an agent achieves the information requirements of a particular task for a set of dialogues instantiating a set of scenarios.5

4 These dialogues have been slightly modified from (Danieli and Gerbino, 1995). The attribute names at the end of each utterance will be explained below.

KEY: columns v1-v4 = DEPART-CITY, v5-v8 = ARRIVAL-CITY, v9-v10 = DEPART-RANGE, v11-v14 = DEPART-TIME
DATA (rows; cell frequencies listed in column order, diagonal entries starred):
v1:  22* 1 3
v2:  29*
v3:  4 16* 4 1
v4:  1 1 5 11* 1
v5:  3 20*
v6:  22*
v7:  2 1 1 20* 5
v8:  1 1 2 8 15*
v9:  45* 10
v10: 5 40*
v11: 20* 2
v12: 1 19* 2 4
v13: 2 18*
v14: 2 6 3 21*
sum: 30 30 25 15 25 25 30 20 50 50 25 25 25 25

Table 3: Confusion matrix, Agent A

KEY: columns v1-v4 = DEPART-CITY, v5-v8 = ARRIVAL-CITY, v9-v10 = DEPART-RANGE, v11-v14 = DEPART-TIME
DATA (rows; cell frequencies listed in column order, diagonal entries starred):
v1:  16* 1 4 3 2
v2:  1 20* 1 3
v3:  5 1 9* 4 2 4 2
v4:  1 2 6 6* 2 3
v5:  4 15* 2 3
v6:  1 6 19*
v7:  5 2 1 1 15* 4
v8:  1 3 3 1 2 9 11*
v9:  2 2 39* 10
v10: 6 35*
v11: 20* 5 5 4
v12: 10* 5 5
v13: 5 5 10* 5
v14: 5 5 11*
sum: 30 30 25 15 25 25 30 20 50 50 25 25 25 25

Table 4: Confusion matrix, Agent B

For example, Tables 3 and 4 show two hypothetical confusion matrices that could have been generated in an evaluation of 100 complete dialogues with each of two train timetable agents A and B (perhaps using the confirmation strategies illustrated in Figures 2 and 3, respectively).6 The values in the matrix cells are based on comparisons between the dialogue and scenario key AVMs. Whenever an attribute value in a dialogue (i.e., data) AVM matches the value in its scenario key, the number in the appropriate diagonal cell of the matrix (starred for clarity) is incremented by 1. The off-diagonal cells represent misunderstandings that are not corrected in the dialogue. Note that depending on the strategy that a spoken dialogue agent uses, confusions across attributes are possible, e.g., “Milano” could be confused with “morning.”

5 Confusion matrices can be constructed to summarize the result of dialogues for any subset of the scenarios, attributes, users or dialogues.

6 The distributions in the tables were roughly based on performance results in (Danieli and Gerbino, 1995).

The effect of misunderstandings that are corrected during the course of the dialogue is reflected in the costs associated with the dialogue, as will be discussed below.

The first matrix summarizes how the 100 AVMs representing each dialogue with Agent A compare with the AVMs representing the relevant scenario keys, while the second matrix summarizes the information exchange with Agent B. Labels v1 to v4 in each matrix represent the possible values of depart-city shown in Table 1; v5 to v8 are for arrival-city, etc. Columns represent the key, specifying which information values the agent and user were supposed to communicate to one another given a particular scenario. (The equivalent column sums in both tables reflect that users of both agents were assumed to have performed the same scenarios.) Rows represent the data collected from the dialogue corpus, reflecting what attribute values were actually communicated between the agent and the user.
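One way to picture how the cells are filled is the sketch below. The index scheme follows the paper's v1-v14 convention, but the function and data structures are our own; conveyed values are given as (attribute, value) pairs so that cross-attribute confusions such as “Milano”/“morning” land in the right off-diagonal cell:

```python
# Sketch: building a confusion matrix from (dialogue AVM, scenario key) pairs.
# Indices 0..13 correspond to the paper's v1..v14: v1-v4 depart-city,
# v5-v8 arrival-city, v9-v10 depart-range, v11-v14 depart-time.
VALUE_INDEX = {}
for attr, values in [
    ("depart-city",  ["Milano", "Roma", "Torino", "Trento"]),
    ("arrival-city", ["Milano", "Roma", "Torino", "Trento"]),
    ("depart-range", ["morning", "evening"]),
    ("depart-time",  ["6am", "8am", "6pm", "8pm"]),
]:
    for v in values:
        VALUE_INDEX[(attr, v)] = len(VALUE_INDEX)

def confusion_matrix(dialogues, keys):
    """dialogues: per-dialogue dicts mapping each key attribute to the
    (attribute, value) pair actually conveyed; keys: the scenario AVMs."""
    n = len(VALUE_INDEX)
    M = [[0] * n for _ in range(n)]
    for conveyed, key in zip(dialogues, keys):
        for attr, key_value in key.items():
            row = VALUE_INDEX[conveyed[attr]]      # data: what was conveyed
            col = VALUE_INDEX[(attr, key_value)]   # key: what was intended
            M[row][col] += 1                       # diagonal cells = matches
    return M
```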

Given a confusion matrix M, success at achieving the information requirements of the task is measured with the Kappa coefficient (Carletta, 1996; Siegel and Castellan, 1988):

\[
\kappa = \frac{P(A) - P(E)}{1 - P(E)}
\]

P(A) is the proportion of times that the AVMs for the actual set of dialogues agree with the AVMs for the scenario keys, and P(E) is the proportion of times that the AVMs for the dialogues and the keys are expected to agree by chance.7 When there is no agreement other than that which would be expected by chance, κ = 0. When there is total agreement, κ = 1. κ is superior to other measures of success such as transaction success (Danieli and Gerbino, 1995), concept accuracy (Simpson and Fraser, 1993), and percent agreement (Gale, Church, and Yarowsky, 1992) because κ takes into account the inherent complexity of the task by correcting for chance expected agreement. Thus κ provides a basis for comparisons across agents that are performing different tasks.

When the prior distribution of the categories is unknown, P(E), the expected chance agreement between the data and the key, can be estimated from the distribution of the values in the keys. This can be calculated from confusion matrix M, since the columns represent the values in the keys. In particular:

\[
P(E) = \sum_{i=1}^{n} \left( \frac{t_i}{T} \right)^2
\]

where t_i is the sum of the frequencies in column i of M, and T is the sum of the frequencies in M (t_1 + ... + t_n).

P(A), the actual agreement between the data and the key, is always computed from the confusion matrix M:

\[
P(A) = \frac{\sum_{i=1}^{n} M(i,i)}{T}
\]

Given the confusion matrices in Tables 3 and 4, P(E) = 0.079 for both agents.8 For Agent A, P(A) = 0.795 and κ = 0.777, while for Agent B, P(A) = 0.59 and κ = 0.555, suggesting that Agent A is more successful than B in achieving the task goals.
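Since P(A) depends only on the diagonal of M and P(E) only on its column sums, κ can be computed from those two vectors alone. The sketch below (our code, not the paper's) reproduces the Agent A figure from the diagonal and column sums of Table 3:

```python
def kappa(diagonal, column_sums):
    """Kappa from a confusion matrix's diagonal (correct-value frequencies)
    and its column sums (key-value frequencies)."""
    T = sum(column_sums)                           # total frequency in M
    p_a = sum(diagonal) / T                        # observed agreement P(A)
    p_e = sum((t / T) ** 2 for t in column_sums)   # chance agreement P(E)
    return (p_a - p_e) / (1 - p_e)

# Column sums shared by Tables 3 and 4, and Table 3's diagonal (Agent A):
sums   = [30, 30, 25, 15, 25, 25, 30, 20, 50, 50, 25, 25, 25, 25]
diag_a = [22, 29, 16, 11, 20, 22, 20, 15, 45, 40, 20, 19, 18, 21]
print(round(kappa(diag_a, sums), 3))   # 0.777, with P(A) = 0.795, P(E) = 0.079
```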

2.3 Measuring Dialogue Costs

As shown in Figure 1, performance is also a function of a combination of cost measures. Intuitively, cost measures should be calculated on the basis of any user or agent dialogue behaviors that should be minimized.

7 κ has been used to measure pairwise agreement among coders making category judgments (Carletta, 1996; Krippendorf, 1980; Siegel and Castellan, 1988). Thus, the observed user/agent interactions are modeled as a coder, and the ideal interactions as an expert coder.

8 Using a single confusion matrix for all attributes as in Tables 3 and 4 inflates κ when there are few cross-attribute confusions by making P(E) smaller. In some cases it might be desirable to calculate κ first for identification of attributes and then for values within attributes, or to average κ for each attribute to produce an overall κ for the task.

A wide range of cost measures have been used in previous work; these include pure efficiency measures such as the number of turns or elapsed time to complete the task (Abella, Brown, and Buntschuh, 1996; Hirschman et al., 1990; Smith and Gordon, 1997; Walker, 1996), as well as measures of qualitative phenomena such as inappropriate or repair utterances (Danieli and Gerbino, 1995; Hirschman and Pao, 1993; Simpson and Fraser, 1993).

PARADISE represents each cost measure as a function c_i that can be applied to any (sub)dialogue. First, consider the simplest case of calculating efficiency measures over a whole dialogue. For example, let c_1 be the total number of utterances. For the whole dialogue D1 in Figure 2, c_1(D1) is 23 utterances. For the whole dialogue D2 in Figure 3, c_1(D2) is 10 utterances.

To calculate costs over subdialogues and for some of the qualitative measures, it is necessary to be able to specify which information goals each utterance contributes to. PARADISE uses its AVM representation to link the information goals of the task to any arbitrary dialogue behavior, by tagging the dialogue with the attributes for the task.9 This makes it possible to evaluate any potential dialogue strategies for achieving the task, as well as to evaluate dialogue strategies that operate at the level of dialogue subtasks (subdialogues).

[Figure 4: Task-defined discourse structure of Agent A dialogue interaction. Segment S1 (utterances A1...A9; goals DC, AC, DR, DT) spans the whole dialogue and contains S2 (utterances U1...U6; goals DC, AC), S5 (utterances A7...U8; goal DR), and S6 (utterance A9; goal DT); S2 in turn contains S3 (utterances A3...U5; goal DC) and S4 (utterances A6...U6; goal AC).]

9 This tagging can be hand generated, or system generated and hand corrected. Preliminary studies indicate that reliability for human tagging is higher for AVM attribute tagging than for other types of discourse segment tagging (Passonneau and Litman, 1997; Hirschberg and Nakatani, 1996).

Consider the longer versions of Dialogues 1 and 2 in Figures 2 and 3. Each utterance in Figures 2 and 3 has been tagged using one or more of the attribute abbreviations in Table 1, according to the subtask(s) the utterance contributes to. As a convention of this type of tagging, utterances that contribute to the success of the whole dialogue, such as greetings, are tagged with all the attributes. Since the structure of a dialogue reflects the structure of the task (Carberry, 1989; Grosz and Sidner, 1986; Litman and Allen, 1990), the tagging of a dialogue by the AVM attributes can be used to generate a hierarchical discourse structure such as that shown in Figure 4 for Dialogue 1 (Figure 2). For example, segment (subdialogue) S2 in Figure 4 is about both depart-city (DC) and arrival-city (AC). It contains segments S3 and S4 within it, and consists of utterances U1 ... U6.

Tagging by AVM attributes is required to calculate costs over subdialogues, since for any subdialogue, task attributes define the subdialogue. For subdialogue S4 in Figure 4, which is about the attribute arrival-city and consists of utterances A6 and U6, c_1(S4) is 2.

Tagging by AVM attributes is also required to calculate the cost of some of the qualitative measures, such as the number of repair utterances. (Note that to calculate such costs, each utterance in the corpus of dialogues must also be tagged with respect to the qualitative phenomenon in question, e.g. whether the utterance is a repair.10) For example, let c_2 be the number of repair utterances. The repair utterances in Figure 2 are A3 through U6, thus c_2(D1) is 10 utterances and c_2(S4) is 2 utterances. The repair utterance in Figure 3 is U2, but note that according to the AVM task tagging, U2 simultaneously addresses the information goals for depart-range. In general, if an utterance U contributes to the information goals of N different attributes, each attribute accounts for 1/N of any costs derivable from U. Thus, c_2(D2) is .5.
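Under one reading of this 1/N convention, a repair cost can be computed from attribute-tagged utterances as in the sketch below. The triple representation, and the idea of recording which attributes an utterance actually repairs, are our own devices for illustration:

```python
# Sketch: qualitative cost c2 (number of repair utterances) over a tagged
# (sub)dialogue. Each utterance is (id, attribute_tags, repaired_attributes),
# a representation we chose for illustration.
def c2(utterances):
    """An utterance tagged with N attributes splits its cost equally among
    them, so only the shares of the attributes it repairs count as repair
    cost."""
    total = 0.0
    for _, tags, repaired in utterances:
        total += len(repaired & tags) / len(tags)
    return total

# Figure 3's U2 is tagged DC and DR but repairs only DC (it simultaneously
# advances the depart-range goal), so c2(D2) = 1/2 = 0.5.
print(c2([("U2", {"DC", "DR"}, {"DC"})]))   # 0.5
```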

Given a set of c_i, it is necessary to combine the different cost measures in order to determine their relative contribution to performance. The next section explains how to combine κ with a set of c_i to yield an overall performance measure.

2.4 Estimating a Performance Function

Given the definition of success and costs above and the model in Figure 1, performance for any (sub)dialogue D is defined as follows:11

10 Previous work has shown that this can be done with high reliability (Hirschman and Pao, 1993).

11 We assume an additive performance (utility) function because it appears that κ and the various cost factors c_i are utility independent and additive independent (Keeney and Raiffa, 1976). It is possible however that user satisfaction data collected in future experiments (or other data such as willingness to pay or use) would indicate otherwise. If so, continuing use of an additive function might require a transformation of the data, a reworking of the model shown in Figure 1, or the inclusion of interaction terms in the model (Cohen, 1995).

\[
\textit{Performance} = \alpha \, N(\kappa) \; - \; \sum_{i=1}^{n} w_i \, N(c_i)
\]

Here α is a weight on κ, the cost functions c_i are weighted by w_i, and N is a Z score normalization function (Cohen, 1995).

The normalization function is used to overcome the problem that the values of c_i are not on the same scale as κ, and that the cost measures c_i may also be calculated over widely varying scales (e.g. response delay could be measured using seconds while, in the example, costs were calculated in terms of the number of utterances). This problem is easily solved by normalizing each factor x to its Z score:


\[
N(x) = \frac{x - \bar{x}}{\sigma_x}
\]

where \bar{x} is the mean and σ_x is the standard deviation for x.

user      agent   US    κ      c_1 (#utt)   c_2 (#rep)
1         A       1     1      46           30
2         A       2     1      50           30
3         A       2     1      52           30
4         A       3     1      40           20
5         A       4     1      23           10
6         A       2     1      50           36
7         A       1     0.46   75           30
8         A       1     0.19   60           30
9         B       6     1      8            0
10        B       5     1      15           1
11        B       6     1      10           0.5
12        B       5     1      20           3
13        B       1     0.19   45           18
14        B       1     0.46   50           22
15        B       2     0.19   34           18
16        B       2     0.46   40           18
Mean(A)   A       2     0.83   49.5         27
Mean(B)   B       3.5   0.66   27.8         10.1
Mean      NA      2.75  0.75   38.6         18.5

Table 5: Hypothetical performance data from users of Agents A and B

To illustrate the method for estimating a performance function, we will use a subset of the data from Tables 3 and 4, shown in Table 5. Table 5 represents the results from a hypothetical experiment in which eight users were randomly assigned to communicate with Agent A and eight users were randomly assigned to communicate with Agent B. Table 5 shows user satisfaction (US) ratings (discussed below), κ, number of utterances (#utt) and number of repair utterances (#rep) for each of these users. Users 5 and 11 correspond to the dialogues in Figures 2 and 3 respectively. To normalize c_1 for user 5, we determine that the mean of c_1 is 38.6 and its standard deviation is 18.9. Thus, N(c_1) for user 5 is -0.83. Similarly, N(c_1) for user 11 is -1.51.
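These numbers can be checked directly; note that it is the sample standard deviation (n - 1 in the denominator) that reproduces the 18.9 above. A small sketch:

```python
import statistics

def z_score(x, sample):
    """Z score normalization N(x) = (x - mean) / standard deviation."""
    return (x - statistics.mean(sample)) / statistics.stdev(sample)

# c_1 (#utt) for the 16 users in Table 5:
c1 = [46, 50, 52, 40, 23, 50, 75, 60, 8, 15, 10, 20, 45, 50, 34, 40]
print(round(z_score(23, c1), 2))   # -0.83 (user 5)
print(round(z_score(10, c1), 2))   # -1.51 (user 11)
```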

To estimate the performance function, the weights α and w_i must be solved for. Recall that the claim implicit in Figure 1 was that the relative contribution of task success and dialogue costs to performance should be calculated by considering their contribution to user satisfaction.


User satisfaction is typically calculated with surveys that ask users to specify the degree to which they agree with one or more statements about the behavior or the performance of the system. A single user satisfaction measure can be calculated from a single question, or as the mean of a set of ratings. The hypothetical user satisfaction ratings shown in Table 5 range from a high of 6 to a low of 1.

Given a set of dialogues for which user satisfaction (US), κ and the set of c_i have been collected experimentally, the weights α and w_i can be solved for using multiple linear regression. Multiple linear regression produces a set of coefficients (weights) describing the relative contribution of each predictor factor in accounting for the variance in a predicted factor. In this case, on the basis of the model in Figure 1, US is treated as the predicted factor. Normalization of the predictor factors (κ and c_i) to their Z scores guarantees that the relative magnitude of the coefficients directly indicates the relative contribution of each factor. Regression on the Table 5 data for both sets of users tests which of the factors κ, #utt, and #rep most strongly predicts US.

In this illustrative example, the results of the regression with all factors included show that only κ and #rep are significant (p < .02). In order to develop a performance function estimate that includes only significant factors and eliminates redundancies, a second regression including only significant factors must then be done. In this case, a second regression yields the predictive equation:

\[
\textit{Performance} = .40 \, N(\kappa) - .78 \, N(c_2)
\]

i.e., α is .40 and w_2 is .78. The results also show that κ is significant at p < .0003, #rep is significant at p < .0001, and that the combination of κ and #rep accounts for 92% of the variance in US, the external validation criterion. The factor #utt was not a significant predictor of performance, in part because #utt and #rep are highly redundant. (The correlation between #utt and #rep is 0.91.)
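The second-stage regression can be reproduced from Table 5 with ordinary least squares. The sketch below uses numpy rather than a statistics package, and normalizes US as well as the predictors so that the coefficients come out as the fully standardized weights reported above:

```python
import numpy as np

def normalize(v):
    """Z score with the sample standard deviation (as in Section 2.4)."""
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std(ddof=1)

# Hypothetical data from Table 5 (users 1-8: Agent A; users 9-16: Agent B).
us    = [1, 2, 2, 3, 4, 2, 1, 1, 6, 5, 6, 5, 1, 1, 2, 2]
kappa = [1, 1, 1, 1, 1, 1, .46, .19, 1, 1, 1, 1, .19, .46, .19, .46]
reps  = [30, 30, 30, 20, 10, 36, 30, 30, 0, 1, .5, 3, 18, 22, 18, 18]

# Regress N(US) on the significant predictors N(kappa) and N(c2); since all
# variables are centered by normalization, no intercept term is needed.
X = np.column_stack([normalize(kappa), normalize(reps)])
w, *_ = np.linalg.lstsq(X, normalize(us), rcond=None)
print(np.round(w, 2))   # [ 0.4  -0.78], i.e. alpha = .40 and w_2 = .78
```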

Given these predictions about the relative contribution of different factors to performance, it is then possible to return to the problem first introduced in Section 1: given potentially conflicting performance criteria such as robustness and efficiency, how can the performance of Agent A and Agent B be compared? Given values for α and w_i, performance can be calculated for both agents using the equation above. The mean performance of A is -.44 and the mean performance of B is .44, suggesting that Agent B may perform better than Agent A overall.

The evaluator must then however test these performance differences for statistical significance. In this case, a t test shows that the differences are significant only at the p < .07 level, indicating a trend; an evaluation over a larger subset of the user population would probably show significant differences.
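Continuing the sketch above, the per-dialogue performance scores and the significance test might look as follows (scipy's two-sample t test stands in for whatever test the evaluator prefers):

```python
from scipy import stats

# Per-dialogue performance with the estimated weights, reusing normalize,
# kappa and reps from the regression sketch above.
perf = 0.40 * normalize(kappa) - 0.78 * normalize(reps)
perf_a, perf_b = perf[:8], perf[8:]
print(round(perf_a.mean(), 2), round(perf_b.mean(), 2))   # -0.44 0.44

# Two-sample t test between the agents' performance scores; on these
# hypothetical data it comes out around p = .065, i.e. only p < .07.
t_stat, p_value = stats.ttest_ind(perf_a, perf_b)
print(round(p_value, 3))
```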

2.5 Application to Subdialogues

Since both κ and c_i can be calculated over subdialogues, performance can also be calculated at the subdialogue level by using the values for α and w_i as solved for above. This assumes that the factors that are predictive of global performance, based on US, generalize as predictors of local performance, i.e. within subdialogues defined by subtasks, as defined by the attribute tagging.12

Consider calculating the performance of the dialogue strategies used by train timetable Agents A and B over the subdialogues that repair the value of depart-city. Segment S3 (Figure 4) is an example of such a subdialogue with Agent A. As in the initial estimation of a performance function, our analysis requires experimental data, namely a set of values for κ and c_i, and the application of the Z score normalization function to this data. However, the values for κ and c_i are now calculated at the subdialogue rather than the whole dialogue level. In addition, only data from comparable strategies can be used to calculate the mean and standard deviation for normalization. Informally, a comparable strategy is one which applies in the same state and has the same effects.

For example, to calculate κ for Agent A over the subdialogues that repair depart-city, P(A) and P(E) are computed using only the subpart of Table 3 concerned with depart-city. For Agent A, P(A) = .78, P(E) = .265, and κ = .70. Then, this value of κ is normalized using data from comparable subdialogues with both Agent A and Agent B. Based on the data in Tables 3 and 4, the mean κ is .515 and σ is .261, so that N(κ) for Agent A is .71.

To calculate c_2 for Agent A, assume that the average number of repair utterances for Agent A's subdialogues that repair depart-city is 6, that the mean over all comparable repair subdialogues is 4, and that the standard deviation is 2.79. Then N(c_2) is .72.

Let Agent A's repair dialogue strategy for subdialogues repairing depart-city be RA and Agent B's repair strategy for depart-city be RB. Then using the performance equation above, the predicted performance for RA is:

\[
\textit{Performance}(R_A) = .40 \times .71 - .78 \times .72 = -0.28
\]

For Agent B, using the appropriate subpart of Table 4 to calculate κ, and assuming that the average number of depart-city repair utterances is 1.38, similar calculations yield

12 This assumption has a sound basis in theories of dialogue structure (Carberry, 1989; Grosz and Sidner, 1986; Litman and Allen, 1990), but should be tested empirically.


\[
\textit{Performance}(R_B) = .40 \times (-.71) - .78 \times (-.94) = 0.45
\]

Thus the results of these experiments predict that when an agent needs to choose between the repair strategy that Agent B uses and the repair strategy that Agent A uses for repairing depart-city, it should use Agent B's strategy RB, since Performance(RB) is predicted to be greater than Performance(RA).
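The subdialogue-level arithmetic is the same performance equation, only with normalization statistics drawn from the comparable subdialogues; a sketch with the paper's illustrative numbers:

```python
def z(x, mean, sd):
    """Normalize against the mean and standard deviation of comparable
    subdialogues (here, depart-city repair subdialogues of Agents A and B)."""
    return (x - mean) / sd

# Agent A's depart-city repair subdialogues, using the figures above:
n_kappa = z(0.70, mean=0.515, sd=0.261)   # about 0.71
n_c2    = z(6.0,  mean=4.0,   sd=2.79)    # about 0.72
print(round(0.40 * n_kappa - 0.78 * n_c2, 2))   # -0.28 = Performance(RA)
```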

Note that the ability to calculate performance over subdialogues allows us to conduct experiments that simultaneously test multiple dialogue strategies. For example, suppose Agents A and B had different strategies for presenting the value of depart-time (in addition to different confirmation strategies). Without the ability to calculate performance over subdialogues, it would be impossible to test the effect of the different presentation strategies independently of the different confirmation strategies.

2.6 Summary

We have presented the PARADISE framework, and have used it to evaluate two hypothetical dialogue agents in a simplified train timetable task domain. We used PARADISE to derive a performance function for this task, by estimating the relative contribution of a set of potential predictors to user satisfaction. The PARADISE methodology consists of the following steps:

• definition of a task and a set of scenarios;

• specification of the AVM task representation;

• experiments with alternate dialogue agents for the task;

• calculation of user satisfaction using surveys;

• calculation of task success using κ;

• calculation of dialogue cost using efficiency and qualitative measures;

• estimation of a performance function using linear regression and values for user satisfaction, κ and dialogue costs;

• comparison with other agents/tasks to determine which factors generalize;

• refinement of the performance model.

Note that all of these steps are required to develop the performance function. However, once the weights in the performance function have been solved for, user satisfaction ratings no longer need to be collected. Instead, predictions about user satisfaction can be made on the basis of the predictor variables, as illustrated in the application of PARADISE to subdialogues.

Given the current state of knowledge, it is important to emphasize that researchers should be cautious about generalizing a derived performance function to other agents or tasks. Performance function estimation should be done iteratively over many different tasks and dialogue strategies to see which factors generalize. In this way, the field can make progress on identifying the relationship between various factors and can move towards more predictive models of spoken dialogue agent performance.

3 Generality

In the previous section we used PARADISE to evaluate two confirmation strategies, using as examples fairly simple information access dialogues in the train timetable domain. In this section we demonstrate that PARADISE is applicable to a range of tasks, domains, and dialogues, by presenting AVMs for two tasks involving more than information access, and showing how additional dialogue phenomena can be tagged using AVM attributes.

attribute           possible values               information flow
depart-city (DC)    Milano, Roma, Torino, Trento  to agent
arrival-city (AC)   Milano, Roma, Torino, Trento  to agent
depart-range (DR)   morning, evening              to agent
depart-time (DT)    6am, 8am, 6pm, 8pm            to user
request-type (RT)   reserve, purchase             to agent

Table 6: Attribute value matrix, train timetable domain with requests

First, consider an extension of the train timetable task, where an agent can handle requests to reserve a seat or purchase a ticket. This task could be represented using the AVM in Table 6 (an extension of Table 1), where the agent must now acquire the value of the attribute request-type, in order to know what to do with the other information it has acquired.

U1: I want to go from Torino to Roma                          DC,AC
C1: Approximately what time of day would you like to travel?  DR
U2: What are the options?                                     DR
C2: Morning or evening.                                       DR
U3: Are those departure times?                                DR
C3: Yes.                                                      DR
U4: I'd like to leave in the morning.                         DR
C4: Train 702 leaves Torino Porto at 8 a.m.                   DT
U5: Please reserve me a seat on that train.                   RT

Figure 5: Hypothetical Agent C dialogue interaction

Figure 5 presents a hypothetical dialogue in this extended task domain, and illustrates user utterance types and an agent dialogue strategy that are very different from those in Figures 2 and 3. First, Agent C in Figure 5 uses a “no confirmation” dialogue strategy, in contrast to the explicit and implicit confirmation strategies used in Figures 2 and 3. Second, Figure 5 illustrates new types of user utterances that do not directly further the informational goals of the task. In U2, the user asks the agent a wh-question about the DR attribute itself, rather than providing information about that attribute's value. Since U2 satisfies a knowledge precondition related to answering C1, U2 contributes to the DR goal and is tagged as such. In U3, the user similarly asks a yes-no question that addresses a subgoal related to answering C1. Finally, U5 illustrates a user request for an agent action, and is tagged with the RT attribute. The value of RT in the AVM instantiation for the dialogue would be “reserve.”

Second, consider the very different domain and task of diagnosing a fault and repairing a circuit (Smith and Gordon, 1997). Figure 6 presents one dialogue from this domain.

 1. C: This is the circuit fix it shop.                                      ID,CB,RB,FT,FC,T
       How may I help you?                                                   ID,CB,RB,FT,FC,T
 2. U: I want to fix a circuit.                                              ID,CB,RB,FT,FC,T
 3. C: What is the ID of the circuit?                                        ID
 4. U: Rs111.                                                                ID
 5. C: I am familiar with that circuit. The LED is supposed to be
       displaying alternately flashing one and seven.                        CB
 6. U: The LED is off.                                                       RB
 7. C: What is the switch at when the LED is off?                            RB
 8. U: The switch is down.                                                   RB
 9. C: Put the switch up.                                                    RB
10. U: Okay.                                                                 RB
11. C: Put the knob to one zero.                                             RB
12. U: Okay.                                                                 RB
13. C: What is the LED displaying?                                           RB
14. U: LED is displaying only a flashing seven.                              RB
15. C: Is there a wire between connector eight four and connector nine nine? FT
16. U: No.                                                                   FT
17. C: Add a wire between connector eight four and connector nine nine.      FC
18. U: Done.                                                                 FC
19. C: What is the LED displaying?                                           T
20. U: Alternately flashing one and seven.                                   T
21. C: Is the one on the LED displaying for a longer period of time?         T
22. U: No.                                                                   T
23. C: Is the seven on the LED displaying for a longer period of time?       T
24. U: No.                                                                   T
25. C: Put the knob to zero.                                                 T
26. U: Okay.                                                                 T
27. C: What is the LED displaying?                                           T
28. U: Alternately displaying one and seven.                                 T
29. C: Is the one on the LED displaying for a longer period of time?         T
30. U: Yes.                                                                  T
31. C: Put the switch down.                                                  T
32. U: The switch is down.                                                   T
33. C: What is the LED displaying?                                           T
34. U: Nothing.                                                              T
35. C: The circuit is working correctly. Good-bye.                           T

Figure 6: A circuit domain dialogue (Smith and Gordon, 1997), with AVM tagging

Smith and Gordon collected 144 dialogues for this task, in which agent initiative was varied by using different dialogue strategies, and tagged each dialogue according to the following subtask structure:13

• Introduction (I)—establish the purpose of the task

• Assessment (A)—establish the current behavior

• Diagnosis (D)—establish the cause for the errant behavior

• Repair (R)—establish that the correction for the errant behavior has been made

• Test (T)—establish that the behavior is now correct

Our informational analysis of this task results in the AVM shown in Table 7. Note that the attributes are almost identical to Smith and Gordon's list of subtasks. Circuit-ID corresponds to Introduction, Correct-Circuit-Behavior and Current-Circuit-Behavior correspond to Assessment, Fault-Type corresponds to Diagnosis, Fault-Correction corresponds to Repair, and Test corresponds to Test. The attribute names emphasize information exchange, while the subtask names emphasize function.

attribute                       possible values
Circuit-ID (ID)                 RS111, RS112, ...
Correct-Circuit-Behavior (CB)   Flash-1-7, Flash-1, ...
Current-Circuit-Behavior (RB)   Flash-7
Fault-Type (FT)                 MissingWire84-99, MissingWire88-99, ...
Fault-Correction (FC)           yes, no
Test (T)                        yes, no

Table 7: Attribute value matrix, circuit domain

Figure 6 is tagged with the attributes from Table 7. Smith and Gordon's tagging of this dialogue according to their subtask representation was as follows: turns 1-4 were I, turns 5-14 were A, turns 15-16 were D, turns 17-18 were R, and turns 19-35 were T. Note that there are only two differences between the dialogue structures yielded by the two tagging schemes. First, in our scheme (Figure 6), the greetings (turns 1 and 2) are tagged with all the attributes. Second, Smith and Gordon's single tag A corresponds to two attribute tags in Table 7, which in our scheme defines an extra level of structure within assessment subdialogues.

4 Discussion

This paper presented PARADISE, a general framework for evaluating spoken dialogue agents that integrates and enhances previous work. PARADISE supports comparisons among dialogue strategies with a task representation that decouples what an agent needs to achieve in terms of the task requirements from how the agent carries out the task via dialogue.

13 They report a κ of .82 for reliability of their tagging scheme.

Furthermore, this task representation supports the calculation of performance over subdialogues as well as whole dialogues. In addition, because PARADISE's success measure normalizes for task complexity, it provides a basis for comparing agents performing different tasks.

The PARADISE performance measure is a function of both task success (κ) and dialogue costs (c_i), and has a number of advantages. First, it allows us to evaluate performance at any level of a dialogue, since κ and c_i can be calculated for any dialogue subtask. Since performance can be measured over any subtask, and since dialogue strategies can range over subdialogues or the whole dialogue, we can associate performance with individual dialogue strategies. Second, because our success measure κ takes into account the complexity of the task, comparisons can be made across dialogue tasks. Third, κ allows us to measure partial success at achieving the task. Fourth, performance can combine both objective and subjective cost measures, and specifies how to evaluate the relative contributions of those cost factors to overall performance. Finally, to our knowledge, we are the first to propose using user satisfaction to determine weights on factors related to performance.

In addition, this approach is broadly integrative, incorporating aspects of transaction success, concept accuracy, multiple cost measures, and user satisfaction. In our framework, transaction success is reflected in κ, corresponding to dialogues with a P(A) of 1. Our performance measure also captures information similar to concept accuracy, where low concept accuracy scores translate into either higher costs for acquiring information from the user, or lower κ scores.

One limitation of the PARADISE approach is that the task-based success measure does not reflect that some solutions might be better than others. For example, in the train timetable domain, we might like our task-based success measure to give higher ratings to agents that suggest express over local trains, or that provide helpful information that was not explicitly requested, especially since the better solutions might occur in dialogues with higher costs. It might be possible to address this limitation by using the interval scaled data version of κ (Krippendorf, 1980). Another possibility is to simply substitute a domain-specific task-based success measure in the performance model for κ.

The evaluation model presented here has many applications in spoken dialogue processing. We believe that the framework is also applicable to other dialogue modalities, and to human-human task-oriented dialogues. In addition, while there are many proposals in the literature for algorithms for dialogue strategies that are cooperative, collaborative or helpful to the user (Webber and Joshi, 1982; Pollack, Hirschberg, and Webber, 1982; Joshi, Webber, and Weischedel, 1984; Chu-Carrol and Carberry, 1995), very few of these strategies have been evaluated as to whether they improve any measurable aspect of a dialogue interaction. As we have demonstrated here, any dialogue strategy can be evaluated, so it should be possible to show that a cooperative response, or other cooperative strategy, actually improves task performance by reducing costs or increasing task success. We hope that this framework will be broadly applied in future dialogue research.

5 Acknowledgments

We would like to thank James Allen, Jennifer Chu-Carroll, Morena Danieli, Wieland Eckert, Giuseppe Di Fabbrizio, Don Hindle, Julia Hirschberg, Shri Narayanan, Jay Wilpon, Steve Whittaker and three anonymous reviewers for helpful discussion and comments on earlier versions of this paper.

References

[Abella, Brown, and Buntschuh 1996] Abella, Alicia, Michael K. Brown, and Bruce Buntschuh. 1996. Development principles for dialog-based interfaces. In ECAI-96 Spoken Dialog Processing Workshop, Budapest, Hungary.

[Bates and Ayuso 1993] Bates, Madeleine and Damaris Ayuso. 1993. A proposal for incremental dialogue evaluation. In Proceedings of the DARPA Speech and NL Workshop, pages 319-322.

[Carberry 1989] Carberry, S. 1989. Plan recognition and its use in understanding dialogue. In A. Kobsa and W. Wahlster, editors, User Models in Dialogue Systems. Springer Verlag, Berlin, pages 133-162.

[Carletta 1996] Carletta, Jean C. 1996. Assessing the reliability of subjective codings. Computational Linguistics, 22(2):249-254.

[Chu-Carrol and Carberry 1995] Chu-Carrol, Jennifer and Sandra Carberry. 1995. Response generation in collaborative negotiation. In Proceedings of the Conference of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 136-143.

[Cohen 1995] Cohen, Paul R. 1995. Empirical Methods for Artificial Intelligence. MIT Press, Boston.

[Danieli et al. 1992] Danieli, M., W. Eckert, N. Fraser, N. Gilbert, M. Guyomard, P. Heisterkamp, M. Kharoune, J. Magadur, S. McGlashan, D. Sadek, J. Siroux, and N. Youd. 1992. Dialogue manager design evaluation. Technical Report Project Esprit 2218 SUNDIAL, WP6000-D3.

[Danieli and Gerbino 1995] Danieli, Morena and Elisabetta Gerbino. 1995. Metrics for evaluating dialogue strategies in a spoken language system. In Proceedings of the 1995 AAAI Spring Symposium on Empirical Methods in Discourse Interpretation and Generation, pages 34-39.

[Doyle 1992] Doyle, Jon. 1992. Rationality and its roles in reasoning. Computational Intelligence, 8(2):376-409.

[Fraser 1995] Fraser, Norman M. 1995. Quality standards for spoken dialogue systems: a report on progress in EAGLES. In ESCA Workshop on Spoken Dialogue Systems, Vigso, Denmark, pages 157-160.

[Gale, Church, and Yarowsky 1992] Gale, William, Ken W. Church, and David Yarowsky. 1992. Estimating upper and lower bounds on the performance of word-sense disambiguation programs. In Proc. of 30th ACL, pages 249-256, Newark, Delaware.

[Grosz and Sidner 1986] Grosz, Barbara J. and Candace L. Sidner. 1986. Attentions, intentions and the structure of discourse. Computational Linguistics, 12:175-204.

[Hirschberg and Nakatani 1996] Hirschberg, Julia and Christine Nakatani. 1996. A prosodic analysis of discourse segments in direction-giving monologues. In 34th Annual Meeting of the Association for Computational Linguistics, pages 286-293.

[Hirschman et al. 1990] Hirschman, Lynette, Deborah A. Dahl, Donald P. McKay, Lewis M. Norton, and Marcia C. Linebarger. 1990. Beyond class A: A proposal for automatic evaluation of discourse. In Proceedings of the Speech and Natural Language Workshop, pages 109-113.

[Hirschman and Pao 1993] Hirschman, Lynette and Christine Pao. 1993. The cost of errors in a spoken language system. In Proceedings of the Third European Conference on Speech Communication and Technology, pages 1419-1422.

[Joshi, Webber, and Weischedel 1984] Joshi, Aravind K., Bonnie L. Webber, and Ralph M. Weischedel. 1984. Preventing false inferences. In COLING84: Proc. 10th International Conference on Computational Linguistics, pages 134-138.

[Kamm 1995] Kamm, Candace. 1995. User interfaces for voice applications. In David Roe and Jay Wilpon, editors, Voice Communication between Humans and Machines. National Academy Press, pages 422-442.

[Keeney and Raiffa 1976] Keeney, Ralph and Howard Raiffa. 1976. Decisions with Multiple Objectives: Preferences and Value Tradeoffs. John Wiley and Sons.

[Krippendorf 1980] Krippendorf, Klaus. 1980. Content Analysis: An Introduction to its Methodology. Sage Publications, Beverly Hills, Ca.

[Litman and Allen 1990] Litman, Diane and James Allen. 1990. Recognizing and relating discourse intentions and task-oriented plans. In Philip Cohen, Jerry Morgan, and Martha Pollack, editors, Intentions in Communication. MIT Press.

[Passonneau and Litman 1997] Passonneau, Rebecca J. and Diane Litman. 1997. Discourse segmentation by human and automated means. Computational Linguistics, 23(1).

[Polifroni et al. 1992] Polifroni, Joseph, Lynette Hirschman, Stephanie Seneff, and Victor Zue. 1992. Experiments in evaluating interactive spoken language systems. In Proceedings of the DARPA Speech and NL Workshop, pages 28-33.

[Pollack, Hirschberg, and Webber 1982] Pollack, Martha, Julia Hirschberg, and Bonnie Webber. 1982. User participation in the reasoning process of expert systems. In Proceedings First National Conference on Artificial Intelligence, pages 358-361.

[Shriberg, Wade, and Price 1992] Shriberg, Elizabeth, Elizabeth Wade, and Patti Price. 1992. Human-machine problem solving using spoken language systems (SLS): Factors affecting performance and user satisfaction. In Proceedings of the DARPA Speech and NL Workshop, pages 49-54.

[Siegel and Castellan 1988] Siegel, Sidney and N. J. Castellan. 1988. Nonparametric Statistics for the Behavioral Sciences. McGraw Hill.

[Simpson and Fraser 1993] Simpson, A. and N. A. Fraser. 1993. Black box and glass box evaluation of the SUNDIAL system. In Proceedings of the Third European Conference on Speech Communication and Technology, pages 1423-1426.

[Smith and Gordon 1997] Smith, Ronnie W. and Steven A. Gordon. 1997. Effects of variable initiative on linguistic behavior in human-computer spoken natural language dialog. Computational Linguistics, 23(1).

[Sparck-Jones and Galliers 1996] Sparck-Jones, Karen and Julia R. Galliers. 1996. Evaluating Natural Language Processing Systems. Springer.

[Walker 1996] Walker, Marilyn A. 1996. The effect of resource limits and task complexity on collaborative planning in dialogue. Artificial Intelligence Journal, 85(1-2):181-243.

[Webber and Joshi 1982] Webber, Bonnie and Aravind Joshi. 1982. Taking the initiative in natural language database interaction: Justifying why. In Coling 82, pages 413-419.