
Page 1

Characterizing Task-Oriented Dialog using a Simulated ASR Channel

Jason D. Williams

Machine Intelligence Laboratory, Cambridge University Engineering Department

Page 2

• Motivation for the data collection

• Experimental set-up

• Transcription & Annotation

• Effects of ASR error rate on…

– Turn length / dialog length

– Perception of error rate

– Task completion

– “Initiative”

– Overall satisfaction (PARADISE)

SACTI-1 Corpus: Simulated ASR-Channel – Tourist Information

Page 3

ASR channel vs. HH (human-human) channel

Properties

HH dialog:
• Frequent but brief overlaps
• 80% of utterances contain fewer than 12 words; 50% fewer than 5
• Approximately equal turn length
• Approximately equal balance of initiative
• About half of turns are ACK (often spliced)

ASR channel:
• Few overlaps
• Longer system turns; shorter user turns
• Initiative more often with system
• Virtually no turns are ACK
• Virtually no splicing

Observations

HH dialog:
• "Instant" communication
• Effectively perfect recognition of words
• Prosodic information carries additional information

ASR channel:
• Turns explicitly segmented
• Barge-in, end-pointed
• Prosody virtually eliminated
• ASR & parsing errors

Are models of HC (human-computer) dialog/grounding appropriate in the presence of the ASR channel?

Page 4

My approach

1. Study the ASR channel in the abstract
– WoZ experiments using a simulated ASR channel
– Understand how people behave with an "ideal" dialog manager
• For example, a grounding model
• Use these insights to inform state space and action set selection
– Note that the collected data has unique properties useful to:
• RL-based systems
• Hidden-state estimation
• User modeling

2. Formulate the dialog management problem as a POMDP (a minimal belief-update sketch follows below)
– Decompose the state into BN nodes, for example:
• Conversation state (grounding state)
• User action
• User belief (goal)
– Train using the collected data
– Solve using approximations
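As a concrete illustration, here is a minimal sketch of the belief update a POMDP dialog manager performs, written over a flat discrete state; the talk factors the state into BN nodes (conversation state, user action, user goal), but the update has the same shape. The array names and shapes below are my own, not from the talk.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Exact POMDP belief update: b'(s') = k * P(o | s', a) * sum_s P(s' | s, a) * b(s).

    b : (S,)    current belief over hidden dialog states
    a : int     index of the system action just taken
    o : int     index of the (noisy) observation, e.g. the ASR hypothesis
    T : (A,S,S) transition model P(s' | s, a), estimable from corpus data
    O : (A,S,O) observation model P(o | s', a), e.g. simulated-ASR confusions
    """
    b_pred = T[a].T @ b           # predict: sum over prior states
    b_new = O[a, :, o] * b_pred   # correct: weight by observation likelihood
    return b_new / b_new.sum()    # renormalize to a distribution
```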

Page 5

The paradox of "dialog data"

• To build a user model, we need to see the user's reaction to all kinds of misunderstandings

• However, most systems use a fixed policy

– Systems typically do not take different actions in the same situation

– Taking random actions is clearly not an option!

– Constraining actions means building very complex systems…

• … and which actions should be in the system’s repertoire?

Page 6

An ideal data collection…

• …would show users' reactions to a variety of error handling strategies (no fixed policy)

• BUT would not be nonsense dialogs!

• …would use the ASR channel

• …would explore a variety of operating conditions, e.g., WER

• …would not assume a particular state space

• ... would somehow “discover” the set of system actions

Page 7

Data collection set-up

[Diagram: the subject speaks; an end-pointer segments the speech; a typist transcribes it; an ASR-noise generator corrupts the transcript; the wizard's screen shows the corrupted input; the wizard talks back into a microphone.]

Page 8

ASR simulation state machine

• Simple energy-based barge-in
• User interrupts wizard

States: SILENCE, USER_TALKING, WIZARD_TALKING, TYPIST_TYPING

Transitions:
• SILENCE → WIZARD_TALKING: wizard starts talking
• WIZARD_TALKING → SILENCE: wizard stops talking
• SILENCE → USER_TALKING: user starts talking
• WIZARD_TALKING → USER_TALKING: user starts talking (barge-in)
• USER_TALKING → TYPIST_TYPING: user stops talking
• TYPIST_TYPING → SILENCE: typist done; reco result displayed
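A minimal sketch of this state machine in code. The states and event labels follow the slide; the transition-table representation (and treating undefined events as no-ops) is my own choice.

```python
from enum import Enum, auto

class ChannelState(Enum):
    SILENCE = auto()
    USER_TALKING = auto()
    WIZARD_TALKING = auto()
    TYPIST_TYPING = auto()

# (state, event) -> next state. The (WIZARD_TALKING, "user starts talking")
# entry implements barge-in: the user can interrupt the wizard.
TRANSITIONS = {
    (ChannelState.SILENCE, "wizard starts talking"): ChannelState.WIZARD_TALKING,
    (ChannelState.SILENCE, "user starts talking"): ChannelState.USER_TALKING,
    (ChannelState.WIZARD_TALKING, "wizard stops talking"): ChannelState.SILENCE,
    (ChannelState.WIZARD_TALKING, "user starts talking"): ChannelState.USER_TALKING,
    (ChannelState.USER_TALKING, "user stops talking"): ChannelState.TYPIST_TYPING,
    (ChannelState.TYPIST_TYPING, "typist done"): ChannelState.SILENCE,  # reco result displayed
}

def step(state: ChannelState, event: str) -> ChannelState:
    """Advance the FSM; events with no defined transition leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```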

Page 9

ASR simulation

• Simplified FSM-based recognizer
– Weighted finite state transducer (WFST)

• Flow
– Reference input
– Spell-checked against full dictionary
– Converted to phonetic string using full dictionary
– Phonetic lattice generated based on confusion model
– Word lattice produced
– Language model composed to re-score lattice
– "De-coded" to produce word strings
– N-best list extracted

• Various free variables to induce random behavior
• Plumb the N-best list for variability
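The full lattice/WFST pipeline is not reproduced here. As an illustrative stand-in, the sketch below is a toy word-level confusion model with independent substitution, deletion, and insertion errors (essentially the "naïve" baseline of the next slide, not the WFST model); the function name and error rates are hypothetical, and the rates play the role of the free variables that induce random behavior.

```python
import random

def corrupt_words(words, p_sub=0.10, p_del=0.03, p_ins=0.03, vocab=None, rng=None):
    """Toy confusion channel: independently substitute, delete, or insert words."""
    rng = rng or random.Random()
    vocab = vocab or words          # confusion candidates; here just the input words
    out = []
    for w in words:
        r = rng.random()
        if r < p_del:
            continue                       # deletion: drop the word
        elif r < p_del + p_sub:
            out.append(rng.choice(vocab))  # substitution: swap in a confusable word
        else:
            out.append(w)                  # pass through unchanged
        if rng.random() < p_ins:
            out.append(rng.choice(vocab))  # insertion: spurious extra word
    return out

# e.g. corrupt_words("where is the castle".split(), p_sub=0.3)
# might yield ['where', 'castle', 'the', 'castle']
```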

Page 10

ASR simulation evaluation

• Hypothesis: the simulation produces errors similar to errors induced by additive noise w.r.t. concept accuracy

• Assess concept accuracy as F-measure using an automated, data-driven procedure (HVS model)

• Plot concept accuracy of:
– Real additive noise
– A naïve confusion model (simple insertion, substitution, deletion)
– WFST confusion model

• WFST appears to follow real data much more closely than naïve model
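A simplified sketch of concept accuracy as an F-measure over (slot, value) pairs. The actual evaluation used an automated HVS-based procedure to extract concepts; here the concept lists are assumed given.

```python
from collections import Counter

def concept_f_measure(ref_concepts, hyp_concepts):
    """F-measure over concept tokens, e.g. ("food", "pizza"), treated as multisets."""
    ref, hyp = Counter(ref_concepts), Counter(hyp_concepts)
    matched = sum((ref & hyp).values())             # concepts present in both
    precision = matched / max(sum(hyp.values()), 1)
    recall = matched / max(sum(ref.values()), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g. concept_f_measure([("food", "pizza"), ("area", "centre")],
#                        [("food", "pizza"), ("area", "north")])  -> 0.5
```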

Page 11

Scenario & Tasks

• Tourist / tourist-information scenario
– Intentionally goal-directed
– Intentionally simple tasks
– Mixtures of simple information gathering and basic planning

• Wizard (information giver)
– Access to bus times, tram times, restaurants, hotels, bars, tourist attraction information, etc.

• User given a series of tasks
• Likert scores asked at end of each task
• 4 dialogs / user; 3 users / wizard

Example task: Finding the perfect hotel

You're looking for a hotel for you and your travelling partner that meets a number of requirements. You'd like the following:
• En suite rooms
• Quiet rooms
• As close to the main square as possible
Given those desires, find the least expensive hotel. You'd prefer not to compromise on your requirements, but of course you will if you must! Please indicate the location of the hotel on the map and fill in the boxes below.

Name of accommodation: ____
Cost per night for 2 people: ____

Page 12

User's Map

[Figure: map of the fictional town showing landmarks (shopping area, tourist info, post office, cinema, park, fountain, tower, castle, main square, Art Square, museum) and street names (Middle Road, Red Bridge, Bridge Nine, Cascade Road, Park Road, Riverside Drive, South Bridge, Castle Loop, Fountain Road, West Loop, Tower Hill Road, Alexander Street, North Road, Nine Street, Main Street).]

Page 13

Wizard's map

[Figure: the same town map, additionally marked with hotels H1–H6, restaurants R1–R6, and bars B1–B6.]

Page 14

Likert-scale questions

• User and wizard each given 6 questions after each task

Subject example:
1. In this task, I accomplished the goal.
2. In this task, I thought the speech recognition was accurate.
3. In this task, I found it difficult to communicate because of the speech recognition.
4. In this task, I believe the other subject was very helpful.
5. In this task, the other subject found using the speech recognition difficult.
6. Overall, I was very satisfied with this past task.

Response scale: Disagree strongly (1) · Disagree (2) · Disagree somewhat (3) · Neither agree nor disagree (4) · Agree somewhat (5) · Agree (6) · Strongly agree (7)

Page 15

Transcription

• User side transcribed during the experiments
– Prioritized for speed

"I NEED UH I'M LOOKING FOR A PIZZA"

• Wizard side transcribed using a subset of the LDC transcription guidelines – more detail

"ok %uh% (()) sure you –- i can"

epErrorEnd=true

Page 16

Annotation (acts)

• Each turn is a sequence of tags

• Inspired by Traum's "Grounding Acts"
– More detailed / easier to infer from surface words

Tag                 Meaning (OS = other speaker)
Request             Question/request requiring a response
Inform              Statement/provision of task information
Greet-Farewell      "Hello", "How can I help", "That's all", "Thanks", "Goodbye", etc.
ExplAck             Explicit statement of acknowledgement, showing the speaker's understanding of the OS
Unsolicited-Affirm  Explicit statement of acknowledgement, showing that the OS understands the speaker
HoldFloor           Explicit request for the OS to wait
ReqRepeat           Request for the OS to repeat their last turn
ReqAck              Request for the OS to show understanding
RspAffirm           Affirmative response to ReqAck
RspNegate           Negative response to ReqAck
StateInterp         Statement of the speaker's interpretation of the OS's intention
DisAck              Display of the speaker's lack of understanding of the OS
RejOther            Display of the OS's lack of understanding of the speaker's intention or desire

Page 17

Annotation (understanding)

• Each wizard turn was labeled to indicate whether the wizard understood the previous user turn

Label           Wizard's understanding of previous user turn
Full            All intentions understood correctly.
Partial         Some intentions understood; none misunderstood.
Non             Wizard made no guess at user intention.
Flagged-Mis     Wizard formed an incorrect hypothesis of the user's meaning and signalled a dialog problem.
Un-Flagged-Mis  Wizard formed an incorrect hypothesis of the user's meaning, accepted it as correct, and continued with the dialog.

Page 18

Corpus summary

WER target  # Wiz  # User  # Tasks  Completed in time limit  Per-turn WER  Per-dialog WER
None        2      6       24       83 %                     0 %           0 %
Low         4      12      48       83 %                     32 %          28 %
Med         4      12      48       77 %                     46 %          41 %
Hi          2      6       24       42 %                     63 %          60 %
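One plausible reading of the two WER columns, sketched below (this interpretation is my assumption, not stated on the slide): per-turn WER averages each turn's WER with equal weight, while per-dialog WER pools edit errors over all reference words in the dialog, so long turns count more.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution (free if equal)
        prev = cur
    return prev[-1]

def per_turn_wer(turns):
    """Mean of per-turn WERs; turns = [(ref_words, hyp_words), ...]."""
    return sum(edit_distance(r, h) / max(len(r), 1) for r, h in turns) / len(turns)

def per_dialog_wer(turns):
    """Pooled WER: total edit errors over total reference words."""
    errors = sum(edit_distance(r, h) for r, h in turns)
    words = sum(len(r) for r, _ in turns)
    return errors / max(words, 1)
```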

Page 19

Perception of ASR accuracy

How accurately do users & wizards perceive WER?

• Perceptions of recognition quality broadly reflected actual performance, but users consistently gave higher quality scores than wizards for the same WER

[Figure: average Likert score vs. WER target (Hi, Med, Low, None), plotted separately for wizard and user.]

Page 20

Average turn length (words)

How does WER affect wizard & user turn length?

• Wizard turn length increases
• User turn length stays relatively constant

[Figure: average words per turn vs. WER target (Hi, Med, Low, None), for wizard and user.]

Page 21

Grounding behavior

How does WER affect wizard grounding behavior?

• As WER increases, wizard grounding behaviors become increasingly prevalent

[Figure: stacked percentage of all tags vs. WER target (Hi, Med, Low, None); tags shown: Inform, ExplAck, Request, ReqRepeat, ReqAck, StateInterp, DisAck, all others.]

Page 22

Wizard understanding

How does WER affect the wizard's understanding status?

• Misunderstanding increases with WER…
• …and task completion falls (83%, 83%, 77%, 42%)

[Figure: percentage of wizard turns vs. WER target (Hi, Med, Low, None), broken down by understanding status: Full, Partial, Non, FlaggedMis, UnFlaggedMis.]

Page 23

Wizard strategies

• Classify each wizard turn into one of 5 "strategies"

Label     Meaning                                  Wiz. initiative?  Tags
REPAIR    Attempt to repair                        Yes               ReqAck, ReqRepeat, StateInterp, DisAck, RejOther
ASKQ      Ask task question                        Yes               Request
GIVEINFO  Provide task info                        No                Inform
RSPND     Non-initiative-taking grounding actions  No                ExplAck, RspAffirm, RspNegate, Unsolicited-Affirm
OTHER     Not included in analysis                 n/a               All others
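Read as a mapping from tag sets to strategy labels, the table could be applied as in the sketch below; the precedence order for turns carrying tags from more than one row is my assumption.

```python
REPAIR_TAGS = {"ReqAck", "ReqRepeat", "StateInterp", "DisAck", "RejOther"}
RSPND_TAGS = {"ExplAck", "RspAffirm", "RspNegate", "Unsolicited-Affirm"}

def classify_strategy(turn_tags):
    """Map a wizard turn (a sequence of tags) to one of the 5 strategies."""
    tags = set(turn_tags)
    if tags & REPAIR_TAGS:
        return "REPAIR"      # initiative-taking repair attempt
    if "Request" in tags:
        return "ASKQ"        # initiative-taking task question
    if "Inform" in tags:
        return "GIVEINFO"    # provides task information
    if tags & RSPND_TAGS:
        return "RSPND"       # non-initiative-taking grounding action
    return "OTHER"           # excluded from the analysis
```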

Page 24

Wizard strategies

Which strategies are most successful after known dialog trouble?

• This plot shows wizard understanding status one turn after known dialog trouble: the effect of 'REPAIR' vs. 'ASKQ'

• 'S' indicates significant differences

[Figure: percentage of wizard turns by understanding status (Full, Partial, Non, FlaggedMis, UnFlaggedMis) one turn after known dialog trouble, for REPAIR vs. ASKQ at the Hi and Med WER targets; 'S' marks significant differences.]

Page 25

User reactions to misunderstandings

How does a user respond after being misunderstood?

• Surprisingly little explicit indication!

            User turns including tag
WER target  DisAck  RejectOther  Request
None        N/A     N/A          N/A
Low         0.0 %   3.8 %        92.3 %
Med         2.5 %   19.0 %       75.9 %
Hi          0.0 %   12.3 %       87.0 %

Page 26

Level of wizard "initiative"

How does "initiative" vary with WER?

• Define wizard "initiative" using the strategies above

[Figure: percentage of wizard turns by strategy (REPAIR, ASKQ, RESPOND, GIVEINFO) vs. WER target (Hi, Med, Low, None).]

Page 27

Reward measures / PARADISE

• Satisfaction = Task completion + dialog cost metrics

• 2 kinds of user satisfaction:
– Single
– Combi

• 3 kinds of task completion:
– User
– Obj
– Hyb

• Cost metrics:
– PerDialogWER
– %UnFlaggedMis
– %FlaggedMis
– %Non
– Turns
– %REPAIR
– %ASKQ
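A minimal sketch of a PARADISE-style fit: user satisfaction regressed on z-normalized task-completion and cost metrics by ordinary least squares. The function and argument names are mine; this is the generic PARADISE recipe, not necessarily the exact procedure behind the tables that follow.

```python
import numpy as np

def paradise_fit(satisfaction, predictors):
    """OLS regression of satisfaction on z-normalized metrics.

    satisfaction : (N,) per-dialog satisfaction scores
    predictors   : dict of name -> (N,) metric values (Task, Turns, %UnFlaggedMis, ...)
    Returns ({name: weight}, R^2).
    """
    y = np.asarray(satisfaction, dtype=float)
    names = list(predictors)
    X = np.column_stack([np.asarray(predictors[n], dtype=float) for n in names])
    X = (X - X.mean(axis=0)) / X.std(axis=0)   # z-score (assumes non-constant metrics)
    X = np.column_stack([np.ones(len(y)), X])  # add intercept column
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    r2 = 1.0 - ((y - X @ w) ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return dict(zip(["intercept"] + names, w)), r2
```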

Page 28

Reward measures / PARADISE

• In almost all experiments using the User task completion metric, it was the only significant predictor

• The single/combi metrics almost always selected the same predictors

Dataset  Metric (task & user sat.)  R²    Significant predictors
ALL      User-S                     52 %  1.03 Task
ALL      User-C                     60 %  5.29 Task – 1.54 %UnFlagMis
ALL      Obj-S                      24 %  –0.49 Turns + 0.38 Task
ALL      Obj-C                      27 %  –2.43 Turns – 1.45 %UnFlagMis + 1.35 Task
ALL      Hyb-S                      41 %  0.74 Task – 0.36 Turns
Hi       Obj-S                      40 %  0.98 Task
Hi       Hyb-S                      48 %  1.07 Task
Med      Obj-S                      16 %  –0.62 %Non
Med      Obj-C                      37 %  –3.35 %Non – 2.94 Turns
Med      Hyb-S                      38 %  0.97 Task
Low      Obj-S                      28 %  –0.59 Turns
Low      Hyb-S                      40 %  –0.49 Turns + 0.40 Task

Page 29

Reward measures / PARADISE

What indicators best predict user satisfaction?

• When run on all data, mixtures of Task, Turns, and %UnFlaggedMis best predict user satisfaction.
– %UnFlaggedMis serves as a better measurement of understanding accuracy than WER alone, since it effectively combines recognition accuracy with a measure of confidence.

• Broadly speaking:
– Task completion is most important at the high WER level
– Task completion and dialog quality are most important at the med WER level
– Efficiency is most important at the low WER level

• These patterns mirror findings from other PARADISE experiments using human/computer data
• This gives us some confidence that this data set is valid for training human/computer systems

Page 30

Conclusions / Next steps

• At moderate WER levels, asking task-related questions appears to be more successful than direct dialog repair.
• Levels of expert "initiative" increase with WER, primarily as a result of grounding behavior.
• Users infrequently give a direct indication of having been misunderstood, with no clear correlation to WER.
• When run on all data, mixtures of Task, Turns, and %UnFlaggedMis best predict user satisfaction.
• Task completion appears to be most predictive of user satisfaction; however, efficiency shows some influence at lower WERs.

• Next… apply this corpus to statistical systems.

Page 31

Thanks!

Jason D. Williams

[email protected]