ESSLLI 2001, Helsinki

Languages for the Annotation and Specification of Dialogues

(updated 31-Oct-2001)

Gregor Erbach ([email protected])

Course Outline

1. Introduction to Spoken Dialogue Systems
2. Linguistic Resources in SDS
3. Developing Spoken Dialogue Applications
4. Annotation of Dialogues
   – Uses of annotated dialogues
   – Levels of annotation, multilevel annotation
   – Annotation Graphs
   – Annotation Frameworks (ATLAS)
5. Introduction to XML
6. Dialogue Annotation in XML (MATE)


Outline (2)

7. Evaluation of Spoken Dialogue Systems
8. Dialogue Specification Languages
   – Behaviouristic Models (pattern-response)
   – Finite-State Models
   – Slot-Filling
   – Condition-Action Rules (HDDL)
   – Planning
   – Re-usable Dialogue Behaviours: SpeechObjects
9. VoiceXML
10. Research Challenges


1. Spoken Dialogue Systems

• Human-machine dialogue differs from human-human dialogue:

– limited natural-language understanding

– limited vocabulary

– limited back-channel

– limited world knowledge and inference capabilities

– limited social and emotional competence

– speech recognition errors

• Design and implementation of dialogue systems is a discipline between science and engineering


1. Spoken Dialogue Systems

Dialogue System Architecture

[Figure: dialogue system architecture. Speech understanding and speech output are connected through dialogue control to the application logic / reasoning component and a database / knowledge base.]


1. Spoken Dialogue Systems

Dialogue Modelling

[Figure: the interaction model comprises a language model and a dialogue model. (from Bernsen, Dybkjær and Dybkjær, 1998)]


1. Spoken Dialogue Systems

Speech and Audio Processing

Speech Understanding

Signal processing:

– Convert the audio wave into a sequence of feature vectors

Speech recognition:

– Decode the sequence of feature vectors into a sequence of words

Semantic interpretation:

– Determine the meaning of the recognized words

Speech Output

Speech generation:

– Generate marked-up word string from system semantics

Speech synthesis:

– Generate synthetic speech from a marked-up word string


1. Spoken Dialogue Systems

Automatic Speech Recognition (ASR)

• Research activities since the 1950s

• In widespread commercial use for a number of years, enabled by increased processor power, larger memory and better software engineering

• Speech recognisers can be implemented on PCs as software-only applications


1. Spoken Dialogue Systems

ASR Fundamentals

• Digitisation of the acoustic signal

• Signal analysis: distribution of acoustic energy over time and frequency, represented as feature vectors

• Matching against stored patterns (acoustic models)

• Selection of the best pattern by using linguistic knowledge and world knowledge


1. Spoken Dialogue Systems

Signal Analysis

(Output of the speech analysis tool PRAAT)


1. Spoken Dialogue Systems

Challenges in ASR

• Speaker-independent recognition

• Variation of speakers (age, dialect, diseases ...)

• Vocabulary size

• Continuous speech

• Spontaneous speech

• Background noise

• Distorted signal transmission


1. Spoken Dialogue Systems

Difficulty vs. Vocabulary Size

[Figure: task difficulty plotted against vocabulary size (10 to 1M words). Voice dialling and device control have small vocabularies, dialogue systems intermediate ones, and dictation systems the largest.]


1. Spoken Dialogue Systems

The Speech Recognition Problem

• Bayes’ Law

– P(a,b) = P(a|b) P(b) = P(b|a) P(a)

– Joint probability of a and b = probability of b times the probability of a given b

• The Recognition Problem

– Find most likely sequence w of “words” given the sequence of acoustic observation vectors a

– Use Bayes’ law to create a generative model

– ArgMax_w P(w|a) = ArgMax_w P(a|w) P(w) / P(a) = ArgMax_w P(a|w) P(w)

• Acoustic Model: P(a|w)

• Language Model: P(w)

(from Carpenter and Chu-Carroll, 1998)
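A toy sketch of this decision rule in Python (the candidate word sequences and their log-scores are invented; a real decoder searches over a lattice rather than a fixed candidate list):

import math

# hypothetical log-probabilities: acoustic model log P(a|w), language model log P(w)
candidates = {
    "flights from boston today": (-12.0, -4.5),
    "flights from austin today": (-12.5, -4.8),
    "lights for boston to pay":  (-11.8, -9.0),
}

def decode(cands):
    # combine log P(a|w) + log P(w); P(a) is constant over w and can be dropped
    return max(cands, key=lambda w: sum(cands[w]))

print(decode(candidates))   # -> "flights from boston today"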


1. Spoken Dialogue Systems

Pronunciation Modelling

• Needed for speech recognition and synthesis

• Maps orthographic representation of words to sequence(s) of phones

• Dictionary doesn’t cover language due to:

– open classes

– names

– inflectional and derivational morphology

• Pronunciation variation can be modeled with multiple pronunciations and/or acoustic mixtures

• If multiple pronunciations are given, estimate likelihoods

• Use rules (e.g. assimilation, devoicing, flapping), or statistical transducers

(from Carpenter and Chu-Carroll, 1998)
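A minimal sketch of a weighted pronunciation lexicon in Python (the transcriptions are SAMPA, taken from the lexicon example in section 2; the variant likelihoods are invented for illustration):

# each orthographic word maps to (phone sequence, estimated likelihood) variants
lexicon = {
    "Abkommen": [("a p k O m @ n", 0.7),      # reduced variant
                 ("a p + k O m @ n", 0.3)],   # variant with morpheme boundary
}

def pronunciations(word):
    # variants sorted most likely first; empty list for out-of-vocabulary words
    return sorted(lexicon.get(word, []), key=lambda v: v[1], reverse=True)

print(pronunciations("Abkommen"))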


1. Spoken Dialogue Systems

Language Modelling

• Assigns probability P(w) to word sequence w = w1,w2,…,wk

• Bayes’ Law provides a history-based model:

P(w1,w2,…,wk) = P(w1) P(w2|w1) P(w3|w1,w2) … P(wk|w1,…,wk-1)

• Cluster histories to reduce number of parameters

(from Carpenter and Chu-Carroll, 1998)


1. Spoken Dialogue Systems

N-Gram Language Modelling

• The n-gram assumption clusters histories based on the last n-1 words

– P(wj|w1,…,wj-1) ≈ P(wj|wj-n+1,…,wj-1)

– unigrams ≈ P(wj)

– bigrams ≈ P(wj|wj-1)

– trigrams ≈ P(wj|wj-2,wj-1)

• Trigrams are often interpolated with the bigram and unigram estimates:

P̂(w3|w1,w2) = λ3 F(w3|w1,w2) + λ2 F(w3|w2) + λ1 F(w3)

– the λi are typically estimated by maximum likelihood estimation on held-out data (the F(.|.) are relative frequencies)

– many other interpolations exist (another standard is a non-linear backoff)

(from Carpenter and Chu-Carroll, 1998)
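A minimal sketch of this interpolation in Python (the toy corpus and the fixed λ values are invented; in practice the λi are estimated on held-out data):

from collections import Counter

# toy corpus; the F(.|.) below are relative frequencies over these counts
words = "we are above a caravan park we are above a field".split()
uni = Counter(words)
bi  = Counter(zip(words, words[1:]))
tri = Counter(zip(words, words[1:], words[2:]))
N = len(words)

def p_interp(w3, w1, w2, lambdas=(0.2, 0.3, 0.5)):
    l1, l2, l3 = lambdas     # fixed here; normally tuned on held-out data
    f_uni = uni[w3] / N
    f_bi  = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
    f_tri = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    return l3 * f_tri + l2 * f_bi + l1 * f_uni

print(p_interp("above", "we", "are"))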


1. Spoken Dialogue Systems

Recognition Grammars

• Restrict the possible user inputs at each step of the dialogue

• Restriction of possible inputs is necessary for speaker-independent systems to improve recognition accuracy

• Recognition grammars in commercial dialogue systems are generally regular or context-free grammars

• Dynamically generated grammars that are adapted to the state of the dialogue can be used

• Closed grammars match user input from beginning to end

• Open grammars match parts of the user input


1. Spoken Dialogue Systems

Finite-State Language Models

• Write a finite-state task grammar (with non-recursive CFG)

• Simple Java Speech API example (from user’s guide):

public <Command> = [<Polite>] <Action> <Object> (and <Object>)*;

<Action> = open | close | delete;

<Object> = the window | the file;

<Polite> = please;

• Typically assume that all transitions are equi-probable

• Technology used in most current applications

• Can put semantic actions in the grammar

(from Carpenter and Chu-Carroll, 1998)


1. Spoken Dialogue Systems

Java Speech Grammar Format

• Java Speech Grammar Format (JSGF) is a widely used format for recognition grammars

<xyz> Grammatical Category xyz

* Repetition (0 to n times)

+ Repetition (1 to n times)

(...) Grouping

[...] Grouping, optional

| Alternatives

/n/ Alternative with weight n


1. Spoken Dialogue Systems

Recognition Grammar in JSGF

#JSGF V1.0 ISO8859-1 en;

grammar com.acme.commands;

<basicCmd> = <startPolite> <command> <endPolite>;

<command> = <action> <object>;

<action> = /10/ open |/2/ close |/1/ delete |/1/ move;

<object> = [the | a] (window | file | menu);

<startPolite> = (please | kindly | could you | oh mighty computer) *;

<endPolite> = [ please | thanks | thank you ];


1. Spoken Dialogue Systems

Word hypothesis graphs

• Keep multiple tokens and return n-best paths/scores:

– p1 flights from Boston today

– p2 flights from Austin today

– p3 flights for Boston to pay

– p4 lights for Boston to pay

• Can produce a packed word graph (a.k.a. lattice)

– likelihoods of paths in lattice should equal likelihood for n-best

[Figure: word lattice packing the hypotheses above: flights/lights, from/for, Boston/Austin, today / to pay.]

(from Carpenter and Chu-Carroll, 1998)


1. Spoken Dialogue Systems

Measuring Recognition Performance

• Word Error Rate = (Insertions + Deletions + Substitutions) / Words

• Example scoring:

– actual utterance: four six seven nine three three seven

– recognizer: four oh six seven five three seven ("oh" inserted, "nine" substituted by "five", one "three" deleted)

– WER: (1 + 1 + 1)/7 = 43%

• Would like to study concept accuracy

– typically count only errors on content words [application dependent]

– ignore case marking (singular, plural, etc.)

• For word/concept spotting applications:

– recall: percentage of target words (concepts) found

– precision: percentage of hypothesized words (concepts) in target

(from Carpenter and Chu-Carroll, 1998)
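A minimal sketch of computing WER with a standard Levenshtein alignment (plain Python, no toolkit assumed):

def wer(ref, hyp):
    # dynamic-programming edit distance over words:
    # d[i][j] = cost of aligning ref[:i] with hyp[:j]
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i-1][j-1] + (r[i-1] != h[j-1])
            d[i][j] = min(sub, d[i-1][j] + 1, d[i][j-1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("four six seven nine three three seven",
          "four oh six seven five three seven"))   # -> 0.4286 (43%)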


1. Spoken Dialogue Systems

Dictation vs. Dialogue System

Speaker dependence:
– Dialogue system: speaker-independent
– Dictation system: speaker-dependent or speaker-adaptive (must be trained for each speaker)

Vocabulary size:
– Dialogue system: several thousand words, of which a subset is active
– Dictation system: up to 100,000 words, always active

Nature of the user input:
– Dialogue system: only certain patterns are recognised at each step
– Dictation system: unrestricted, including complex sentences


1. Spoken Dialogue Systems

Speaker Verification

• Speaker verification: confirm the claimed identity of a speaker

• Speaker identification: recognition of one speaker among a group of potential candidates

• Evaluation by means of the "false acceptance" and "false rejection" rates

• One measure can be improved at the expense of the other

• For high-security applications, speaker verification should be combined with other methods (password, chip card, biometrics...).


2. Linguistic Resources for Dialogue Systems

• Acoustic Models

• Phonetic Lexicon

• Language Models (Grammars)

• Dialogue Specifications

• System Output (Prompts)

• Training data: annotated human/human or human/machine dialogues


2. Linguistic Resources for SDS

Acoustic Models

• Tri-phone HMMs

• transcribed speech used for training

• orthographic transcriptions + noise markers + phonetic lexicon

• SpeechDat is a standard format for transcription. Each audio file is associated with a label file which contains the transcription plus information about the speaker (age, sex, education level) and the call (telephone network, environment)


2. Linguistic Resources for SDS

SPEECHDAT Label File

LHD: SAM, 6.0
DBN: SpeechDat_Austrian_Mobile
VOL: MOBIL1AT_01
SES: 0099
DIR: \MOBIL1AT\BLOCK00\SES0099
SRC: B10099C2.ATA
CCD: C2
BEG: 0
END: 63487
REP: Connect Austria, Vienna
RED: 02/Jan/2000
RET: 16:15:45
SAM: 8000
SNB: 1
SSB: 8
QNT: A-LAW

SCD: 000099
SEX: F
AGE: 22
ACC: NOE
REG: Wien
ENV: HOME
NET: MOBILE, A1
PHM: UNKNOWN, EFR
SHT: 600-0663
EDU: MATURA
NLN: DE-AT
ASS: OK
LBD:
LBR: 0,63487,,,,0354/329 851
LBO: 0,,63487,[sta] null drei fünf vier drei zwei neun acht fünf eins


2. Linguistic Resources for SDS

Phonetic Lexicon

• The phonetic lexicon consists of pairs <orthography, phonetic-representation+>, where the phonetic symbols correspond to the acoustic models used in the speech recogniser

• Phonetic lexicons are also used for text-to-speech synthesis.

• Example (with SAMPA transcriptions):

Abkehr a p k e: 6

Abkommen a p + k O m @ n a p k O m @ n

Abkommens a p k O m @ n s

Ablauf a p l aU f

Ablegers a p l e: g 6 s


2. Linguistic Resources for SDS

Language Models

• Two kinds of language models are widely used: statistical language models and recognition grammars

• Statistical LMs are generally used for dictation systems

• Recognition grammars are often used for speaker-independent dialogue systems

• Recognition grammars are often finite-state models, or non left-recursive context-free grammars

• Statistical LMs and recognition grammars can be combined (e.g. Philips, Nuance 8)

• Language models can be trained or optimised using text corpora or transcriptions of dialogues


2. Linguistic Resources for SDS

Dialogue Specifications

• Dialogue specifications are used to control the flow of the dialogue

• Dialogue specifications can be expressed

– as executable code in some programming language

– as a task model

– in some dialogue specification language

• Dialogue specifications must provide repair strategies to deal with recognition failures and unacceptable user input


2. Linguistic Resources for SDS

System Output (Prompts)

• Prompts are the speech output provided to the user of the dialogue system

• Prompts should

– be clear and understandable

– encourage the user to produce system-friendly speech input

– convey the personality chosen for the system

• Other audio sounds ("earcons") can be used in addition to prompts to provide orientation

• Prompts can be pre-defined, constructed by concatenation of partial prompts, or produced by a NL generator


1. Spoken Dialogue Systems

Speech Output

• Recorded vs. synthesised speech

• Recorded speech has higher user acceptance

• Ensure smooth transitions and appropriate prosody when concatenating recorded speech

• In case of large or highly variable vocabulary, speech synthesis must be used.

• Speech synthesisers are evaluated according to intelligibility and naturalness.


2. Linguistic Resources for SDS

Training data: annotated dialogues

• Transcribed speech data (not necessarily dialogues) for training of speech recogniser

• Text data (ideally transcriptions of dialogues from a running application) for training of language models and/or optimization of recognition grammars

• Labelled dialogues to determine the likely sequence of dialogue acts (dialogue grammar)

• Dialogues labelled with communication failures and emotional markup for optimizing dialogue specifications

• Annotated dialogues as a resource for system evaluation


3. Developing Spoken Dialogue Applications

• Conflicting requirements: system "intelligence" vs. control of the dialogue flow

• Imperfections of speech recognition (errors are the rule, not the exception)

• Limited "understanding" of user utterances (out of vocabulary, out of grammar)

• Dialogue system must take the initiative after dialogue failure and try to recover from the errors

• Personality of the dialogue application


3. Developing Spoken Dialogue Applications

Development Process

1. Requirements specification

2. Definition of dialogue flow

3. Rapid prototyping or Wizard-of-Oz Experiment (outputs: annotated dialogues, questionnaires, interviews)

4. Pilot system with basic functionality

5. Internal Tests

6. Transcription and annotation of dialogues

7. Optimisation of system functionality

8. Tests with external users

9. Extension and tuning of the system

10. If system performance is not yet satisfactory: go to 5


3. Developing Spoken Dialogue Applications

Tasks and Roles

• Gather requirements and produce requirement specification (Analyst)

• Specify dialogue flow (Dialogue Designer)

• Define prompts (Interaction Designer)

• Write and optimise recognition grammars (Grammar Writer)

• Usability testing with "real" users (Usability Tester)

• Transcribe and annotate dialogues from usability testing and deployed application (Annotator)

• Test and optimize grammars, language models and dialogues (Quality Assurance Engineer, Grammar Writer, Dialogue Designer)

• System Integration (Software Engineer)


3. Developing Spoken Dialogue Applications

Dialogue Initiative

1. System initiative

for systems that are not regularly used by the same users

2. User initiative

experienced users can issue commands without system prompts

3. Mixed initiative

e.g., for user questions or activation of help functionality

Over-answering of questions by the user


3. Developing Spoken Dialogue Applications

Barge-in

• "Barge-In" is the interruption of system output by user input

• Advantages:

– Possibility to interrupt long system outputs (e.g. timetable information, reading of e-mails)

– Faster answering of system questions for regular users

• Problems:

– Interruption of system output through background noise or side speech (to or from colleagues or children)

– Echo cancellation required to avoid activation of barge-in by system output


3. Developing Spoken Dialogue Applications

Verification of User Input

• Verification is the confirmation of user input by the system, with a possibility of correction

• Explicit Verification: User must confirm the input explicitly, usually by saying "yes" or "no"

• Implicit Verification: The user's input is repeated, and accepted if the user does not contradict.


3. Developing Spoken Dialogue Applications

Repair Strategies

• Misunderstandings and communication problems are common in human-human and in human-machine dialogues

• Repair strategies are used for recovering from communication failure.

• The relatively poor performance of speech recognisers causes many misunderstandings

• Repair strategies must therefore be part of every practical dialogue system


3. Developing Spoken Dialogue Applications

Causes of Communication Problems

• No speech detected (volume too low)

• Failure to detect beginning or end of speech accurately (endpointing)

• Misrecognitions or no recognition results due to

– background noise

– distorted speech transmission (microphone, phone line)

– out-of-vocabulary words

– out-of-grammar input

– speaker variation

– lack of semantic interpretation


3. Developing Spoken Dialogue Applications

Engineering Issues

• Telephony interfaces (analog, digital, Voice over IP)

• Speech signal quality

• Echo cancellation

• Scalability to hundreds or thousands of concurrent dialogues

• Real-time speed

• System integration


3. Developing Spoken Dialogue Applications

Speech Recognition API

• Abstract from recognition/synthesis engines

• Recognizer and synthesizer loading

• Acoustic and grammar model loading (dynamic updates)

• Recognition

– online

– n-best or lattice

• Synthesis

– markup

– barge in

• Acoustic control

– telephony interface

– microphone/speaker interface

(from Carpenter and Chu-Carroll, 1998)


3. Developing Spoken Dialogue Applications

Case Study: Postal Rates System

• Technical Basis

– Philips SpeechMania with Austrian German acoustic models

– Windows NT 4.0

– Pentium III (Minimum: Pentium II, 300 MHz)

– 256 MB RAM (Minimum: 64 MB)

– ISDN card

• Co-operation between Forschungszentrum Telekommunikation Wien (ftw.) and the Austrian Post Office


3. Developing Spoken Dialogue Applications

Case Study: System Requirements

• Mostly casual users

• System initiative

• Over-answering is allowed

• No Barge-In

• Speed of Interaction is important


3. Developing Spoken Dialogue Applications

Example Dialogue (1)

S1 (System): Möchten Sie ein Paket oder eine Briefsendung verschicken? [Would you like to send a parcel or a letter?]

B2 (User): eine Briefsendung [a letter]

S3: Möchten Sie|den Brief|ins Inland oder ins Ausland verschicken? [Do you want to send the letter domestically or abroad?]

B4: ins Ausland [abroad]

S5: In welches Zielland wollen Sie|den Brief|schicken? [To which destination country do you want to send the letter?]

B6: nach Italien [to Italy]

S7: Soll der Brief|nach Italien|als Priority versendet werden? [Should the letter to Italy be sent as priority mail?]

B8: nein danke bitte ich richtig (actual utterance: nein danke, das ist nicht nötig) [no thanks, that is not necessary]

S9: Bitte geben Sie das Gewicht des Briefes an [Please state the weight of the letter]


3. Developing Spoken Dialogue Applications

Example Dialogue (2)

B10: Niue Chile Sudan Bermuda zwanzig Gramm (actual utterance: na ich schätz so zwei hundert zwanzig Gramm) [well, I guess about two hundred and twenty grams]

S11: nach Italien|sind folgende Sonderbehandlungen zulässig: Einschreiben, Eilsendung und Nachnahme. Wollen Sie eine dieser Sonderbehandlungen nutzen, wenn ja, welche? [For Italy, the following special services are available: registered mail, express delivery, and cash on delivery. Do you want to use one of these services, and if so, which one?]

B12: nein richtig bitte kostet hier (actual utterance: nein das ist mir zu kompliziert) [no, that is too complicated for me]

S13: Danke für Ihre Angaben [Thank you for the information]

S14: Ein Brief|nach Italien|20|Gramm schwer, der|priority|versendet wird, kostet|7|Schilling [A letter to Italy, weighing 20 grams, sent as priority mail, costs 7 schillings]

S15: Benötigen Sie eine weitere Auskunft? [Do you need any further information?]

B16: nein danke [no, thanks]


4. Dialogue Annotation

• Purpose of dialogue annotation

– Linguistic description and analysis on different levels

– Resources for conversation analysis (sociological, socio-linguistic research)

– Resources for system engineering (acoustic models, language models)

– Resources for application development (Prompts, recognition grammars, dialogue design)

– Resources for system evaluation


4. Dialogue Annotation

Annotation Schemas

• Corpus Encoding Standard

• MATE

• ATLAS

• DAMSL

The MATE project provides a good overview of annotation schemas


4. Dialogue Annotation

Spoken Dialogue Corpora

• Human-Human

– Call Home (spontaneous telephone speech)

– Map Task (direction giving on a map)

– Switchboard (spontaneous human-human telephone conversations)

– Childes (child language dialogues)

– Verbmobil (appointment scheduling dialogues)

– TRAINS (task-oriented dialogues in railroad freight domain)

• Human-Machine

– Danish Dialogue System (57 dialogues, domestic flight reservation)

– Philips (13500 dialogues, train timetable information)

– Sundial (100 Wizard of Oz dialogues, British flight information)


4. Dialogue Annotation

Audio Properties in Corpora

• Sampling rate (samples/sec, Hz)

• Audio resolution (bit)

• Linear vs. logarithmic coding (a-law, µ-law)

• Mono vs. Stereo

• Type of microphone and recording environment

• Audio coding / compression



4. Dialogue Annotation

Map Task Corpus

• Map Task is a cooperative task involving two participants who sit opposite one another and each has a map which the other cannot see

• One speaker (Instruction Giver) has a route marked on her map; the other speaker (Instruction Follower) has no route

• Speakers are told that the goal is to reproduce the Instruction Giver's route on the Instruction Follower's map

• Speakers know that the maps are not identical

• 128 digitally recorded unscripted dialogues and 64 citation-form readings of lists of landmark names

• Transcriptions and a wide range of annotations are available as XML documents

• Separation of corpus and annotation


4. Dialogue Annotation

Levels of Annotation

• phonetic / phonological / orthographic

• prosody

• morphology / syntax / semantics

• co-reference

• dialogue acts

• turn-taking

• cross-level

• acoustic (noise, phone line characteristics)

• communication problems

• speech recognition results (human-machine dialogues)


4. Dialogue Annotation

Dialogue Acts

• Dialogue Moves (MapTask)

• Six initiating moves

– instruct - commands the partner to carry out an action
– explain - states information which has not been elicited by the partner
– check - requests the partner to confirm information
– align - checks the attention or agreement of the partner
– query-yn - asks a question which takes a "yes" or "no" answer
– query-w - any query which is not covered by the other categories

• One pre-initiating move

– ready - a move which occurs after the close of a dialogue game and prepares the conversation for a new game to be initiated


4. Dialogue Annotation

Dialogue Acts (2)

• Five response moves:

– acknowledge - a verbal response which minimally shows that the speaker has heard the move to which it responds
– reply-y - any reply to any query with a yes-no surface form which means "yes", however that is expressed
– reply-n - a reply to a query with a yes/no surface form which means "no"
– reply-w - any reply to any type of query which doesn't simply mean "yes" or "no"
– clarify - a repetition of information which the speaker has already stated, often in response to a check move


4. Dialogue Annotation

Dialogue Grammars

• sequencing regularities in dialogue (adjacency pairs)

• capture the fact that questions are generally followed by answers, and proposals by acceptances

• dialogues are a collection of such act sequences, with embedded sequences for digressions and repairs

• Dialogue grammars can be used to predict the next dialogue act of the user
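A minimal sketch of such a predictive dialogue grammar as act bigrams in Python (the training sequence of dialogue-act labels is invented for illustration):

from collections import Counter, defaultdict

# hypothetical sequence of dialogue-act labels from annotated dialogues
acts = ["ready", "instruct", "acknowledge", "query-yn", "reply-y",
        "instruct", "acknowledge", "query-yn", "reply-n", "instruct"]

following = defaultdict(Counter)
for prev, nxt in zip(acts, acts[1:]):
    following[prev][nxt] += 1       # count which act follows which

def predict_next(prev_act):
    # most likely next dialogue act given the previous one
    counts = following[prev_act]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("query-yn"))     # -> "reply-y"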


Dialogue Acts (DAMSL)

[Figure: DAMSL dialogue act annotation scheme]

4. Dialogue Annotation

Cross-Level Annotation

• Cross-level annotation provides links between different levels of annotation

• Useful for annotation of communication problems, which can be caused by phenomena in other levels (e.g. morphosyntax, coreference …)

• XML IDs and references provide a mechanism for annotation of cross-level phenomena


4. Dialogue Annotation

Annotation Graphs

• 'Linguistic Annotation' covers any descriptive or analytical notations applied to raw language data

• The basic data may be in the form of time functions - audio, video and/or physiological recordings - or it may be textual

• Annotation graphs focus on the logical structure of linguistic annotations, not on file formats

• Annotation graphs provide a common conceptual core for a wide variety of existing annotation formats

(Bird and Liberman, 2001)


4. Dialogue Annotation

Annotation Graphs: formal definition

• An annotation graph G over a label set L and timelines (Ti, ≤i) is a 3-tuple <N, A, τ> consisting of a node set N, a collection A of arcs labelled with elements of L, and a time function τ : N → ∪i Ti, which satisfies the following conditions:

1. <N,A> is a labelled acyclic digraph containing no nodes of degree zero.

2. For any path from node n1 to n2 in A, if τ(n1) and τ(n2) are defined, then there is a timeline i such that τ(n1) ≤i τ(n2).
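A minimal sketch of this structure as a Python class (the class and method names are illustrative; the example arcs reuse word timings from the MATE timed-unit example in section 6):

# Nodes carry an optional time (the partial function tau); arcs carry labels.
class AnnotationGraph:
    def __init__(self):
        self.times = {}        # node -> time, only for anchored nodes
        self.arcs = []         # (from_node, to_node, label)

    def add_node(self, node, time=None):
        if time is not None:
            self.times[node] = time

    def add_arc(self, src, dst, label):
        self.arcs.append((src, dst, label))

g = AnnotationGraph()
g.add_node(1, 0.0000)
g.add_node(2, 0.3294)
g.add_node(3, 0.8432)
g.add_arc(1, 2, "okay")
g.add_arc(2, 3, "starting")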


5. Introduction to XML

• XML = Extensible Markup Language

• successor of SGML

• W3C standard

• very versatile; used for markup of texts, data interchange, databases, description of chemical structures, annotation of dialogues (e.g., MATE), specification of dialogues (e.g., VoiceXML) among others

• can describe any tree structure with complex node labels

• description of graph structures with identifiers and references

• An XML document consists of entities, elements and attributes


5. Introduction to XML

History of XML

[Figure: history of XML. SGML, with (La)TeX among the earlier markup systems, gave rise to HTML and XML; hyperlinking came in via HTML; XML in turn underlies XHTML, VXML (VoiceXML) and MATE.]


5. Introduction to XML

XML Elements and Attributes

• Elements delimit sections of documents

<phone>
  <country>49</country>
  <city>30</city>
  <number>345077</number>
  <ext>62</ext>
</phone>

• Attributes add information to elements

<phone type="mobile">
  <country name="de">
    <net operator="vi" type="gsm1800">179</net>
    <number status="secret">1238189</number>
  </country>
</phone>


5. Introduction to XML

ID and ID reference

• An ID attribute uniquely identifies an XML element

<person id="123"><name><first>Tony</first><last>Blair</last></name>

</person>

• An ID reference points to an element identified by an ID

<government>
  <prime_minister idref="123"/>
  <defense_minister idref="321"/>
</government>


5. Introduction to XML

XML DTD

• The DTD (document type definition) is a grammar that defines valid XML documents

<!ELEMENT PHONE (COUNTRY, NET, NUMBER)>
<!ATTLIST PHONE
  type CDATA #IMPLIED>
<!ELEMENT COUNTRY EMPTY>
<!ATTLIST COUNTRY
  name CDATA #REQUIRED>
<!ELEMENT NET (#PCDATA)>
<!ATTLIST NET
  operator CDATA #IMPLIED
  type CDATA #IMPLIED>


6. Dialogue Annotation in XML

MATE Annotation Tools

• MATE addresses the problems of creating, acquiring, and maintaining language corpora

1. through the development of a standard for annotating resources

2. through the provision of tools which make the processes of knowledge acquisition and extraction more efficient

• MATE treats spoken dialogue corpora at multiple levels, focusing on prosody, (morpho-) syntax, co-reference, dialogue acts, and communicative difficulties, as well as cross-level interaction.


6. Dialogue Annotation in XML

MATE Timed-Unit File

<timed_unit_stream id="xyz">
  <tu id="q1ec1g.1" start="0.0000" end="0.3294" utt="1">okay</tu>
  <tu id="q1ec1g.4" start="0.3294" end="0.8432" utt="1">starting</tu>
  <tu id="q1ec1g.5" start="0.8432" end="1.3702" utt="1">off</tu>
  <sil id="q1ec1g.6" start="1.3702" end="1.5777"/>
  <tu id="q1ec1g.7" start="1.5777" end="1.8413" utt="1">we</tu>
  <tu id="q1ec1g.8" start="1.8414" end="2.2201" utt="1">are</tu>
  <sil id="q1ec1g.9" start="2.2201" end="2.3518"/>
  <tu id="q1ec1g.10" start="2.3518" end="2.8722" utt="1">above</tu>
  <sil id="q1ec1g.11" start="2.8722" end="2.9644"/>
  <tu id="q1ec1g.12" start="2.9644" end="3.0369" utt="1">a</tu>
  <tu id="q1ec1g.13" start="3.0369" end="3.5244" utt="1">caravan</tu>
  <tu id="q1ec1g.14" start="3.5244" end="3.9394" utt="1">park</tu>
  <noi id="q1ec1g.16" start="3.9394" end="4.2885" type="nonvocal"/>
  <sil id="q1ec1g.17" start="4.2885" end="4.5784"/>
  <noi id="q1ec1g.18" start="4.5784" end="4.8617" type="lipsmack"/>
  <noi id="q1ec1g.19" start="4.8617" end="5.3492" type="breath"/>
</timed_unit_stream>


6. Dialogue Annotation in XML

MATE Timed Unit DTD

<!ELEMENT timed_unit_stream (tu|sil|noi)*>
<!ATTLIST timed_unit_stream
  id ID #REQUIRED>
<!ELEMENT tu (#PCDATA)>
<!ATTLIST tu
  id ID #REQUIRED
  start CDATA #REQUIRED
  end CDATA #REQUIRED
  utt CDATA #IMPLIED
  realisation CDATA #IMPLIED>
<!ELEMENT sil EMPTY>
<!ATTLIST sil
  id ID #REQUIRED
  start CDATA #REQUIRED
  end CDATA #REQUIRED
  utt CDATA #IMPLIED>


6. Dialogue Annotation in XML

MATE Timed Unit DTD (2)

<!-- noi: a noise of some kind -->
<!ELEMENT noi EMPTY>
<!ATTLIST noi
  id ID #REQUIRED
  start CDATA #REQUIRED
  end CDATA #REQUIRED
  utt CDATA #IMPLIED
  type (lipsmack|outbreath|inbreath|breath|laugh|nonvocal|
        phongesture|unintelligible|lowamp|cough|external) #REQUIRED>

<!-- Types of noises:
lipsmack - smacking of lips and related lip/tongue clicks
outbreath - an out breath
inbreath - an in breath
breath - a breath (not distinguished between in/out)
laugh - laughter
nonvocal - non vocal tract noise
phongesture - mere phonatory gesture: phonation without phonemes
unintelligible - unintelligible, but apparently meant to be a word
lowamp - trailing low amplitude noise at the end of a word
cough - a cough
external - other -->
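A minimal sketch of reading such a timed-unit file with Python's standard XML library (the file name is hypothetical):

import xml.etree.ElementTree as ET

# parse a MATE timed-unit file (hypothetical file name)
root = ET.parse("q1ec1.timed-units.xml").getroot()

for el in root:
    if el.tag == "tu":        # a timed word
        print(el.get("start"), el.get("end"), el.text)
    elif el.tag == "noi":     # a noise event
        print(el.get("start"), el.get("end"), "[noise: " + el.get("type") + "]")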


6. Dialogue Annotation in XML

Dialogue Moves Markup in MATE

<move id="q1ec1.g.move.1" who="giver" label="ready" href="&gfile;#id(q1ec1g.1)"/>

<move id="q1ec1.g.move.2" who="giver" label="instruct" href="&gfile;#id(q1ec1g.4)..id(q1ec1g.14)"/>

<ims id="q1ec1.g.move.3.5" who="giver" href="&gfile;#id(q1ec1g.16)..id(q1ec1g.19)"/>

ID references (HREF) refer to timed units on the previous slide


6. Dialogue Annotation in XML

MATE Dialogue Moves DTD

<!ELEMENT move_stream (move|ims)*>
<!ATTLIST move_stream
  id ID #REQUIRED>

<!ELEMENT move (tu|sil|noi)*>
<!ATTLIST move
  id ID #REQUIRED
  who (giver | follower) #REQUIRED
  label (instruct | explain | check | query-yn | query-w | align |
         reply-y | reply-n | reply-w | acknowledge | clarify |
         ready | uncodable) "uncodable"
  meta (true-meta | false-meta) "false-meta"
  aban (true-aban | false-aban) "false-aban"
  rep (other | self | none) "none"
  interj (true-interj | false-interj) "false-interj"
  cont (true-cont | false-cont) "false-cont">


6. Dialogue Annotation in XML

MATE Dialogue Moves DTD (2)

<!ELEMENT ims (sil|noi)*>

<!ATTLIST ims

id ID #REQUIRED

who (giver | follower) #IMPLIED

%embedHyperlinkAttrs;>


6. Dialogue Annotation in XML

Architectures for Annotation

[Figure: two-level vs. three-level architecture. In both, annotation tools, query tools, conversion tools, extraction tools and evaluation software operate on data stored as XML, text files or a relational database (RDB); in the three-level architecture they access the data through the AG API. (Bird and Liberman 2001)]


7. Evaluation of Spoken Dialogue Systems

• Blackbox evaluation: overall system performance is judged, but not its internal components

– Examples: task success, contextual appropriateness, user satisfaction

• Glassbox evaluation: system components are evaluated

– Examples: Word Accuracy, Concept Accuracy

• Subjective measures (e.g. user satisfaction) require human judgement; objective measures do not


7. Evaluation of Spoken Dialogue Systems

Evaluation: Turing Test

• Invented by computer science pioneer Alan Turing

• A system passes the Turing test if a human interlocutor cannot distinguish between human and machine

• The Turing test does not really test system intelligence or appropriate dialogue behaviour, but rather the ability to get away with simple behaviours

• The winning systems of the Loebner Prize often simulate paranoid or otherwise pathological dialogue behaviour

• It is not used as a serious evaluation method


7. Evaluation of Spoken Dialogue Systems

Evaluation Measures

measure (type): interpretation; purpose

• word (concept) accuracy (objective): proportion of the user's input words (domain concepts) that are correctly recognised; evaluates the performance of the speech recogniser (and language models)

• task success rate (objective): proportion of transactions that are successfully completed by the user; evaluates the system's usability

• productivity (objective): time required for completing a transaction; evaluates efficiency for the user

• user satisfaction (subjective): measure of the overall impression of a system; evaluates the user's impression


7. Evaluation of Spoken Dialogue Systems

Explaining User Satisfaction

• PARADISE evaluation research project at AT&T Labs

• Different objective evaluation metrics are used

– task success

– word accuracy

– dialogue cost (number of turns, number of repairs, etc.)

• Linear regression analysis is used to determine the relative contribution of the different objective criteria to subjective user satisfaction

• The method can be applied to whole dialogues and to subdialogues

• Different dialogue strategies can be compared
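A minimal sketch of the regression step in Python with numpy (the per-dialogue metrics and satisfaction ratings are invented; PARADISE also normalises the predictors before fitting):

import numpy as np

# hypothetical per-dialogue metrics: task success, word accuracy, number of turns
X = np.array([[1, 0.92, 12],
              [0, 0.75, 25],
              [1, 0.88, 15],
              [0, 0.60, 30],
              [1, 0.95, 10]], dtype=float)
y = np.array([4.5, 2.0, 4.0, 1.5, 5.0])   # user satisfaction ratings

# add an intercept column and fit by least squares
A = np.hstack([np.ones((len(X), 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)   # relative contribution of each objective metric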


8. Dialogue Specification

• Specification of the dialogue flow is a critical factor in the development of spoken dialogue systems

• Approaches for the definition of dialogue flow:

– Behaviouristic models (stimulus-response)

– Flowcharts (e.g. CSLU toolkit)

– Slot-filling (e.g. VoiceXML)

– Condition-Action Rules (e.g. HDDL)

– Planning

– Re-usable components (e.g. Nuance SpeechObjects)

– Information State

– Event-driven


8. Dialogue Specification

Behaviouristic Models

• Dialogue behaviour is determined by pattern/response pairs

• Such systems are generally referred to as chatbots or chatterbots because they do "smalltalk"

• ELIZA is an early system (Weizenbaum, 1960s) simulating a non-directive psychologist

• Commercial systems are available from companies like Kiwi-Logic or Artificial Life

• Cannot carry out a goal-oriented dialogue, but useful for reacting to certain situations, e.g. FAQs


8. Dialogue Specification

Dialogue Spec: Finite-State Models

• Clear flow of dialogue

• Limited flexibility in dialogue flow

• Very unwieldy for more complex dialogues

[Figure: finite-state dialogue with states greeting, request destination, request departure, request date, request departure time, request arrival time, list flights, and bye.]


8. Dialogue Specification

Dialogue Spec: Finite-State Models (2)

[Figure: the same finite-state dialogue extended with yes/no verification transitions after each question and a "sorry" state.]

(from Androutsopoulos and Aretoulaki, in press)
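A minimal sketch of a finite-state dialogue controller corresponding to such a graph (prompts and the transition table are illustrative; the verification branches of the second figure are omitted):

# state -> (prompt, next state); a real system would branch on the
# recognition result and on yes/no verification, as in the figure
STATES = {
    "greeting":     ("Welcome to flight information.", "destination"),
    "destination":  ("Where would you like to fly to?", "departure"),
    "departure":    ("Where are you flying from?", "date"),
    "date":         ("On which date?", "list_flights"),
    "list_flights": ("Here are the matching flights.", "bye"),
    "bye":          ("Goodbye.", None),
}

state = "greeting"
while state is not None:
    prompt, nxt = STATES[state]
    print("SYSTEM:", prompt)
    state = nxt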


8. Dialogue Specification

Dialogue Spec: Finite-State Models

[Screenshot: Rapid Application Developer (RAD) from the CSLU toolkit]


8. Dialogue Specification

Dialogue Spec: Slot Filling

• System asks for missing information

• Over-answering can be handled easily

• Flexible dialogue flow


8. Dialogue Specification

Slot Filling: Example

Departure_Airport [London, Manchester, Glasgow, Birmingham]

Arrival_Airport [London, Manchester, Glasgow, Birmingham]

Departure_Date [<DATE>]

Departure_Time [<TIME_OF_DAY>, morning, afternoon, evening]

Number_of_Seats [1 ... 9]

Return_Flight [<BOOLEAN>]

Return_Date [<DATE>]

Return_Time [<TIME_OF_DAY>, morning, afternoon, evening]
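A minimal sketch of a slot-filling loop over a form like the one above (prompts and the interpretation step are placeholders; a real semantic parser could fill several slots from one utterance, which is how over-answering is handled):

# slots of a flight-booking form; None means still unfilled
slots = {"Departure_Airport": None, "Arrival_Airport": None, "Departure_Date": None}

PROMPTS = {"Departure_Airport": "From which airport do you want to leave?",
           "Arrival_Airport": "Where do you want to fly to?",
           "Departure_Date": "On which date?"}

def interpret(utterance, asked_slot):
    # placeholder for semantic interpretation: here the utterance simply
    # fills the slot that was asked for
    return {asked_slot: utterance} if utterance else {}

while any(v is None for v in slots.values()):
    asked = next(s for s, v in slots.items() if v is None)  # first unfilled slot
    print("SYSTEM:", PROMPTS[asked])
    for slot, value in interpret(input("USER: ").strip(), asked).items():
        if slots.get(slot) is None:
            slots[slot] = value

print("form complete:", slots)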


8. Dialogue Specification

Dialogue Spec: Planning

• Dialogue system is given a goal, and tries to achieve the goal through general-purpose planning algorithms.

• Pre-conditions can be specified for a goal

• Example:

Goal: provide flight information

Preconditions:

Know departure airport

Know destination airport

Know flight date and time

Actions: look up flight in database, inform user


8. Dialogue Specification

Dialogue Spec: Planning (2)

Goal: know information X

Precondition: X cannot be inferred from existing knowledge

Action:

Find X in database OR

Ask user about X

• General-purpose planning frameworks facilitate the integration of AI techniques (knowledge bases, inference etc.) into dialogue systems


8. Dialogue Specification

Condition-Action Rules

• Condition-Action Rules consist of a condition (COND) and an action

• The rules are checked in sequence until one condition is satisfied. The action of the rule is then executed, and the process starts over again.

• Conditions relate to the status of system variables (unknown, known, verified) or recogniser output (e.g. NO_SPEECH, NOTHING_UNDERSTOOD)

• Slot-filling can be easily implemented by condition-action rules

• Overanswering can be handled well

• Example: HDDL, which is used in the Philips SpeechMania dialogue system


8. Dialogue Specification

HDDL condition-action rule

COND( art == "paket" && !^gewicht )

{

QUESTION(gewicht)

{

INIT

{

"Geben Sie bitte das Gewicht des Pakets an";

}

}

}
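A minimal sketch of such a rule interpreter in Python (the rules and state variables are illustrative; this is not actual HDDL semantics):

# dialogue state: slot values plus the last recogniser event
state = {"art": "paket", "gewicht": None, "event": None}

def weight_unknown(s):
    return s["art"] == "paket" and s["gewicht"] is None

# rules are checked in sequence; the first satisfied condition fires
RULES = [
    (lambda s: s["event"] == "NO_SPEECH",
     "I could not hear you. Please repeat."),
    (weight_unknown,
     "Please state the weight of the parcel."),  # cf. the HDDL rule above
]

def next_action(s):
    for cond, action in RULES:
        if cond(s):
            return action
    return None

print(next_action(state))   # -> asks for the weight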


8. Dialogue Specification

Modularisation: Speech Objects

• SpeechObjects are re-usable dialogue modules

• SpeechObjects perform well-defined functions such as taking time and date or taking credit card information (type, number, expiry date, name of cardholder)

• Error handling and verification is built into the speech objects

• Developers can build up their own libraries of re-usable speech objects.


9. VoiceXML

• VoiceXML is a language for the specification of dialogue systems

• VoiceXML is an XML application defined by a DTD (Document Type Definition).

• Dialogue flow by "slot-filling" (Form Interpretation Algorithm)

• Processing is similar to the filling of forms in HTML pages

• VoiceXML is a W3C (WWW Consortium) standard and is supported by a number of companies.


9. VoiceXML

VoiceXML Goals

• Minimize client-server interactions by specifying multiple interactions per document

• Shield application authors from low-level, platform-specific details

• Separate user interaction code (in VoiceXML) from service logic (CGI scripts)

• Promote service portability across implementation platforms

• Ease of use for simple interactions, and powerful language features for complex dialogues


9. VoiceXML

VoiceXML Architecture

[Figure: a document server exchanges requests and VoiceXML documents with the VoiceXML interpreter and its interpreter context, which run on the implementation platform.]


9. VoiceXML

VoiceXML example

<?xml version="1.0"?><vxml version="1.0"> <form> <field name="drink"> <prompt>Would you like coffee, tea, milk, or nothing?</prompt> <grammar src="drink.gram" type="application/x-jsgf"/> </field> <block> <submit next="http://www.drink.example/drink2.asp"/> </block> </form></vxml>


9. VoiceXML

VoiceXML example dialogue

S (System): Would you like coffee, tea, milk, or nothing?

B (User): Orange juice.

S: I did not understand what you said.

S: Would you like coffee, tea, milk, or nothing?

B: Tea

S: (continues execution with the VoiceXML program drink2.asp)


9. VoiceXML

VoiceXML Form Interpretation Algorithm

• select phase: the next form item is selected for visiting.

• collect phase: the next unfilled form item is visited, which prompts the user for input, enables the appropriate grammars, and then waits for and collects an input (such as a spoken phrase or DTMF key presses) or an event (such as a request for help or a no input timeout).

• process phase: an input is processed by filling form items and executing <filled> elements to perform actions such as input validation. An event is processed by executing the appropriate event handler for that event type.

(from VoiceXML 1.0 specification)
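A minimal sketch of this select/collect/process loop in Python, using the single-field form from the earlier example (grammar handling and event types are heavily simplified):

form = {"drink": None}   # one field, as in the VoiceXML example above
PROMPTS = {"drink": "Would you like coffee, tea, milk, or nothing?"}
GRAMMAR = {"coffee", "tea", "milk", "nothing"}

while True:
    # select phase: pick the next unfilled form item
    item = next((f for f, v in form.items() if v is None), None)
    if item is None:
        break                      # all items filled -> submit
    # collect phase: prompt, enable the field grammar, wait for input
    print("SYSTEM:", PROMPTS[item])
    utterance = input("USER: ").strip().lower()
    # process phase: fill the item, or handle a nomatch event
    if utterance in GRAMMAR:
        form[item] = utterance
    else:
        print("SYSTEM: I did not understand what you said.")

print("submit:", form)             # hand off, e.g. to drink2.asp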


10. Challenges

• Combining spoken dialogue and multimedia interaction (multimodal dialogue)

• Combining speech recognition and pointing/clicking on the display

• Combining speech output with (animated) graphics or video

• Adaptation to the user

• Adaptation to the communicative situation

• Defining a dialogue specification language that is easy to use, and expressive enough to model complex dialogue behaviours

• Learning from annotated dialogues (e.g. Jönsson 93)