Spoken Dialog System Architecture Joshua Gordon CS4706.

Outline

- Examples of deployed and research SDS architectures / conversational speech interfaces
- Discussion of the issues and challenges in SDS design
- A tour of the Olympus SDS architecture, and a flyby of basic design considerations pertinent to
  - Recognition
  - Spoken language understanding
  - Dialog management, error handling, belief updating
  - Language generation / speech synthesis
  - Interaction management, turn taking

Information Seeking, Transaction Based Spoken Dialog Systems

Where we are: most of today’s production systems are designed for database access and call routing.

- Columbia: CheckItOut – virtual librarian
- CMU: Let’s Go! – Pittsburgh bus schedules
- Google: Goog411 – directory assistance, Google Voice Search
- MIT: Jupiter – weather information
- Nuance: built to order, technical support

Speech Aware Kiosks

“How may I help you? I can provide directory assistance, and directions around campus.”

Current research at Microsoft: SDS architectures are beginning to incorporate multimodal input.

Speech Interfaces to Virtual Characters

- Negotiate an agreement between soldiers and village elders
- Both auditory and visual cues used in turn taking
- Prosody, facial expressions convey emotion

SGT Blackwell: http://ict.usc.edu/projects/sergeant_blackwell/

SDS architectures are exploring multimodal output (including gesturing and facial expression) to indicate level of understanding.

Speech Interfaces to Robotic Systems

Next generation systems explore ambitious domains.

User: Fly to the red house and photograph the area.

System: OK, I am preparing to take off.

http://www.cellbots.com/

Speech Aware Appliances

- Speech aware appliances are beginning to engage in limited dialogs
- Interactive dialogs / disambiguation are required by multi-field queries, ambiguity in results

| Expected | What user actually said |
|---|---|
| Play artist Glenn Miller | Glenn Miller, jazz |
| Play song All Rise | All rise, I guess, from blues |

Human-Human vs. Human-Machine Speech

Is recognition performance the limiting factor?

- Challenges exist in computationally describing conversational phenomena, for instance
  - Evolving discourse structure. Consider answering a question with a question.
  - Turn taking. Auditory cues (let alone gesture) are important – listen to two speakers competing for the conversational floor.
  - Grounding. Prosody and intonation contours indicate our level of understanding.
- Research in SDS architectures addresses frameworks to capture the above – there is a long way to go before we achieve human-like conversational partners
- Other issues: SDSs lack the ability to effectively communicate their capabilities and limitations as conversational partners

An Architecture for a Virtual Librarian

- Domain of interest: The Andrew Heiskell Braille and Talking Book Library
  - Ability to browse and order books by phone (there are 70,000 of them!)
  - Callers have relatively disfluent speech. Anticipate poor recognizer performance.

The CMU Olympus Framework

- A freely available, actively developed, open source collection of dialog system components
- Origins in the earlier Communicator project

The Olympus Architecture

- Pipeline format, subsequent layers increase abstraction.
- Signals to words, words to concepts, concepts to actions
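The three-stage flow on this slide can be pictured as a chain of functions. What follows is only a toy sketch under loud assumptions: the function names, the keyword rules, and the fixed confidence value are all invented for illustration and are not Olympus components or APIs.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """Output of the recognizer: words plus a confidence score."""
    words: list[str]
    confidence: float


def recognize(signal: str) -> Hypothesis:
    # Stand-in for ASR (signals to words): the "signal" is already text here,
    # and the confidence is a made-up constant.
    return Hypothesis(words=signal.lower().split(), confidence=0.8)


def understand(hyp: Hypothesis) -> dict[str, str]:
    # Stand-in for SLU (words to concepts): hypothetical keyword spotting.
    concepts = {}
    if "braille" in hyp.words:
        concepts["media"] = "braille"
    if "order" in hyp.words or "borrow" in hyp.words:
        concepts["dialog_act"] = "book_request"
    return concepts


def decide(concepts: dict[str, str]) -> str:
    # Stand-in for dialog management (concepts to actions).
    if concepts.get("dialog_act") == "book_request":
        return "ask_for_title"
    return "ask_to_rephrase"


def run_pipeline(signal: str) -> str:
    """Each layer consumes the previous layer's output, raising abstraction."""
    return decide(understand(recognize(signal)))
```

Calling `run_pipeline("I want to order a Braille book")` walks one utterance through all three layers. A real system also passes confidence scores and N-best alternatives downstream rather than a single best guess.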

Detail: Hub Architecture

Deployed - and almost deployed ;) - Olympus Systems

| System | Domain | Users | Interaction | Vocab |
|---|---|---|---|---|
| Let’s Go Public! | Pittsburgh Bus Route Information | General public | Information access (system initiative), background noise | 2000 words |
| Team Talk | Robot Coordination and Control – Treasure hunting | Grad students / researchers | Multi-participant command and control | 500 words |
| CheckItOut | Virtual Librarian for the Andrew Heiskell Library | Elderly, vision impaired library patrons | Information access (mixed initiative), disfluent speech | Variable, +/- 10,000 words |

Speech recognition

Why ASR is Difficult for SDS

An SDS must accommodate variability in…

- Environments
  - Background noise, cell phone interference, VOIP
- Speech production
  - Disfluency, false starts, filled pauses, repeats, corrections, accent, age, gender, differences between human-human and human-machine speech
- The caller’s technological familiarity
  - With dialog systems in general, and with a particular SDS’s capabilities and constraints; callers often use OOV / out of domain concepts

The Sphinx Open Source Recognition Toolkit

- Pocket-sphinx
  - Pocket-sphinx is efficient / runs on embedded devices
  - Continuous speech, speaker independent recognition system
  - Includes tools for language model compilation, pronunciation, and acoustic model adaptation
  - Provides word level confidence annotation, n-best lists
- Olympus supports parallel decoding engines / models
  - Typically runs parallel acoustic models for male and female speech

http://cmusphinx.sourceforge.net/

Language and Acoustic Models

- Sphinx supports statistical, class, and state based language models
  - Statistical language models assign n-gram probabilities to word sequences
  - Class based models assign probabilities to collections of terminals, e.g., “I would like to read <book>”
- State based LM switching
  - Limits the perplexity of the language model by constraining it to anticipated words
  - <confirmation / rejection>, <help>, <address>, <books>
- Olympus includes permissive-license WSJ acoustic models (read speech) for male and female speech, at 8 kHz and 16 kHz bandwidth
  - Tools for acoustic adaptation
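To make the perplexity argument behind state based LM switching concrete, here is a minimal sketch of an add-one smoothed bigram model, with one model per dialog state. The state names, the training sentences, and the smoothing choice are assumptions made for illustration; this is not how Sphinx builds or stores its language models.

```python
import math
from collections import Counter


class BigramLM:
    """Add-one smoothed bigram model trained on a list of sentences."""

    def __init__(self, sentences: list[str]):
        self.bigrams = Counter()
        self.contexts = Counter()
        self.vocab = {"</s>"}
        for sentence in sentences:
            tokens = ["<s>"] + sentence.split() + ["</s>"]
            self.vocab.update(tokens[1:])
            for prev, word in zip(tokens, tokens[1:]):
                self.bigrams[(prev, word)] += 1
                self.contexts[prev] += 1

    def prob(self, prev: str, word: str) -> float:
        # The extra +1 in the size reserves probability mass for unseen words.
        size = len(self.vocab) + 1
        return (self.bigrams[(prev, word)] + 1) / (self.contexts[prev] + size)

    def perplexity(self, sentence: str) -> float:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        log_prob = sum(math.log(self.prob(p, w)) for p, w in zip(tokens, tokens[1:]))
        return math.exp(-log_prob / (len(tokens) - 1))


# One small model per dialog state (hypothetical states and training text).
STATE_LMS = {
    "confirmation": BigramLM(["yes", "no", "yes please", "no thanks", "that is right"]),
    "books": BigramLM(["i would like to read dune", "do you have dune", "i would like emma"]),
}
```

Scoring the reply "yes please" with `STATE_LMS["confirmation"].perplexity(...)` gives a lower perplexity than scoring it with the `"books"` model, which is the motivation for switching to the model that matches what the system just asked.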

ASR introduces uncertainty

- SDS architectures always operate on partial information
  - Managing that uncertainty is one of the main challenges
- How you say it often conveys as much information as what is said.
  - Prosody, intonation, amplitude, duration
  - Moving from an acoustic signal to a lexical representation implies information loss
- Information provided to downstream components
  - A lexical representation of the speech signal, with acoustic confidence and language model fit scores
  - An N-best list

Spoken Language Understanding

From words to concepts

- SLU: the task of extracting meaning from utterances
  - Dialog acts (the overall intent of an utterance)
  - Domain specific concepts: frame / slots
- Challenge for the library domain – the words in the 70k titles cover a subset of conversational English! Vocabulary confusability.
  - Very difficult under noisy conditions

“Does the library have Hitchhikers Guide to the Galaxy by Douglas Adams on audio cassette?”

| Slot | Value |
|---|---|
| Dialog Act | Book Request |
| Title | The Hitchhikers Guide to the Galaxy |
| Author | Douglas Adams |
| Media | Audio Cassette |

SLU Challenges faced by SDS

- There are many, many possible ways to say the same thing
  - How can SDS designers anticipate all of them?
- SLU can be greatly simplified by constraining what the user can say (and how they can say it!)
  - But.. results in a less habitable, clunky conversation. Who wants to chat with a system like that?
- Recognizer error, background noise resulting in indels (insertions / substitutions / deletions), word boundary detection problems
- Language production phenomena: disfluency, false starts, corrections, repairs are difficult to parse
- Meaning spans multiple speaker turns

Semantic grammars

- Frames, concepts, variables, terminals
- Domain independent concepts
  - [Yes], [No], [Help], [Repeat], [Number]
- Domain dependent concepts
  - [Title], [Author], [BookOnTape], [Braille]
- The pseudo corpus LM trick

    [Quit]
        (*THANKS *good bye)
        (*THANKS goodbye)
        (*THANKS +bye)
    ;
    THANKS
        (thanks *VERY_MUCH)
        (thank you *VERY_MUCH)
    VERY_MUCH
        (very much)
        (a lot)
    ;
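One reading of the "pseudo corpus LM trick" is that the semantic grammar itself is used to generate sentences, and those sentences become training text for the recognizer's language model. The sketch below enumerates every string the [Quit] net can produce. It assumes that `*` marks an optional token and `+` marks one or more repetitions (expanded here as exactly one), and the Python dictionary is a hand transcription of the slide, not a parser for real grammar files.

```python
from itertools import product

# The [Quit] net from the slide, transcribed by hand into Python strings.
GRAMMAR = {
    "[Quit]": ["*THANKS *good bye", "*THANKS goodbye", "*THANKS +bye"],
    "THANKS": ["thanks *VERY_MUCH", "thank you *VERY_MUCH"],
    "VERY_MUCH": ["very much", "a lot"],
}


def expand_token(token: str) -> list[str]:
    """Return every phrase a single pattern token can produce."""
    optional = token.startswith("*")
    name = token.lstrip("*+")
    if name in GRAMMAR:
        phrases = expand(name)   # macro reference: expand recursively
    else:
        phrases = [name]         # terminal word
    # Assumption: '+' (one or more) is expanded as exactly one occurrence.
    return ([""] if optional else []) + phrases


def expand(symbol: str) -> list[str]:
    """Enumerate all word strings generated by a net or macro."""
    results = []
    for pattern in GRAMMAR[symbol]:
        choices = [expand_token(tok) for tok in pattern.split()]
        for combo in product(*choices):
            results.append(" ".join(part for part in combo if part))
    return results


# Deduplicated sentences that could serve as LM training text.
pseudo_corpus = sorted(set(expand("[Quit]")))
```

Exhaustive enumeration only works for small nets like this one. For a grammar with large title or author lists, sampling a fixed number of expansions would be the more practical choice.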

Semantic parsers

- Phoenix parses the incoming stream of recognition hypotheses
- Phoenix maps input sequences of words to semantic frames
  - A frame is a named set of slots, where slots represent pieces of related information
  - Each slot has an associated context-free grammar (CFG), specifying word patterns that match the slot
  - Chart parsing selects the path which accounts for the maximum number of terminals
  - Multiple parses may be produced for a single utterance
- Aside: prior to dialog management, the selected slot triggered a state table update

Estimating confidence in a parse

- How are initial confidences assigned to concepts?
  - Helios (a confidence annotator) uses a logistic regression model to score Phoenix parses
- This score reflects the probability of correct understanding, i.e. how much the system trusts that the current semantic interpretation corresponds to the user’s expressed intent
- Features from different knowledge sources
  - Acoustic confidence, language model score, parse coverage, dialog state, …
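A logistic regression confidence score is a sigmoid applied to a weighted sum of features. The sketch below shows only that shape: the feature names echo the slide, but the weights, the bias, and the assumption that every feature is scaled to [0, 1] are invented for illustration and are not Helios's trained model.

```python
import math

# Hypothetical weights; a real annotator would learn them from labeled dialogs.
WEIGHTS = {
    "acoustic_confidence": 2.0,
    "lm_score": 1.0,
    "parse_coverage": 3.0,
    "expected_in_state": 1.5,
}
BIAS = -4.0


def parse_confidence(features: dict[str, float]) -> float:
    """Logistic regression: sigmoid of a weighted sum of the features."""
    z = BIAS + sum(WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


# A parse that covers the utterance well and fits the dialog state.
good = parse_confidence({"acoustic_confidence": 1.0, "lm_score": 1.0,
                         "parse_coverage": 1.0, "expected_in_state": 1.0})
# A fragmentary parse of a poorly recognized utterance.
poor = parse_confidence({"acoustic_confidence": 0.2, "lm_score": 0.1,
                         "parse_coverage": 0.3, "expected_in_state": 0.0})
```

Because the output lies between 0 and 1, the dialog manager can read it as a probability of correct understanding and compare it against thresholds.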

Belief updating

Grammars Generalize Poorly

- Are hand engineered grammars the way to go?
  - Require expert linguistic knowledge to construct
  - Time consuming to develop and tune
  - Difficult to maintain over complex domains
  - Lack robustness to OOV words and novel phrasing
  - Lack robustness to recognizer error and disfluent speech
  - Noise tolerance is difficult to achieve

Statistical methods (to the rescue?)

- Language understanding as pattern recognition
- Given word sequence W, find the semantic representation of meaning M that has maximum a posteriori probability P(M|W)
- P(M): prior meaning probability, based on dialogue state
- P(W|M): assigns probability to word sequence W given the semantic structure

$$\hat{M} = \arg\max_{M} P(M \mid W) = \arg\max_{M} P(W \mid M)\, P(M)$$
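The second equality follows from Bayes' rule: P(M|W) = P(W|M) P(M) / P(W), and P(W) does not depend on M, so it drops out of the argmax. The toy decoder below illustrates the decision rule. Every probability in it is made up, and treating the words of W as independent given M is a simplifying assumption, not something the slide specifies.

```python
import math

# Hypothetical toy model: P(M) per dialog state, and P(word | M).
PRIOR = {
    "awaiting_confirmation": {"confirm": 0.6, "reject": 0.3, "book_request": 0.1},
    "open_prompt": {"confirm": 0.1, "reject": 0.1, "book_request": 0.8},
}
WORD_GIVEN_MEANING = {
    "confirm": {"yes": 0.5, "right": 0.3, "that": 0.1},
    "reject": {"no": 0.6, "wrong": 0.2, "that": 0.1},
    "book_request": {"read": 0.3, "book": 0.3, "like": 0.2, "that": 0.1},
}
UNSEEN = 0.01  # floor probability for words a meaning has never produced


def map_meaning(words: list[str], state: str) -> str:
    """Return argmax over M of P(W|M) P(M), computed in log space."""
    def log_score(meaning: str) -> float:
        log_likelihood = sum(
            math.log(WORD_GIVEN_MEANING[meaning].get(w, UNSEEN)) for w in words
        )
        return log_likelihood + math.log(PRIOR[state][meaning])
    return max(PRIOR[state], key=log_score)
```

The word "that" is equally likely under all three meanings here, so the prior decides: `map_meaning(["that"], "awaiting_confirmation")` picks `confirm`, while the same word in the `open_prompt` state picks `book_request`.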

Relative merits: Statistical vs. Knowledge based SLU

- Statistical methods
  - Provide more robust coverage, especially for naïve users who respond frequently with OOV (out of vocabulary) words
  - Require labeled training data (some efforts to produce via simulation studies)
  - Better for shallow understanding
  - Excellent for call routing, question answering (assuming the question is drawn from a predefined set!)
- Semantic parsers
  - Provide a richer representation of meaning
  - Require substantially more effort to develop
  - Assist in the development of state based language models

Voice search

- Database search with noisy ASR queries
- Phonetic, partial matching database queries
- Frequently used in information retrieval domains where Spoken Dialog Systems must access a database
- Challenges
  - Multiple database fields
  - Confusability of concepts

ASR query: “The Language of Issa Come Wars”

| Return | Confidence |
|---|---|
| The language of sycamores | .8 |
| The language of clothes | .65 |
| The language of threads | .51 |
| The language of love | .40 |
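The idea of partial matching against a catalog can be sketched with Python's standard `difflib.SequenceMatcher`. This is a crude character-level stand-in for the phonetic matching the slide mentions, chosen only because it needs no extra libraries. Its scores are not the confidences in the table above, and its ranking need not agree with that table.

```python
from difflib import SequenceMatcher

# A few titles from the slide's example; the real catalog holds about 70,000.
TITLES = [
    "The Language of Sycamores",
    "The Language of Clothes",
    "The Language of Threads",
    "The Language of Love",
]


def voice_search(asr_text: str, titles: list[str],
                 top_n: int = 3) -> list[tuple[str, float]]:
    """Rank titles by character-level similarity to the noisy ASR string."""
    query = asr_text.lower()
    scored = [
        (title, SequenceMatcher(None, query, title.lower()).ratio())
        for title in titles
    ]
    # Highest similarity first; keep only the best top_n candidates.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


results = voice_search("the language of issa come wars", TITLES)
```

Returning a ranked list with scores, rather than a single best title, lets the dialog manager decide whether to confirm the top candidate or offer several choices. Matching on phoneme strings instead of spelling would be closer to the approach the slide describes.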

Preprocessing

- Dialog act classification
  - Request for book by author, by title, by ISBN
  - Useful for grounding, error handling, maintaining the situational frame
- Named entity recognition via statistical tagging – as a preprocessor for voice search

In Practice

- Institute for Creative Technologies: Virtual Humans
  - Question answering: maps user utterances to a small set of predefined answers
  - Robust to high word error rate (WER), up to 50%
- The AT&T Spoken Language Understanding System
  - Couples statistical methods for call-routing with semantic grammars for named-entity extraction

Dialogue Management

From concepts to actions

- How do SDS designers represent the dialog task?
  - Hierarchical plans, state / transaction tables, Markov processes
- When should the user be allowed to speak?
  - Tradeoffs between system and mixed initiative dialog management. A system initiative SDS has no uncertainty about the dialog state… but is inherently clunky and rigid
- How will the system manage uncertainty and error handling?
  - Belief updating, domain independent error handling strategies
- RavenClaw: a two tier dialog management architecture which decouples the domain specific aspects of dialog control from belief updating and error handling
  - The idea is to generalize the dialog management framework across domains

Dialogue Task Specification, Agenda, and Execution

Distributed error handling

Error recovery strategies

| Error Handling Strategy (misunderstanding) | Example |
|---|---|
| Explicit confirmation | Did you say you wanted a room starting at 10 a.m.? |
| Implicit confirmation | Starting at 10 a.m. ... until what time? |

| Error Handling Strategy (non-understanding) | Example |
|---|---|
| Notify that a non-understanding occurred | Sorry, I didn’t catch that. |
| Ask user to repeat | Can you please repeat that? |
| Ask user to rephrase | Can you please rephrase that? |
| Repeat prompt | Would you like a small room or a large one? |

Goal is to avoid non-understanding cascades – the farther the dialog gets off track, the more difficult it is to recover.
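One common way to choose among these strategies is to compare the confidence of the understood concept against thresholds. The sketch below uses the prompts from the tables above, but the thresholds (0.5 and 0.9) and the function itself are illustrative assumptions, not the policy of any particular dialog manager.

```python
from typing import Optional


def choose_strategy(concept: Optional[str], confidence: float) -> str:
    """Pick a recovery prompt for a start-time concept such as "10 a.m.".

    The thresholds are hypothetical; deployed systems tune or learn them.
    """
    if concept is None:
        # Non-understanding: nothing usable was parsed from the turn.
        return "Sorry, I didn't catch that. Can you please repeat that?"
    if confidence < 0.5:
        # Low confidence: verify before committing (explicit confirmation).
        return f"Did you say you wanted a room starting at {concept}?"
    if confidence < 0.9:
        # Medium confidence: confirm while moving on (implicit confirmation).
        return f"Starting at {concept} ... until what time?"
    # High confidence: accept silently and continue the task.
    return "Until what time?"
```

Explicit confirmation costs an extra turn but is harder to misread. Implicit confirmation keeps the dialog moving while still giving the user a chance to correct a misunderstanding.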

Statistical Approaches to Dialogue Management

- Is it possible to learn a management policy from a corpus?
- Dialogue may be modeled as Partially Observable Markov Decision Processes
- Reinforcement learning is applied (either to existing corpora or through user simulation studies) to learn an optimal strategy
- Evaluation functions typically reference the PARADISE framework – taking into account objective and subjective criteria
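In a partially observable model the system never sees the user's goal directly. It keeps a probability distribution (a belief) over goals and revises it after each noisy observation. The sketch below shows a single Bayes update under two simplifying assumptions that are not from the slides: the user's goal stays fixed between turns, and a recognizer result with confidence c is correct with probability c, with the remaining mass spread evenly over the other goals. The goal names are hypothetical, and the belief must contain at least two goals.

```python
def update_belief(belief: dict[str, float], observation: str,
                  confidence: float) -> dict[str, float]:
    """Bayes update over user goals after one noisy observation."""
    others = len(belief) - 1  # assumes at least two candidate goals
    unnormalized = {}
    for goal, prior in belief.items():
        if goal == observation:
            likelihood = confidence
        else:
            likelihood = (1.0 - confidence) / others
        unnormalized[goal] = likelihood * prior
    total = sum(unnormalized.values())
    # Normalize so the posterior sums to one.
    return {goal: value / total for goal, value in unnormalized.items()}


# Two weak observations of the same goal reinforce each other.
belief = {"small_room": 0.5, "large_room": 0.5}
belief = update_belief(belief, "small_room", 0.6)
belief = update_belief(belief, "small_room", 0.6)
```

A learned policy would then map this belief, rather than a single assumed state, to the next system action. That policy is the part reinforcement learning optimizes.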

Interaction management

Turn taking

- Mediates between the discrete, symbolic reasoning of the dialog manager, and the continuous real-time nature of user interaction
- Manages timing, turn-taking, and barge-in
  - Yields the turn to the user should they interrupt
  - Prevents the system from speaking over the user
- Notifies the dialog manager of
  - Interruptions and incomplete utterances
  - New information provided while the DM is thinking

Natural Language Generation and Speech Synthesis

NLG and Speech Synthesis

- Template based, e.g., for explicit error handling strategies
  - Did you say <concept>?
  - More interesting cases in disambiguation dialogs
- A TTS synthesizes the NLG output
  - The audio server allows interruption mid utterance
- Production systems incorporate
  - Prosody, intonation contours to indicate degree of certainty
- Open source TTS frameworks
  - Festival - http://www.cstr.ed.ac.uk/projects/festival/
  - Flite - http://www.speech.cs.cmu.edu/flite/
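Template based generation amounts to filling named placeholders in canned prompts. Here is a minimal sketch using the "Did you say <concept>?" template from this slide. The action names and the `generate` helper are hypothetical, and no TTS call is shown because the output is plain text that a synthesizer would receive.

```python
import re

# Hypothetical prompt templates keyed by dialog-manager action.
TEMPLATES = {
    "explicit_confirm": "Did you say <concept>?",
    "ask_repeat": "Can you please repeat that?",
}


def generate(action: str, **slots: str) -> str:
    """Fill every <slot> placeholder in the template chosen by `action`."""
    def fill(match: re.Match) -> str:
        # Look up the placeholder name (without angle brackets) in the slots.
        return slots[match.group(1)]
    return re.sub(r"<(\w+)>", fill, TEMPLATES[action])
```

For example, `generate("explicit_confirm", concept="The Language of Sycamores")` produces a confirmation prompt for a voice search result. A missing slot raises `KeyError`, which surfaces template and dialog manager mismatches early.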

Putting it all together

CheckItOut Scenarios

Future challenges

- Multi-participant conversations
  - How does each system identify who has the conversation floor and who is the addressee for any spoken utterance?
  - How can multiple agents solve the channel contention problem, i.e. multiple agents speaking over each other?
- Understand how objects, locations, and tasks come to be described in language.
  - Robots and humans will need to mutually ground their perceptions to effectively communicate about tasks.

References

- Alex Rudnicky et al. (1999). Creating natural dialogs in the Carnegie Mellon Communicator system. Eurospeech.
- Gupta, N. et al. (2006). The AT&T spoken language understanding system. IEEE Transactions on Audio, Speech, and Language Processing.
- Dan Bohus. (2007). Error awareness and recovery in conversational spoken language interfaces. PhD Thesis, Carnegie Mellon University.
- Dan Bohus and Eric Horvitz. (2009). Learning to Predict Engagement with a Spoken Dialog System in Open-World Settings. SIGDIAL.

Thanks! Questions?
