David DeVault CSCI 599 Special Topic: Natural Language Dialogue … · Speech Text Semantic Natural...

Natural language understanding in dialogue systems

David DeVault

CSCI 599 Special Topic: Natural Language Dialogue Systems

Spring 2013

Overview

● What is NLU in dialogue systems?
● Automatic Speech Recognition
● Possible inputs to an NLU module
● Possible outputs from an NLU module
● NLU sub-problems
● NLU-related special topics

What is NLU in dialogue systems?

[Figure: dialogue system pipeline. Speech → Automatic Speech Recognition (ASR) → Text → Natural Language Understanding (NLU) → Semantic Representation → Dialogue Manager (DM) → Semantic Representation → Natural Language Generation (NLG) → Text → Text-to-Speech Synthesis (TTS) → Speech]

Why is NLU challenging in dialogue systems?

[Figure: Speech → Automatic Speech Recognition (ASR) → Text → Natural Language Understanding (NLU) → Semantic Representation]

● Presence of ASR errors

– ASR errors mean the NLU module does not start from an accurate transcript of what the user said

– Depending on the dialogue system, domain, and other details, an ASR Word Error Rate (WER) of 0.1 to 1.0 (or higher!) is common

Speech: we are prepared to give you guys generators for electricity

ASR output: we apparently give you guys generators for a letter city
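WER for the speech/ASR pair above can be computed with a word-level edit distance. The following is a minimal sketch, assuming whitespace tokenization and equal cost for substitutions, insertions, and deletions; the function name `word_error_rate` is illustrative and not taken from any toolkit. For this pair it gives 6 edits over 10 reference words, i.e. a WER of 0.6.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # At the start of iteration i, prev[j] is the distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(substitution, prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate(
    "we are prepared to give you guys generators for electricity",
    "we apparently give you guys generators for a letter city",
)
```

Because insertions count as errors but do not lengthen the reference, WER can exceed 1.0 when the recognizer outputs many more words than were spoken.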





● Ambiguity of natural language
– The same surface expression can often mean several different things
– Ambiguity is pervasive in NLP
  ● syntactic ambiguity: I saw the man with the binoculars
  ● referential ambiguity: The object is the beige diamond
  ● word sense ambiguity: I need to go to the bank
  ● ambiguity in implication: It's cold outside
  ● ...
– An NLU module often needs to resolve ambiguity and identify the user's specific meaning

Example of a referential ambiguity

which object?





● Synonymy in natural language
– Multiple surface expressions can often mean the same thing from a system standpoint
– An NLU module often needs to map many different surface texts onto the same meaning

User: we are prepared to give you guys generators for electricity

User: we'd be willing to give you generators

User: I have power generators I could give you

User: if generators are what you need I can help with that

...





● Context-sensitivity of language
– The same surface expression can mean different things in different contexts
– An NLU module may need to find a specific interpretation for expressions like "it all" and "that" in the context of the utterance
– Need to find a way to represent the relevant aspects of linguistic context

User: if you can get it all for $500, let's go with that





● Challenges with semantic representations
– Human linguistic meaning can be very complex and nuanced
– It can be hard to find a general-purpose semantic representation that is practical for use in your specific system
  ● The semantic distinctions that matter tend to vary from application to application
  ● System builders often create idiosyncratic representations for their domains





● Phenomena of spontaneous speech are often more complex and “noisy” than written text
– Disfluencies are common in spoken dialogue
  ● Filled pauses (uhh, umm, etc.)
  ● Repairs
  ● Restarts
  ● Sentence fragments
– Overlapping speech can occur
– Speaker turns can be complex (multiple meanings expressed)

Example of spontaneous dialogue (from Meteer 1995, Switchboard)



Automatic Speech Recognition

● ASR is the problem of automatically transcribing captured audio samples of human speech
– Input: an audio sample
– Output: a sequence of words

ASR at USC

● The Signal Analysis and Interpretation Laboratory (SAIL) engages in ASR research

● Detailed ASR techniques are covered in separate courses at USC
– EE 519 Speech Recognition and Processing for Multimedia
– EE 619 Advanced Topics in Automatic Speech Recognition

● For an overview of ASR techniques, see Jurafsky and Martin (2009), Speech and Language Processing, 2nd edition, chapter 9

Use of ASR in dialogue systems

● Selection of a speech recognizer

● Resources that may be needed

– Grammar

– Language model

– Phonetic dictionary

– Acoustic model

● Tool for capturing audio

Some readily available ASRs

● CMU Sphinx4
– Open source ASR
– http://cmusphinx.sourceforge.net/
● CMU Pocketsphinx
– Open source ASR
– http://cmusphinx.sourceforge.net/
● Google Speech API
– Unofficial API that is free to use (supported in Chrome and Android)
● AT&T WATSON
– ASR API that is free to use (Internet, HTML5, Android, iOS)
– http://www.research.att.com/projects/WATSON/index.html
● Nuance
– Fee-based software
– http://www.nuance.com/
● Microsoft Windows Speech Recognition
– Available for Windows users

ASR in the Virtual Human Toolkit

● Audio can be captured with AcquireSpeech
– https://vhtoolkit.ict.usc.edu/display/VHTK/AcquireSpeech
● The toolkit includes CMU's Pocketsphinx ASR
– Accessed through a wrapper library called pocketsphinx-sonic-server
– https://vhtoolkit.ict.usc.edu/display/VHTK/PocketSphinx+Wrapper
  ● Describes how to set the language model to a file you have created

Grammars

● Appropriate when you can easily circumscribe the syntactic form of user utterances

● Grammars are typically hand-crafted for the domain
– Can be “brittle” when users deviate from the grammar
● Grammar format may vary by ASR
– E.g. Sphinx takes JSGF format grammars (regular language)
– See http://www.w3.org/TR/jsgf/

grammar hello;
public <greet> = (Good morning | Hello) ( Bhiksha | Evandro | Paul | Philip | Rita | Will );

From http://cmusphinx.sourceforge.net/wiki/tutoriallm
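Since the greeting grammar above describes a regular language, its coverage can be illustrated with an ordinary regular expression. The sketch below is a hand-written Python equivalent of the `<greet>` rule, assuming lowercased, whitespace-normalized input; it is an illustration of brittleness only, not how Sphinx loads or compiles JSGF, and `in_grammar` is a hypothetical helper name.

```python
import re

# Hand-written regular expression mirroring the <greet> rule.
# Illustration only: this is not how Sphinx processes JSGF grammars.
GREETINGS = ["good morning", "hello"]
NAMES = ["bhiksha", "evandro", "paul", "philip", "rita", "will"]
GREET_RE = re.compile(
    r"^(?:%s) (?:%s)$" % ("|".join(GREETINGS), "|".join(NAMES))
)


def in_grammar(utterance: str) -> bool:
    """Return True if the utterance is covered by the toy greeting grammar."""
    normalized = " ".join(utterance.lower().split())
    return GREET_RE.match(normalized) is not None
```

An utterance such as "Good morning Rita" is accepted, while small deviations such as "hi Rita" or "hello there Paul" fall outside the grammar, which is the brittleness noted above.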

Statistical Language Models

● A language model (LM) assigns probabilities to word sequences
● Statistical LMs are trained using a collection of sample utterances
– Large text samples are preferable
● Generally used in open-ended domains (like conversation with virtual humans) where the syntactic form of user utterances is hard to circumscribe
● Language models can be created with readily available tools
– SRI Language Modeling Toolkit
  ● http://www.sri.com/engage/products-solutions/sri-language-modeling-toolkit
– CMU Statistical Language Modeling Toolkit
  ● http://www.speech.cs.cmu.edu/SLM_info.html
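To make "assigns probabilities to word sequences" concrete, here is a minimal sketch of a bigram language model with add-one smoothing, trained on three made-up sample utterances. It is a toy written for illustration, assuming whitespace tokenization; it does not reflect the file formats or estimation methods of the SRI or CMU toolkits, and all function names are hypothetical.

```python
import math
from collections import Counter


def train_bigram_lm(utterances):
    """Count history words and bigrams over utterances padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for utt in utterances:
        words = ["<s>"] + utt.lower().split() + ["</s>"]
        unigrams.update(words[:-1])            # every word that serves as a history
        bigrams.update(zip(words, words[1:]))  # adjacent word pairs
    vocab = {w for utt in utterances for w in utt.lower().split()} | {"</s>"}
    return unigrams, bigrams, vocab


def log_prob(utterance, model):
    """Add-one smoothed log probability of a word sequence under the bigram model."""
    unigrams, bigrams, vocab = model
    words = ["<s>"] + utterance.lower().split() + ["</s>"]
    total = 0.0
    for prev, word in zip(words, words[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab)))
    return total


model = train_bigram_lm([
    "we can give you generators",
    "we are prepared to give you generators",
    "i can give you power generators",
])
likely = log_prob("we can give you generators", model)
unlikely = log_prob("generators you give can we", model)
```

The word order seen in training scores higher than the scrambled order, which is the preference an ASR decoder exploits when choosing among acoustically similar hypotheses.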

Phonetic Dictionary

...

GENERATOR JH EH N ER EY T ER

GENERATOR'S JH EH N ER EY T ER Z

GENERATORS JH EH N ER EY T ER Z

GENEREUX ZH EH N ER OW

GENERIC JH AH N EH R IH K

GENERICALLY JH AH N EH R IH K L IY

GENERICS JH AH N EH R IH K S

GENERO JH AH N ER OW

GENEROSITY JH EH N ER AA S AH T IY

GENEROUS JH EH N ER AH S

GENEROUSLY JH EH N ER AH S L IY

GENES JH IY N Z

GENESCO JH EH N EH S K OW

GENESEE JH EH N AH S IY

GENESIS JH EH N AH S AH S

GENET JH EH N IH T

GENETIC JH AH N EH T IH K

GENETICALLY JH AH N EH T IH K L IY

...

excerpt from cmudict.0.7a_SPHINX_40
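Each line of the excerpt above is a word followed by its phone sequence, separated by whitespace. A minimal sketch of reading such lines into a lookup table follows, assuming exactly that layout; markers for alternate pronunciations and comment lines, which a full dictionary file may contain, are not handled here, and `parse_pronunciations` is a hypothetical helper name.

```python
def parse_pronunciations(lines):
    """Map each word to a list of phone sequences (one per listed pronunciation)."""
    lexicon = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank lines and elisions such as "..."
        word, phones = parts[0], parts[1:]
        lexicon.setdefault(word, []).append(phones)
    return lexicon


excerpt = [
    "GENERATOR JH EH N ER EY T ER",
    "GENERATORS JH EH N ER EY T ER Z",
    "...",
    "GENES JH IY N Z",
]
lexicon = parse_pronunciations(excerpt)
```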

Acoustic models

● An acoustic model computes the probability of the observed acoustic features in an audio sample given a word (phone) sequence

● May be trained on 5-10 hours of speech (or much more)
● Sphinx provides US English acoustic models for
– microphone and broadcast speech
– telephone speech

● Default acoustic models may work less well for different recording environments or accented English (UK English, Indian English)
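The acoustic model and the language model are usually combined through the standard noisy-channel decoding rule, sketched here in the notation of textbook treatments such as Jurafsky and Martin. The symbols are introduced for illustration: O is the sequence of observed acoustic features and W ranges over candidate word sequences.

```latex
\hat{W} \;=\; \arg\max_{W} P(W \mid O) \;=\; \arg\max_{W} \underbrace{P(O \mid W)}_{\text{acoustic model}} \; \underbrace{P(W)}_{\text{language model}}
```

The second equality follows from Bayes' rule, because P(O) is the same for every candidate W and can be dropped from the maximization.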

ASR output options

● 1-best text hypothesis
– Most common approach in implemented dialogue systems
● N-best text hypotheses or word lattice
– May offer advantages in certain domains
– See De Mori et al. (2008)



Possible inputs to an NLU module

● ASR output (1-best, N-best, or lattice)
● Dialogue context of the utterance
– simple summary of state of dialogue
  ● state in a finite state model of the dialogue interaction
  ● key aspects of dialogue history (as in Phoenix)
  ● last system utterance
– information state representation of dialogue state (next week)
  ● can encode arbitrary aspects of dialogue history
● Other knowledge resources (e.g. database)



Possible outputs from an NLU module

● Different dialogue systems formalize the NLU problem in different ways

● Some common NLU outputs include:
– slot values
– frames
– speech act labels
– speech act label + semantic content

NLU output: Slot values

● Slot values can be identified with pattern matching directly on the text input to NLU

● Slot-matching patterns may be regular expressions (FSAs) or context-free grammars (RTNs)

● Some words may be skipped between matched slots, improving robustness

● Slot-values may be combined with more complex NLU outputs to capture details like numeric values

User: … from Denver ...

Example: Phoenix parser

Figures from the Phoenix Parser User Manual

● Parser emits a frame = set of slot values like Origin.City: Denver
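The idea of matching slot patterns while skipping intervening words can be sketched with regular expressions. This is an illustration under stated assumptions, not the Phoenix parser or its grammar format: the city list, the `Destination.City` slot, and the function `extract_slots` are hypothetical, and only `Origin.City: Denver` comes from the example above.

```python
import re

# Hypothetical slot patterns; only the Origin.City slot name comes from the slides.
CITY = r"(?P<city>denver|boston|pittsburgh)"
SLOT_PATTERNS = {
    "Origin.City": re.compile(r"\bfrom " + CITY + r"\b"),
    "Destination.City": re.compile(r"\bto " + CITY + r"\b"),
}


def extract_slots(text: str) -> dict:
    """Return a frame (slot -> value); words outside any pattern are skipped."""
    frame = {}
    lowered = text.lower()
    for slot, pattern in SLOT_PATTERNS.items():
        match = pattern.search(lowered)
        if match:
            frame[slot] = match.group("city").title()
    return frame


frame = extract_slots("uh I want to fly from Denver um to Boston please")
```

Because each pattern is searched for independently, filled pauses and other unmatched words between slots do not prevent a frame from being produced.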

NLU output: Frames

● A frame is a collection of slot-values
– Very flexible representation
– Can decompose the meaning of an utterance into the components that are meaningful to a dialogue system
– Can have hierarchical structure
– Values can be shared across slots
– The slot-value framework can be used to encode various kinds of semantic representations
● Frame outputs can be constructed in many ways
– Slot-value parsing (as in the Phoenix parser)
– Data-driven statistical classification (as in mxNLU)
– Syntactic parsing + semantic rules
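One way to picture a hierarchical frame is as a nested mapping that flattens into dotted slot paths of the kind shown in the SASO-EN examples in this deck (e.g. `<S>.sem.event move`). The sketch below assumes a plain Python dict as the frame container; the nesting, the `<S>` default prefix, and the helper name `flatten_frame` are illustrative choices, not a description of any system's internal format.

```python
def flatten_frame(frame: dict, prefix: str = "<S>") -> dict:
    """Flatten a nested frame into dotted slot paths mapped to atomic values."""
    flat = {}
    for key, value in frame.items():
        path = f"{prefix}.{key}"
        if isinstance(value, dict):
            flat.update(flatten_frame(value, path))  # recurse into sub-frames
        else:
            flat[path] = value
    return flat


nested = {
    "mood": "declarative",
    "sem": {
        "speechact": {"type": "offer"},
        "event": "deliver",
        "theme": "power-generators",
    },
}
flat = flatten_frame(nested)
```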

NLU output: Speech act labels

● Speech acts capture aspects of what utterances are used to do
– Arose from a theoretical view of spoken utterances as actions (Austin, 1962; Searle, 1969)
● Taxonomies of speech act types may be defined
  ● greeting, acknowledging, asserting, offering, etc.
● Speech act types are often used to represent what type of action a user utterance is making from the system's perspective


Speech act type: offer

An example speech act taxonomy

● Switchboard SWBD-DAMSL (Jurafsky, Shriberg, and Biasca, 1997)
– http://groups.inf.ed.ac.uk/switchboard/dialactmanual.html

SWBD-DAMSL                  Example                                     Count    %
Statement-non-opinion       Me, I'm in the legal department.            72,824   36%
Acknowledge (Backchannel)   Uh-huh.                                     37,096   19%
Statement-opinion           I think it's great                          25,197   13%
Agree/Accept                That's exactly it.                          10,820   5%
Abandoned or Turn-Exit      So, -                                       10,569   5%
Appreciation                I can imagine.                              4,633    2%
Yes-No-Question             Do you have to have any special training?   4,624    2%
Non-verbal                  [Laughter], [Throat_clearing]               3,548    2%
...

Other speech act taxonomies

● You will sometimes see terms like dialogue acts, dialogue moves, conversation acts, etc. being used in similar ways

● For discussion and references, see
– Bunt et al. (2010), Towards an ISO standard for dialogue act annotation.
– Traum (2000), 20 Questions for Dialogue Act Taxonomies, in Journal of Semantics, 17(1):7-30

● Available special topic: dialogue act modeling

Discussion

● Questions about speech acts?

NLU output: speech act label + semantic content

● Dialogue systems generally need to know more than the type of speech act

● They also need the content of that speech act

NLU output:


Example: semantic information in SASO-EN

● Attributes and values are linked to a domain-specific ontology and task model (Traum 2003)

● Semantic representation includes
– entities (kirk, power generators),
– locations,
– events (deliver),
– modal information (can),
– polarity
● Semantic representation is rich, but still not comprehensive, e.g. it currently does not support certain meanings:
– conditional offers (if you give me X, I'll give you Y)
– complex statements (I'll give you X and I'll bring you Y)
– ...

User: we are prepared to give you guys generators for electricity

NLU in SASO-EN

● NLU is performed by a trained MaxEnt classifier called mxNLU
● Input is a set of features extracted from ASR:
– bag-of-words (unigrams)
– bigrams
– each pair of words
– number of words
– can include syntactic features
● Output is one of 136 complete frames
● Trained on 4,500 example utterances
– Training set size is essential to performance

(Sagae, Christian, DeVault, Traum, NAACL 2009)
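The feature types listed above can be sketched as a simple extraction function over the 1-best ASR text. This is a guess at the general shape, not mxNLU's actual feature code: the feature string formats are invented, "each pair of words" is assumed to mean unordered pairs of distinct words in the utterance, and syntactic features are omitted.

```python
from itertools import combinations


def extract_features(asr_text: str) -> set:
    """Build a sparse binary feature set from a 1-best ASR string."""
    words = asr_text.lower().split()
    features = {f"unigram={w}" for w in words}
    features |= {f"bigram={a}_{b}" for a, b in zip(words, words[1:])}
    # Assumption: "each pair of words" = unordered pairs of distinct words.
    features |= {f"pair={a}+{b}" for a, b in combinations(sorted(set(words)), 2)}
    features.add(f"length={len(words)}")
    return features


features = extract_features("we apparently give you guys generators for a letter city")
```

A classifier trained on such features selects one complete frame per utterance, so the feature set must carry enough evidence to survive ASR errors like the one in this input.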

How does framebank size affect performance?

[Figure: NLU performance vs. framebank size. x-axis: training set size (0 to 4,500); y-axis: F-score (0 to 0.9); curves: Maxent (train on ASR), Retriever (train on ASR), Bayesian (train on ASR)]

Challenges in annotating semantic frames

User's speech: We have to move this clinic

Semantic interpretation:

<S>.mood declarative
<S>.sem.speechact.type statement
<S>.sem.modal.deontic must
<S>.sem.type event
<S>.sem.event move
<S>.sem.source market
<S>.sem.agent captain-kirk
<S>.sem.theme clinic
<S>.sem.destination camp-near-us-base

[Figure: a stack of frames (hundreds of frames) alongside a stack of utterances such as "We have to move this clinic" and "I have to move your clinic" (thousands of utterances)]

The IORelator GUI

● Input Output Relator
– Inputs = utterances
– Outputs = semantics
● Key Features
– Can create views of data using filters
– Multiple views can be positioned and linked together
– Easy to highlight and explore previous annotations
– Groups of similar utterances can be quickly annotated

(DeVault et al., 2010)

SASO-EN NLU evaluation criterion

NLU input: we apparently give you guys generators for a letter city

NLU Hypothesis:

<s>.mood declarative
<s>.sem.agent kirk
<s>.sem.event deliver
<s>.sem.modal.intention will
<s>.sem.speechact.type promise
<s>.sem.theme power-generators

Correct Interpretation:

<s>.mood declarative
<s>.sem.agent kirk
<s>.sem.event deliver
<s>.sem.modal.possibility can
<s>.sem.speechact.type offer
<s>.sem.theme power-generators
<s>.sem.type event

Precision (P) = 0.67
Recall (R) = 0.57
F-Score = 2*P*R/(P+R) = 0.62
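The scores on this slide can be reproduced by treating each frame as a set of (slot, value) elements. The sketch below assumes the six-element frame is the NLU hypothesis and the seven-element frame is the correct interpretation, which is the reading consistent with P = 4/6 and R = 4/7; the function name `frame_prf` is illustrative.

```python
def frame_prf(hypothesis: set, reference: set):
    """Precision, recall and F-score over frame elements (slot, value)."""
    correct = len(hypothesis & reference)
    precision = correct / len(hypothesis) if hypothesis else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Elements that appear in both frames on the slide.
shared = {
    ("mood", "declarative"), ("sem.agent", "kirk"),
    ("sem.event", "deliver"), ("sem.theme", "power-generators"),
}
hypothesis = shared | {("sem.modal.intention", "will"), ("sem.speechact.type", "promise")}
reference = shared | {
    ("sem.modal.possibility", "can"), ("sem.speechact.type", "offer"), ("sem.type", "event"),
}
p, r, f = frame_prf(hypothesis, reference)
```

Scoring at the level of frame elements gives partial credit: the hypothesis gets the speech act and modality wrong but is still rewarded for the agent, event, theme, and mood.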

SASO-EN NLU performance

● Frame accuracy ≈ .66
● Precision = .81
● Recall = .71
● F-Score = .76

Other approaches to frame-based NLU

● Can use a syntactic parser + semantic rules
– See e.g. De Mori et al. (2008)
● Can use a tagging model (e.g. CRF) to tag individual words in the word sequence with frame elements (slot values)
– See e.g. Heintze et al. (2010)
● Can build an ensemble of classifiers for each slot
– See e.g. Heintze et al. (2010)
● Many other approaches possible (e.g. MT-based)

Syntactic parser + semantic rules


from De Mori et al (2008)

Example: semantic information in TacQ

● The TacQ dialogue system (Gandhe et al., 2009) represents utterances using
– speech act type
– (object, attribute, value) triple
● Example:

User: what is the name of the man you saw?

NLU output: wh-question object=strange-man attribute=name

Other representations of semantic content

● Many dialogue systems use hand-crafted representations of semantic content
– Directly capture distinctions that are important to the system
– Avoid semantic representation challenges that aren't important to the system
– Sometimes use existing domain model or database representations
● General purpose resources can sometimes be used, e.g. PropBank, WordNet, FrameNet
● Some recent efforts toward distributional semantics (e.g. Mitchell & Lapata, 2010)
– Try to capture semantics in a vector space model
– Composition operations are challenging



NLU sub-problems

● There are many additional sub-problems that may be solved by an NLU module
– Parsing
– Resolving referring expressions
– Anaphora resolution
– Named entity detection
– Word sense disambiguation
– Segmentation
– ...



NLU-related special topics

● referring in dialogue
● dialogue act modeling
● dialogue act recognition
● incremental speech processing
● multi-modal dialogue

Assignment 2 Part 2

● (on the web)
