Multimodal Dialog

Intelligent Robot Lecture Note
Multimodal Dialog System
• A system which supports human-computer interaction over multiple different input and/or output modes.
► Input: voice, pen, gesture, facial expression, etc.
► Output: voice, graphical output, etc.
• Applications
► GPS navigation
► Information guide systems
► Smart home control
► Etc.
• Example (voice + pen): "Tell me the fastest way from here to here.", spoken while marking two locations with the pen.
Motivations
• Speech: the ultimate interface?
► + Interaction style: natural (free speech)
◦ Natural repair process for error recovery
► + Richer channel: conveys the speaker's disposition and emotional state (if the system knew how to deal with that)
► − Inconsistent input (high error rates), hard to correct errors
◦ e.g., we may get a different result each time we speak the same words
► − Slow (sequential) output style when using TTS (text-to-speech)
• How to overcome these weak points?
► Multimodal interface!
Advantages of Multimodal Interface
• Task performance and user preference
• Migration of human-computer interaction away from the desktop
• Adaptation to the environment
• Error recovery and handling
• Special situations where mode choice helps
Task Performance and User Preference
• Task performance and user preference for multimodal over speech-only interfaces [Oviatt et al., 1997]
► 10% faster task completion
► 23% fewer words (shorter and simpler linguistic constructions)
► 36% fewer task errors
► 35% fewer spoken disfluencies
► 90-100% user preference to interact this way
• Speech-only dialog system
► Speech: "Bring the drink on the table to the side of the bed."
• Multimodal dialog system
► Speech: "Bring this to here."
► Pen gesture: (points to the drink, then to the target location)
→ Easy, simplified user utterance!
Migration of Human-Computer Interaction away from the desktop
• Small portable computing devices
► Such as PDAs, organizers, and smart-phones
► Limited screen real estate for graphical output
► Limited input: no keyboard/mouse (arrow keys, thumbwheel)
► Complex GUIs not feasible
► Augment the limited GUI with natural modalities such as speech and pen
◦ Use less space
◦ Rapid navigation over menu hierarchies
• Other devices
► Kiosks, car navigation systems, …
◦ No mouse or keyboard
→ Speech + pen gesture
Adaptation to the environment
• Multimodal interfaces enable rapid adaptation to changes in the environment
► Allow the user to switch modes
► Mobile devices are used in multiple environments
• Environmental conditions can be either physical or social
► Physical
◦ Noise: increases in ambient noise can degrade speech performance → switch to GUI / stylus pen input
◦ Brightness: bright light in outdoor environments can limit the usefulness of a graphical display
► Social
◦ Speech may be easiest for passwords, account numbers, etc., but in public places users may be uncomfortable being overheard → switch to GUI or keypad input
Error Recovery and Handling
• Advantages for recovery and reduction of errors:
► Users intuitively pick the mode that is less error-prone.
► Language is often simplified.
► Users intuitively switch modes after an error
◦ The same problem is not repeated.
◦ Multimodal error correction
► Cross-mode compensation (complementarity)
◦ Combining inputs from multiple modalities can reduce the overall error rate
◦ A multimodal interface thus potentially has a lower error rate than any single mode
Special Situations Where Mode Choice Helps
• Users with disabilities
• People with a strong accent or a cold
• People with RSI (repetitive strain injury)
• Young children or non-literate users
• Other users who have problems handling the standard devices: mouse and keyboard

• Multimodal interfaces let people choose their preferred interaction style depending on the actual task, the context, and their own preferences and abilities.
Multimodal Dialog System Architecture
• Architecture of QuickSet [Cohen et al., 1997]
► Multi-agent architecture
(Diagram) Agents such as VR/AR interfaces (MAVEN, BARS), speech/TTS, sketch/gesture, natural language, the map interface, multimodal integration, simulators, web services (XML, SOAP, …), databases, a CORBA bridge, COM objects, Java-enabled web pages, and other user interfaces all communicate through a Facilitator (routing, triggering, dispatching) using an inter-agent communication language (ICL: Horn clauses); facilitators can also connect to other facilitators.
Multimodal Language Processing
Multimodal Reference Resolution
• Multimodal Reference Resolution
► Need to resolve references (what the user is referring to) across modalities.
► A user may refer to an item in a display by using speech, by pointing, or both.
► Closely related to multimodal integration.
• Example (voice + pen): "Tell me the fastest way from here to here."
Multimodal Reference Resolution
• Multimodal Reference Resolution
► Finds the most proper referents for referring expressions [Chai et al., 2004]
◦ Referring expression
– Refers to a specific entity or entities
– Given by a user's inputs (most likely in speech inputs)
◦ Referent
– An entity which the user refers to
◦ A referent can be an object that is not specified by the current utterance.
(Timeline) Speech: "Tell me the fastest way from here to here"; the two occurrences of "here" align with pen gestures g1 and g2, which select the objects Burger King and Lotte Department Store.
Multimodal Reference Resolution
• Multimodal Reference Resolution
► Hard case
◦ Multiple and complex gesture inputs
◦ e.g., in an information guide system:

User: How much is this? (selects one item)
System: It is 15,000 won.
User: Can you compare the prices of this and these? (selects three items)

(Timeline) The speech "Can you compare the prices of this and these?" contains the referring expressions "this" and "these", accompanied by gestures g1, g2, g3; it is ambiguous whether "this" binds to g1 and "these" to {g2, g3}, or some other grouping applies.
Multimodal Reference Resolution
• Multimodal Reference Resolution
► Using linguistic theories to guide the reference resolution process [Chai et al., 2005]
◦ Conversational implicature
◦ Givenness hierarchy
► Greedy algorithm for finding the best assignment for a referring expression given a cognitive status
◦ Calculate the match score between referring expressions and referent candidates
◦ Find the best assignments using a greedy algorithm
Match(o, e) = [ P(o|S) · P(S|e) ] · Compatibility(o, e),  S ∈ {G, F, D}

P(o|S): object selectivity
P(S|e): likelihood of the status
Compatibility(o, e): compatibility measurement
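As an illustration, a minimal greedy resolver in this style. All probability tables, object ids, and the compatibility test below are invented for the example; this is a sketch of the idea, not Chai et al.'s implementation.

```python
# Greedy reference resolution sketch: each referring expression is
# assigned the unused candidate maximizing
#   Match(o, e) = sum_S P(o|S) * P(S|e)  *  Compatibility(o, e)
# where S ranges over cognitive statuses (e.g. "G" for gesture).

def compatibility(obj, expr):
    # Compatible if the expression imposes no type constraint
    # or its semantic type matches the object's type.
    return 1.0 if expr["type"] in (None, obj["type"]) else 0.0

def match(obj, expr, p_obj_given_status, p_status_given_expr):
    score = sum(p_obj_given_status[s].get(obj["id"], 0.0) *
                p_status_given_expr[s]
                for s in p_status_given_expr)
    return score * compatibility(obj, expr)

def greedy_resolve(exprs, objects, p_obj_given_status, p_status_of):
    assignment, used = {}, set()
    for e in exprs:
        best = max((o for o in objects if o["id"] not in used),
                   key=lambda o: match(o, e, p_obj_given_status,
                                       p_status_of[e["text"]]),
                   default=None)
        if best is not None:
            assignment[e["text"]] = best["id"]
            used.add(best["id"])
    return assignment

# Invented example: two "here" expressions, two map objects.
objects = [{"id": "burger_king", "type": "place"},
           {"id": "lotte", "type": "place"}]
exprs = [{"text": "here1", "type": "place"},
         {"text": "here2", "type": "place"}]
p_obj = {"G": {"burger_king": 0.9, "lotte": 0.6}}
p_status = {"here1": {"G": 1.0}, "here2": {"G": 1.0}}
assignment = greedy_resolve(exprs, objects, p_obj, p_status)
```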
Multimodal Integration
• Combining information from multiple input modalities to understand the user's intention and attention
► Multimodal reference resolution is a special case of multimodal integration
◦ Speech + pen gesture
◦ The case where pen gestures can express only deictic or grouping meanings
(Diagram) Meaning + Meaning → Multimodal Integration / Fusion → Combined Meaning
Multimodal Integration
• Issues:
► Nature of the multimodal integration mechanism
◦ Algorithmic (procedural)
◦ Parser / grammars (declarative)
► Does the approach treat one mode as primary?
◦ Is gesture a secondary, dependent mode?
– Multimodal reference resolution
► How temporal and spatial constraints are expressed
► Common meaning representation for speech and gesture
• Two main approaches
► Unification-based multimodal parsing and understanding [Johnston, 1998]
► Finite-state transducer for multimodal parsing and understanding [Johnston et al., 2000]
Unification-based multimodal parsing and understanding
• Parallel recognizers and "understanders"
• Time-stamped meaning fragments for each stream
• Common framework for meaning representation: typed feature structures
• Meaning fusion operation: unification
► Unification is an operation that determines the consistency of two pieces of partial information,
► and, if they are consistent, combines them into a single result
◦ e.g., whether a given gestural input is compatible with a given piece of spoken input,
◦ and, if they are, combining them into a single result
► Semantic and spatiotemporal constraints
• Statistical ranking
• Flexible asynchronous architecture
• Must handle unimodal and multimodal input
Unification-based multimodal parsing and understanding
• Temporal Constraints [Oviatt et al., 1997]
► Speech and gesture overlap, or
► gesture precedes speech by ≤ 4 seconds
► Speech does not precede gesture

Given the sequence: speech1; gesture; speech2
Possible grouping: speech1; (gesture; speech2)

Finding [Oviatt et al., 2004, 2005]: users have a consistent temporal integration style → adapt to it
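The temporal constraints above can be sketched as a simple predicate (a toy check over time intervals in seconds, not any system's actual grouping logic):

```python
# Temporal grouping check following Oviatt et al. [1997]: speech and
# gesture may be integrated if their intervals overlap, or if the
# gesture ends at most `max_lag` seconds before the speech starts.
# Speech that precedes the gesture is never grouped with it.

def may_integrate(speech_start, speech_end, gesture_start, gesture_end,
                  max_lag=4.0):
    overlap = speech_start < gesture_end and gesture_start < speech_end
    gesture_first = (gesture_end <= speech_start and
                     speech_start - gesture_end <= max_lag)
    return overlap or gesture_first
```

For a sequence speech1; gesture; speech2, this predicate groups the gesture with speech2 (which it precedes) rather than speech1 (which precedes it).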
Unification-based multimodal parsing and understanding
• Each unimodal input is represented as a feature structure [Holzapfel et al., 2004]
► A very common representation in computational linguistics: FUG, LFG, PATR
◦ e.g., lexical entries, grammar rules, etc.
► e.g., "please switch on the lamp"
• Predefined rules resolve the deictic references and integrate the multimodal inputs
(Feature structure)
Type
  Attr1: val1
  Attr2: val2
  Attr3: Type2
    Attr4: val4
Unification-based multimodal parsing and understanding
• An example: "Draw a line"

From speech (one of many hypotheses):
Create_line
  Object: Line (Color: green, Label: draw a line)
  Location: (unspecified)

From pen gesture (two hypotheses):
command
  Location: Point (Xcoord: 15487, Ycoord: 19547)
command (ISA Create_line)
  Location: Line (Coordlist: [(12143,12134), (12146,12134), …])

Unifying the speech and gesture structures (+) rules out the point hypothesis, since Create_line requires a line location, and yields:
Create_line
  Object: Line (Color: green, Label: draw a line)
  Location: Line (Coordlist: [(12143,12134), (12146,12134), …])

→ Cross-mode compensation
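The core mechanism can be sketched with nested dicts standing in for typed feature structures (a deliberate simplification: no type hierarchy and no structure sharing, and the attribute names are invented for this example):

```python
# Minimal feature-structure unification: dicts unify key-by-key,
# atomic values must be equal, and failure (None) propagates upward.
# This is how an incompatible gesture hypothesis (a point where a
# line is required) gets ruled out.

def unify(a, b):
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, b_val in b.items():
            if key in result:
                merged = unify(result[key], b_val)
                if merged is None:
                    return None        # conflict somewhere below
                result[key] = merged
            else:
                result[key] = b_val    # partiality: absent = unconstrained
        return result
    return a if a == b else None       # atomic values must agree

speech = {"cmd": "create_line",
          "object": {"type": "line", "color": "green"},
          "location": {"type": "line"}}
point_gesture = {"location": {"type": "point",
                              "coords": (15487, 19547)}}
line_gesture = {"location": {"type": "line",
                             "coords": [(12143, 12134), (12146, 12134)]}}

assert unify(speech, point_gesture) is None   # point is incompatible
combined = unify(speech, line_gesture)        # line hypothesis unifies
```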
Unification-based multimodal parsing and understanding
• Advantages of multimodal integration via typed feature structure unification
► Partiality
► Structure sharing
► Mutual compensation (cross-mode compensation)
► Multimodal discourse
Unification-based multimodal parsing and understanding
• Mutual Disambiguation (MD)
► Each input mode provides a set of scored recognition hypotheses
► MD derives the best joint interpretation by unification of meaning representation fragments
► P_MM = α·P_S + β·P_G + C
◦ Learn α, β, and C over a multimodal corpus
► MD stabilizes system performance in challenging environments
(Figure) Mutual disambiguation: speech hypotheses s1-s3 and gesture hypotheses g1-g4 are unified with object candidates o1-o3 to produce a ranked list of multimodal interpretations mm1-mm4.
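A toy rendering of mutual disambiguation: every consistent (speech, gesture) pair is scored with P_MM = α·P_S + β·P_G + C and the best pair wins. The hypothesis strings, scores, weights, and compatibility table are all invented for illustration.

```python
# Mutual disambiguation sketch: inconsistent pairs are discarded by the
# compatibility test (standing in for unification), so a lower-ranked
# speech hypothesis can be pulled up by the gesture evidence.

ALPHA, BETA, C = 0.6, 0.4, 0.0

def mutual_disambiguation(speech_nbest, gesture_nbest, compatible):
    """speech_nbest / gesture_nbest: lists of (hypothesis, score);
    compatible(s, g): True if the meaning fragments unify."""
    best = None
    for s, p_s in speech_nbest:
        for g, p_g in gesture_nbest:
            if not compatible(s, g):
                continue                      # pair cannot unify
            p_mm = ALPHA * p_s + BETA * p_g + C
            if best is None or p_mm > best[2]:
                best = (s, g, p_mm)
    return best

# The top speech hypothesis cannot unify with the pointing gesture
# ("zoom out" takes no location), so the second hypothesis wins.
speech = [("zoom out", 0.6), ("zoom to here", 0.5)]
gestures = [("point", 0.8)]
ok = lambda s, g: not (s == "zoom out" and g == "point")
best = mutual_disambiguation(speech, gestures, ok)
```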
Finite-state Multimodal Understanding
• Modeled by a 3-tape finite-state device over
► the speech stream (words) and the gesture stream (gesture symbols)
► their combined meaning (meaning symbols)
• The device takes speech and gesture as inputs and creates the meaning output.
• Simulated by two transducers
► G:W, which aligns the gesture and speech streams
► G×W:M, which takes the composite alphabet of speech and gesture symbols as input and outputs meaning
• The speech and gesture inputs are first composed with G:W
• The result, G_W, is then composed with G×W:M
Finite-state Multimodal Understanding
• Representation of the speech input modality
► Lattice of words
• Representation of the gesture input modality
► Represent the range of gesture recognitions as a lattice of symbols

(Figure) Speech lattice over words such as "show phone numbers for these two / ten new american restaurants"; gesture lattice over symbols such as G area sel 2 rest SEM(r12,r15) and G hw loc SEM(points…).
Finite-state Multimodal Understanding
• Representation of combined meaning
► Also represented as a lattice
► Paths in the meaning lattice are well-formed XML, e.g.:

<cmd> <info> <type>phone</type> <obj><rest>r12,r15</rest></obj> </info> </cmd>

(Figure) A meaning lattice path: <cmd> <type> phone </type> <obj> <rest> SEM(r12,r15) </rest> </obj> </cmd>
Finite-state Multimodal Understanding
• Multimodal Grammar Formalism
► Multimodal context-free grammar (MCFG)
◦ e.g., HEADPL → restaurants:rest:<rest> ε:SEM:SEM ε:ε:</rest>
► Terminals are multimodal tokens consisting of three components:
◦ Speech stream : Gesture stream : Combined meaning (W:G:M)
► e.g., "put that there"

S → ε:ε:<cmd> PUTV OBJNP LOCNP ε:ε:</cmd>
PUTV → ε:ε:<act> put:ε:put ε:ε:</act>
OBJNP → ε:ε:<obj> that:Gvehicle:ε ε:SEM:SEM ε:ε:</obj>
LOCNP → ε:ε:<loc> there:Garea:ε ε:SEM:SEM ε:ε:</loc>

Speech: put that there
Gesture: Gvehicle v1 Garea a1
Meaning: <cmd> <act>put</act> <obj>v1</obj> <loc>a1</loc> </cmd>
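The token triples for "put that there" can be walked directly to show how the three tapes interact; a toy linearization of one derivation, not an actual FST implementation (the real system compiles the grammar into transducers and composes lattices):

```python
# Each W:G:M triple consumes from the speech tape (W) and the gesture
# tape (G) and appends to the meaning tape (M); "" plays the role of
# epsilon. SEM copies the gesture's specific content (v1, a1) into
# the meaning.

EPS = ""

TRIPLES = [  # flattened derivation of S -> PUTV OBJNP LOCNP
    (EPS, EPS, "<cmd>"),
    (EPS, EPS, "<act>"), ("put", EPS, "put"), (EPS, EPS, "</act>"),
    (EPS, EPS, "<obj>"), ("that", "Gvehicle", EPS),
    (EPS, "SEM", "SEM"), (EPS, EPS, "</obj>"),
    (EPS, EPS, "<loc>"), ("there", "Garea", EPS),
    (EPS, "SEM", "SEM"), (EPS, EPS, "</loc>"),
    (EPS, EPS, "</cmd>"),
]

def integrate(speech, gesture):
    """speech: list of words; gesture: flat list of gesture symbols
    and their contents, e.g. ["Gvehicle", "v1", "Garea", "a1"]."""
    s, g, meaning = list(speech), list(gesture), []
    for w, gs, m in TRIPLES:
        if w:
            assert s.pop(0) == w       # consume a spoken word
        if gs == "SEM":
            meaning.append(g.pop(0))   # copy gesture content to meaning
        elif gs:
            assert g.pop(0) == gs      # match a gesture symbol
        if m and m != "SEM":
            meaning.append(m)
    return "".join(meaning)
```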
Finite-state Multimodal Understanding
• Multimodal Grammar Example
► Speech: email this person and that organization
► Gesture: Gp SEM Go SEM
► Meaning: email([ person(SEM), org(SEM) ])

S → V NP ε:ε:])
NP → DET N
NP → NP CONJ NP
CONJ → and:ε:,
V → email:ε:email([
V → page:ε:page([
DET → this:ε:ε
DET → that:ε:ε
N → person:Gp:person( ε:SEM:SEM ε:ε:)
N → organization:Go:org( ε:SEM:SEM ε:ε:)
N → department:Gd:dept( ε:SEM:SEM ε:ε:)

(Figure) The corresponding finite-state device over states 0-6, with the rules above compiled into arcs such as email:ε:email([, this:ε:ε, person:Gp:person(, ε:SEM:SEM, and:ε:,, that:ε:ε, organization:Go:org(, ε:ε:), ε:ε:]).
Finite-state Multimodal Understanding
(Figure) The speech lattice ("show phone numbers for these two / ten new american restaurants") and the gesture lattice (G area sel 2 rest SEM(r12,r15); G hw loc SEM(points…)) are fed to the 3-tape multimodal finite-state device, which performs the integration processing and outputs the meaning lattice (<cmd> <type> phone …).
Finite-state Multimodal Understanding
• An example

(Figure) The multimodal grammar transducer (states 0-6) is applied to:
Speech lattice: email this person and that organization
Gesture lattice: Gp SEM Go SEM
yielding the meaning lattice: email([ person( SEM ) , org( SEM ) ])
Robustness in Multimodal Dialog
Robustness in Multimodal Dialog
• Gain robustness via
► Fusion of inputs from multiple modalities
► Using strengths of one mode to compensate for weaknesses of others (at design time and run time)
► Avoiding/correcting errors
► Statistical architecture
► Confirmation
► Dialogue context
► Simplification of language in a multimodal context
► Output affecting/channeling input
• Example approaches
► Edit machine in FST-based multimodal integration and understanding
► Salience-driven approach to robust input interpretation
► N-best re-ranking method for improving speech recognition performance
Edit Machine in FST based MM integration
• Problem of FST-based MM integration: mismatch between the user's input and the language encoded in the grammar

ASR: show cheap restaurants thai places in in chelsea
Grammar: show cheap thai places in chelsea

• How to parse it? Determine which in-grammar string it is most like.

Edits: show cheap ε thai places in ε chelsea
("restaurants" and one "in" are deleted)

To find this, employ the edit machine!
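What the edit machine computes can be illustrated with plain word-level Levenshtein distance against a tiny set of in-grammar strings (the real system does this with weighted FST composition, not enumeration, and the second grammar string is invented):

```python
# Find the in-grammar string reachable from the ASR output with the
# fewest word edits (insertions, deletions, substitutions).

def edit_distance(a, b):
    # word-level Levenshtein distance between word lists a and b
    d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                      # deletion
                          d[i][j - 1] + 1,                      # insertion
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[len(a)][len(b)]

def closest_in_grammar(asr, grammar):
    return min(grammar, key=lambda s: edit_distance(asr.split(), s.split()))

grammar = {"show cheap thai places in chelsea",
           "show cheap italian places in soho"}
asr = "show cheap restaurants thai places in in chelsea"
```

Here the slide's ASR output is two deletions away from the first grammar string ("restaurants" and the doubled "in"), so that string is selected.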
Handcrafted Finite-state Edit Machines
• Edit-based Multimodal Understanding: Basic edit
► Transform the ASR output so that it can be assigned a meaning by the FST-based multimodal understanding model
► Find the string with the least costly number of edits that can be assigned an interpretation by the grammar:

s* = argmin_{s ∈ λ_s} s ∘ λ_edit ∘ λ_g

◦ λ_g: language encoded in the multimodal grammar
◦ λ_s: lattice of strings resulting from ASR
◦ ∘: composition of transducers
Handcrafted Finite-state Edit Machines
• Edit-based Multimodal Understanding: 4-edit
► The basic edit machine is quite large and adds an unacceptable amount of latency (5 s on average).
► Limit the number of edit operations (at most 4).
Handcrafted Finite-state Edit Machines
• Edit-based Multimodal Understanding: Smart edit
► Smart edit is a 4-edit machine + heuristics + refinements
◦ Deletion of SLM-only words (not found in the grammar)
– thai restaurant listings in midtown → thai restaurant in midtown
◦ Deletion of doubled words
– subway to to the cloisters → subway to the cloisters
◦ Subdivided cost classes (icost, dcost: 3 classes)
– High cost: slot fillers (e.g., chinese, cheap, downtown)
– Low cost: dispensable words (e.g., please, would)
– Medium cost: all other words
◦ Auto-completion of place names
– The algorithm enumerates all possible shortenings of place names
– Metropolitan Museum of Art → Metropolitan Museum
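Two of these heuristics are simple enough to sketch directly over the ASR word sequence (the vocabulary below is a made-up stand-in for the grammar's lexicon; cost classes and place-name completion are omitted):

```python
# Smart-edit-style preprocessing sketch: drop immediately repeated
# words, and drop words the grammar does not know (SLM-only words).

def delete_doubled(words):
    out = []
    for w in words:
        if not out or out[-1] != w:
            out.append(w)
    return out

def delete_unknown(words, grammar_vocab):
    return [w for w in words if w in grammar_vocab]

# Toy grammar vocabulary; "restaurants" is an SLM-only word here.
vocab = {"show", "cheap", "thai", "places", "in", "chelsea",
         "subway", "to", "the", "cloisters"}

asr1 = "subway to to the cloisters".split()
asr2 = "show cheap restaurants thai places in chelsea".split()
```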
Learning Edit Patterns
• The user's input is considered a "noisy" version of the parsable ("clean") input.

Noisy (S): show cheap restaurants thai places in in chelsea
Clean (T): show cheap ε thai places in ε chelsea

• Translate the user's input to a string that can be assigned a meaning representation by the grammar
Learning Edit Patterns
• Noisy Channel Model for Error Correction
► Translation probability
◦ S_g: a string that can be assigned a meaning representation by the grammar
◦ S_u: the user's input utterance

S_g* = argmax_{S_g} P(S_u, S_g)

◦ With a Markov (trigram) assumption:

S_g* ≈ argmax_{S_g} Π_i P(S_u^i, S_g^i | S_u^{i−1}, S_g^{i−1}, S_u^{i−2}, S_g^{i−2})

– where S_u = S_u^1 S_u^2 … S_u^n and S_g = S_g^1 S_g^2 … S_g^m
► Word alignment (S_u^i, S_g^i)
◦ GIZA++
Learning Edit Patterns
• Deriving the Translation Corpus
► The finite-state transducer can generate the input strings for a given meaning
► Training the translation model:

(Diagram) For each (meaning, string) pair in the corpus, the multimodal grammar generates the strings for the given meaning; the generated string closest to the user's string is selected as the target string.
Experiments and Results
• 16 first-time users (8 male, 8 female)
• 833 user interactions (218 multimodal / 491 speech-only / 124 pen-only)
• Tasks: finding restaurants of various types and getting their names, phone numbers, and addresses; getting subway directions between locations
• Avg. ASR sentence accuracy: 49%
• Avg. ASR word accuracy: 73.4%
Experiments and Results
• Improvements in concept accuracy

Result of 6-fold cross-validation:
Model                  ConAcc   Rel. Impr.
No edits               38.9%    0%
Basic edit             51.5%    32%
4-edit                 53.0%    36%
Smart edit             60.2%    55%
Smart edit (lattice)   63.2%    62%
MT edit                50.3%    29%

Result of 10-fold cross-validation:
Model       ConAcc
Smart edit  67.4%
MT edit     61.1%
A Salience Driven Approach
• Modify the language model score and rescore the recognized hypotheses
► using the information from the gesture input
► Primed language model
◦ W* = argmax_W P(O|W) P(W)
A Salience Driven Approach
• "People do not make any unnecessary deictic gesture"
► Cognitive theory of conversational implicature
◦ Speakers tend to make their contribution as informative as is required
◦ and not make their contribution more informative than is required
• "Speech and gesture tend to complement each other"
► When a speech utterance is accompanied by a deictic gesture,
◦ the speech input issues commands or inquiries about properties of objects
◦ the deictic gesture indicates the objects of interest
• Gesture is an early indicator that anticipates the content of the subsequent spoken utterance
► 85% of the time, gestures occurred before the corresponding speech unit
A Salience Driven Approach
• A deictic gesture can activate several objects on the graphical display
► It signals a distribution over the objects that are salient

(Figure) Speech "Move this to here" follows the gesture on the timeline; the gesture over the graphical display assigns salience weights, and "a cup" becomes the salient object.
A Salience Driven Approach
• The salient object "a cup" is mapped to the physical world representation
► to indicate a salient part of the representation
◦ such as relevant properties or tasks related to the salient object
• This salient part of the physical world is likely to be the potential content of the speech
A Salience Driven Approach
• Physical world representation
► Domain Model
◦ Relevant knowledge about the domain
– Domain objects
– Properties of objects
– Relations between objects
– Task models related to objects
◦ Frame-based representation
– Frame: domain object
– Frame elements: attributes and tasks related to the object
► Domain Grammar
◦ Specifies the grammar and vocabulary used to process language inputs
– Semantics-based context-free grammar
– Non-terminal: semantic tag
– Terminal: word (value of the semantic tag)
– Annotated user spoken utterances
– Relevant semantic information
– N-grams
Salience Modeling
• Calculating a salience distribution over entities in the physical world
► The salience value of an entity at time t_n is influenced by a joint effect from the sequence of gestures that happen before t_n

P_{t_n}(e_k) = [ Σ_{i=1..m} α_{t_n}(g_{t_i}) P(e_k | g_{t_i}) ] / [ Σ_e Σ_{i=1..m} α_{t_n}(g_{t_i}) P(e | g_{t_i}) ]

P_{t_n}(e_k): salience value of entity e_k at time t_n
P(e | g_t): object selectivity of entity e given gesture g_t
α_{t_n}(g_{t_i}): weight of the salience value at time t_n given a gesture at time t_i
Salience Modeling
In the salience formula:
► The numerator sums P(e_k | g) over all gestures before time t_n, weighted by α
► The denominator is a normalizing factor: the sum of the salience values of all entities at time t_n
► The weight decays with the time distance between gesture and utterance:

α_{t_n}(g_{t_i}) = exp( −(t_n − t_i) / 2000 )

► The closer the gesture, the higher its impact on the salience distribution
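A numeric sketch of this computation (entity names, selectivity tables, and timestamps are invented; times are in milliseconds to match the decay constant of 2000 on the slide):

```python
# Salience distribution: each prior gesture contributes its object
# selectivity P(e | g_ti), weighted by a factor that decays with the
# time distance to t_n; the result is normalized over all entities.

import math

def salience(gestures, t_n, decay=2000.0):
    """gestures: list of (t_i, {entity: P(entity | gesture at t_i)})."""
    scores = {}
    for t_i, selectivity in gestures:
        alpha = math.exp(-(t_n - t_i) / decay)  # closer gesture, higher weight
        for entity, p in selectivity.items():
            scores[entity] = scores.get(entity, 0.0) + alpha * p
    total = sum(scores.values())                # normalizing factor
    return {e: v / total for e, v in scores.items()}

# Two gestures: an early one favoring the cup, a recent one favoring
# the plate. The recent gesture dominates the distribution at t_n.
gestures = [(0,    {"cup": 0.7, "plate": 0.3}),
            (1500, {"cup": 0.2, "plate": 0.8})]
dist = salience(gestures, t_n=2000)
```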
Salience Driven Spoken Language Understanding
• Maps the salience distribution to the physical world representation
• Uses the salient world to influence spoken language understanding
• Primes language models to facilitate language understanding
► Rescoring the hypotheses of the speech recognizer using the primed language model score
Primed Language Model
• The primed language model is based on the class-based bigram model
► Class: semantic and functional classes for the domain
◦ e.g., this → Demonstrative, price → AttrPrice

P(w_i | w_{i−1}) ≈ P(w_i | c_i) P(c_i | c_{i−1})
(word class probability × class transition probability)

► Modify the word class probability
◦ Originally it measures the probability of seeing a word w_i given a class c_i
◦ It is modified so that the choice of word w_i depends on the salient physical world, represented by the salience distribution P(e):

P(w_i | c_i) = Σ_{e_k} [ P(w_i, c_i | e_k) / P(c_i | e_k) ] · P_{t_i}(e_k)

◦ P(w_i, c_i | e_k) and P(c_i | e_k) do not depend on the time t_i
◦ They can be estimated from the training data
• Speech hypotheses are reordered according to the primed language model.
Evaluation - WER
• Domain: real estate properties
• Interface: speech + pen gesture
• 11 users tested: five non-native speakers and six native speakers
• 226 user inputs with an average of 8 words per utterance
• Average WER reduction is about 12% (t = 4.75, p < 0.001)

User index   # of inputs   # inputs w/o gesture   Baseline WER
1            21            0                      0.287
2            31            0                      0.335
3            27            0                      0.399
4            10            0                      0.680
5             8            1                      0.200
6            36            0                      0.387
7            18            0                      0.250
8            25            1                      0.278
9            23            0                      0.482
10           11            0                      0.117
11           16            3                      0.255
Evaluation – Concept Identification
• Examples of improved cases
► Transcription: What is the population of this town
► Baseline: What is the publisher of this time
► Salience-based: What is the population of this town

► Transcription: How much is this gray house
► Baseline: How much is this great house
► Salience-based: How much is this gray house

Concept identification   Baseline   Salience-based
Precision                 80.3%      84.6%
Recall                    75.7%      83.8%
F-measure                 77.9%      84.2%
N-best Re-ranking for Improving Speech Recognition Performance
• Using multimodal understanding features
► Example: the user says "Put this over here" with a pen gesture, but a speech recognition error garbles the utterance
► SLU result:
Speech Act: request
Main Goal: move
Component Slots: Target.Loc: here
► Missing the slot Source.Item: this!
• Using N-best ASR hypotheses
► Rescore the hypotheses with rich information
► that is not available during speech recognition
► We use multimodal understanding features

(Figure) The re-ranking model, using many features, reorders the N-best list so that the correct hypothesis moves to the top.
Speech Recognizer Features
• Speech recognizer score: P(W|X)
• Acoustic model score: P(X|W)
• Language model score: P(W)
• N-best word rate: gives more confidence to a word which occurs in many hypotheses
• N-best homogeneity: gives more weight to a word which appears in higher-ranked hypotheses, by weighting each word by the score of the hypothesis in which it appears

N-best word rate(w_i) = (number of hypotheses containing w_i) / (number of hypotheses in the N-best list)

N-best homogeneity(w_i) = (sum of scores of hypotheses containing w_i) / (sum of scores of hypotheses in the N-best list)
SLU features
• CRF confidence score: the confidence score of the SLU results
► Confidence scores of the speech act and main goal:
P(speech act | word sequence), P(main goal | word sequence)
◦ Derived from the CRF formulation:

P(y | x) = (1 / Z_x) exp( Σ_{t=1..T} Σ_k λ_k f_k(y_{t−1}, y_t, x, t) )

◦ y: output variable
◦ x: input variable
◦ Z_x: normalization factor
◦ f_k(y_{t−1}, y_t, x, t): arbitrary linguistic feature function (often binary-valued)
◦ λ_k: trained parameter associated with feature f_k
SLU features
• CRF confidence score (cont.)
► Confidence score of a component slot
◦ y_t: component slot
◦ x_t: corresponding word

Conf(y_t, x_t) = Σ_k λ_k f_k(y_{t−1}, y_t, x_t) / Σ_{y_t} Σ_k λ_k f_k(y_{t−1}, y_t, x_t)
Multimodal Understanding Features
• Multimodal reference resolution score
► A well-recognized speech hypothesis tends to resolve well.
► (a) Well recognized: "Clean this room and this bathroom", with a pen gesture accompanying each referring expression.
► (b) Misrecognized: "Clean this room and this bad noon". "this bad noon" cannot be a referring expression, so the second pen gesture gets a low reference resolution score.
► (c) Misrecognized: "Clean this room and this bad noon", with "this bad noon" taken as a referring expression; it has a low reference resolution score.
Experimental Setup
• Corpus
► 617 multimodal inputs
◦ 118 (speech + pen gesture) + 499 (speech only)
◦ 3135 words, 5.08 words per utterance
◦ Vocabulary size: 396
• Speech Recognizer
► An HTK-based Korean speech recognizer trained with 39-dimensional MFCC feature vectors
► Outputs 75-best lists
Experimental Result (WER)
• Comparison of word error rate between the baseline and the N-best re-ranking model with varied feature sets
► Relative error reduction rate: 7.95%
► The re-ranking model has a significantly smaller word error rate than the baseline system (p < 0.001)

Model                                        WER (%)
baseline                                      17.74
+ speech recognizer features                  17.38
+ SLU features                                16.43
+ multimodal reference resolution features    16.33
Experimental Results (WER)
• Word error rates of the N-best re-ranking model with varied N
► If N is too large → many noisy hypotheses
► If N is too small → small candidate set and few clues for re-ranking
Experimental Results (CER)
• Comparison of concept error rate between the baseline and the N-best re-ranking model
► Relative error reduction rate: 10.13%
► The re-ranking model has a significantly smaller concept error rate than the baseline system (p < 0.01)

Model                                        CER (%)
baseline                                      14.28
+ speech recognizer features                  13.81
+ SLU features                                13.11
+ multimodal reference resolution features    12.83
Reading List
• R. A. Bolt, 1980, “Put that there: Voice and gesture at the graphics interface,” Computer Graphics Vol. 14, no. 3, 262-270.
• J. Chai, S. Pan, M. Zhou, and K. Houck, 2002, Context-based Multimodal Understanding in Conversational Systems. Proceedings of the Fourth International Conference on Multimodal Interfaces (ICMI).
• J. Chai, P. Hong, and M. Zhou, 2004, A Probabilistic Approach to Reference Resolution in Multimodal User Interfaces. Proceedings of 9th International Conference on Intelligent User Interfaces (IUI-04), 70-77.
• J. Chai, Z. Prasov, J. Blaim, and R. Jin., 2005, Linguistic Theories in Efficient Multimodal Reference Resolution: an Empirical Investigation. Proceedings of the 10th International Conference on Intelligent User Interfaces (IUI-05), 43-50.
• J. Chai, S. Qu, A Salience Driven Approach to Robust Input Interpretation in Multimodal Conversational Systems, In Proceedings of the HLT/EMNLP 2005
• H. Holzapfel, K. Nickel, R. Stiefelhagen, 2004, Implementation and Evaluation of a Constraint-Based Multimodal Fusion System for Speech and 3D Pointing Gestures. Proceedings of the International Conference on Multimodal Interfaces (ICMI).
• M. Johnston, 1998, Unification-based multimodal parsing. Proceedings of the International Joint Conference of the Association for Computational Linguistics and the International Committee on Computational Linguistics, 624-630.
• M. Johnston, and S. Bangalore. 2000. Finite-state multimodal parsing and understanding. Proceedings of COLING-2000.
• M. Johnston, S. Bangalore, G. Vasireddy, A. Stent, P. Ehlen, M. Walker, S. Whittaker, and P. Maloor. 2002. MATCH: An architecture for multimodal dialogue systems. In Proceedings of ACL-2002.
• M. Johnston, S. Bangalore, Learning Edit Machines For Robust Multimodal Understanding, In Proceedings of the ICASSP 2006
• P.R. Cohen, M. Johnston, D.R. McGee, S.L. Oviatt, J.A. Pittman, I. Smith, L. Chen, and J. Clow, 1997, "QuickSet: Multimodal Interaction for Distributed Applications," Intl. Multimedia Conference, 31-40.
• S. L. Oviatt , A. DeAngeli, and K. Kuhn, 1997, Integration and synchronization of input modes during multimodal human-computer interaction. In Proceedings of Conference on Human Factors in Computing Systems: CHI '97.