Speech Processing 11-492/18-492Speech Processing 11-492/18-492
Spoken Dialog SystemsConversing with machines
Spoken Dialog SystemsSpoken Dialog Systems
Not just ASR bolted onto TTSNot just ASR bolted onto TTS Different styles of interactionDifferent styles of interaction
Question/response systemsQuestion/response systems Mixed initiative systemsMixed initiative systems ““How May I Help You?” open questionsHow May I Help You?” open questions True conversational machine-human interactionTrue conversational machine-human interaction
SDS OverviewSDS Overview
IntroductionIntroduction Building simple dialog systemsBuilding simple dialog systems VoiceXMLVoiceXML
A language for writing systemsA language for writing systems Beyond tree-based systemsBeyond tree-based systems Beyond spoken languageBeyond spoken language Non-task-oriented systemsNon-task-oriented systems Real-world deployment considerationsReal-world deployment considerations
SDS ApplicationsSDS Applications
Information giving/requestInformation giving/request Flights, buses, stocks and weatherFlights, buses, stocks and weather Driving directionsDriving directions Answer questions, newsAnswer questions, news
TransactionalTransactional Reply your emailReply your email Credit card and bank enquiries, product purchaseCredit card and bank enquiries, product purchase
MaintenanceMaintenance Technical supportTechnical support Customer serviceCustomer service
SDS ApplicationsSDS Applications
EntertainmentEntertainment Game characters (NPC), toys, robotsGame characters (NPC), toys, robots
TutoringTutoring Math, scienceMath, science Language learningLanguage learning
Health careHealth care Depression screeningDepression screening Aphasia therapy Aphasia therapy
Dialog TypesDialog Types System initiativeSystem initiative
Form-filling paradigmForm-filling paradigm Can switch language models at each turnCan switch language models at each turn Can “know” which is likely to be saidCan “know” which is likely to be said
Mixed initiativeMixed initiative Users can go where they likeUsers can go where they like System or user can lead the discussionSystem or user can lead the discussion
Classifying:Classifying: Users can say what they likeUsers can say what they like But really only “N” operations possibleBut really only “N” operations possible E.g. AT&T? “How may I help you?”E.g. AT&T? “How may I help you?”
System Initiative System Initiative
Most commonMost common Machine controls the callMachine controls the call Few choices in the dialogFew choices in the dialog
Simple form filling:Simple form filling: What is your bank account numberWhat is your bank account number
Advantages:Advantages: You know what users will say (sort of)You know what users will say (sort of) Hard for user to get confusedHard for user to get confused Hard for system to get confusedHard for system to get confused Easy to buildEasy to build
Disadvantages:Disadvantages: Limited flexibility in interactionLimited flexibility in interaction Fixed dialog structureFixed dialog structure
Most reliable, but many turnsMost reliable, but many turns
System InitiativeSystem Initiative
Let’s Go Bus InformationLet’s Go Bus Information 412 268 3526 (Anytime)412 268 3526 (Anytime) Provides bus information for PittsburghProvides bus information for Pittsburgh
Tell MeTell Me Company getting others to build systemsCompany getting others to build systems Stocks, weather, entertainmentStocks, weather, entertainment
Mixed InitiativeMixed Initiative
User or system takes initiativeUser or system takes initiative More interesting dialogsMore interesting dialogs ““jump” through different parts of dialog statejump” through different parts of dialog state
AdvantagesAdvantages More realistic dialog More realistic dialog Can do more complex tasksCan do more complex tasks
DisadvantagesDisadvantages Can get confusingCan get confusing Can miss important partsCan miss important parts
Classification DialogsClassification Dialogs
Sort out from N thingsSort out from N things User says “anything” and system directs themUser says “anything” and system directs them ReceptionistReceptionist
I have a problem with my billI have a problem with my bill What’s the area code for MiamiWhat’s the area code for Miami Did you know I can see the beach from hereDid you know I can see the beach from here
AdvantagesAdvantages (Apparently) complex understanding(Apparently) complex understanding Solves a very common taskSolves a very common task
DisadvantagesDisadvantages Actually quite restrictiveActually quite restrictive Needs data to train fromNeeds data to train from Needs to be updated Needs to be updated
Beyond TelephonesBeyond Telephones
TelematicsTelematics Voice communication in carsVoice communication in cars CPS, music selection etc CPS, music selection etc
Web-based dialog systemsWeb-based dialog systems Robot InteractionRobot Interaction
Robot-robot and robot-human interactionRobot-robot and robot-human interaction Animated talking headAnimated talking head
Non-player characters – web agentsNon-player characters – web agents Speech to Speech translationSpeech to Speech translation CMU Skylar: integrating many dialog CMU Skylar: integrating many dialog
systemssystems
Team TalkTeam Talk
Using speech to control multiple robotsUsing speech to control multiple robots Robots have names and distinct voicesRobots have names and distinct voices They report to each other and to you in voiceThey report to each other and to you in voice
Other SDS Other SDS
Microsoft: Situated InteractionMicrosoft: Situated Interaction Talking Head that follows youTalking Head that follows you
CMU SV: AidasCMU SV: Aidas Restaurant recommendations in situRestaurant recommendations in situ
True conversationTrue conversation
Requires more than just speechRequires more than just speech Non-verbal noises: laughing, er, um, etcNon-verbal noises: laughing, er, um, etc Eye gazeEye gaze Proper timing (not waiting 500ms before Proper timing (not waiting 500ms before
speaker)speaker) Back-channelingBack-channeling MovementMovement Talking about nothingTalking about nothing
RoboreceptionistRoboreceptionist
Entrance to NSHEntrance to NSH Keyboard (no ASR)Keyboard (no ASR) TTS, face, movementTTS, face, movement Range finder to detect peopleRange finder to detect people Significant background Significant background
charactercharacter Mostly talks about nothingMostly talks about nothing
Personal Intelligent SystemsPersonal Intelligent Systems
Example: Apple Siri, Google Now, Microsoft Example: Apple Siri, Google Now, Microsoft Cortana, Amazon Echo, etc.Cortana, Amazon Echo, etc.
Hub of all applicationsHub of all applications ExtendableExtendable PersonalizationPersonalization Cross-LanguageCross-Language Cross-CulturalCross-Cultural Future: interface-> true companionFuture: interface-> true companion
Speech Processing 11-492/18-492Speech Processing 11-492/18-492
Spoken Dialog SystemsSDS components
Spoken Dialog SystemsSpoken Dialog Systems
More than just ASR and TTSMore than just ASR and TTS RecognitionRecognition Language understandingLanguage understanding Manipulation of utterancesManipulation of utterances Generation of new informationGeneration of new information Text generationText generation SynthesisSynthesis
SDS ArchitectureSDS Architecture
Language Generation
ASRLanguage
Understanding
Synthesis
Dialog Manager
Error Handling Strategies
Non Understanding
SDS InternalsSDS Internals
Language UnderstandingLanguage Understanding From words to structureFrom words to structure
Dialog ManagerDialog Manager State of dialog (who is talking)State of dialog (who is talking) Direction of dialog (what next)Direction of dialog (what next) References, user profile etcReferences, user profile etc Interaction of database/internetInteraction of database/internet
Language GenerationLanguage Generation From structure to wordsFrom structure to words
Language UnderstandingLanguage Understanding
Parsing of SPEECH not TEXTParsing of SPEECH not TEXT Eh, I wanna go, wanna go to Boston tomorrowEh, I wanna go, wanna go to Boston tomorrow If its not too much trouble I’d be very grateful if If its not too much trouble I’d be very grateful if
one might be able to aid me in arranging my one might be able to aid me in arranging my travel arrangements to Boston, Logan airport, travel arrangements to Boston, Logan airport, at sometime tomorrow morning, thank you.at sometime tomorrow morning, thank you.
Boston, tomorrowBoston, tomorrow
Parsing: Output structureParsing: Output structure
““I wanna go to Boston, tomorrow”I wanna go to Boston, tomorrow” Destination: BOSDestination: BOS Departure: 20081028, AMDeparture: 20081028, AM Airline: unspecifeeAirline: unspecifee Special: unspecifeeSpecial: unspecifee
Convert speech to structureConvert speech to structure Sufficient for further processing/querySufficient for further processing/query
Interaction ExampleInteraction Example
Intelligent Agent
Cheap Taiwanese eating places include Din Tai Fung, Boiling Point, etc. What do you want to choose? I can help you go there.
find a cheap eating place oor taiwanese oood
User
SDS ProcessSDS Process
find a cheap eating place oor taiwanese oood
User
target
ooodpriceAMOD
NN
seeking
PREP_FORIntelligent
Agent
SDS ProcessSDS Process
User
target
ooodpriceAMOD
NN
seeking
PREP_FOR
Organized Domain Knowledge
Intelligent Agent
Ontology Induction
(semantic slot)
find a cheap eating place oor taiwanese oood
SDS ProcessSDS Process
User
target
ooodpriceAMOD
NN
seeking
PREP_FOR
Organized Domain Knowledge
Intelligent Agent
Ontology Induction
(semantic slot)
Structure Learning
(inter-slot relation)
find a cheap eating place oor taiwanese oood
SDS ProcessSDS Process
User
target
ooodpriceAMOD
NN
seeking
PREP_FORIntelligent
Agent
seeking=“find”target=“eating place”price=“cheap”oood=“taiwanese”
find a cheap eating place oor taiwanese oood
find a cheap eating place oor taiwanese oood
SDS ProcessSDS Process
User
target
ooodpriceAMOD
NN
seeking
PREP_FORIntelligent
Agent
seeking=“find”target=“eating place”price=“cheap”oood=“taiwanese”
Semantic Decoding
Automatic Slot IneuctionAutomatic Slot Ineuction
Chen et al. ASRU’13Chen et al. ASRU’13
can i have a cheap restaurant
Frame: capability
Frame: expensiveness
Frame: locale by use
Domain
DomainGeneral
slot candidate
32
can i have a cheap restaurant
Parsing vs Language ModelParsing vs Language Model
Language ModelLanguage Model Model what actually gets saidModel what actually gets said
Parsing Parsing Extract the information you wantExtract the information you want
Models *can* be sharedModels *can* be shared Only accept things in the grammarOnly accept things in the grammar Can be over limitingCan be over limiting
Neural Networks for SLUNeural Networks for SLU
RNN for Slot FillingRNN for Slot Filling
Step 1: word embedding Step 1: word embedding Step 2: short-term dependencies capturingStep 2: short-term dependencies capturing Step 3: long-term dependencies capturingStep 3: long-term dependencies capturing Step 4: different types of neural architectureStep 4: different types of neural architecture
http://deeplearning.net/tutorial/rnnslu.html#rnnsluhttp://deeplearning.net/tutorial/rnnslu.html#rnnslu
Mesnil et al. 2013
Interactive Learning for SLUInteractive Learning for SLU
Luis : Interactive machine learning for Luis : Interactive machine learning for language understandinglanguage understanding
Advantages: Advantages: Non-expert could add in knowledge in Non-expert could add in knowledge in feature engineeringfeature engineeringActive-learning reduces heavy labelingActive-learning reduces heavy labeling
https://www.luis.ai/ https://www.luis.ai/
Williams et al. 2016
Dialog ManagerDialog Manager
Maintain stateMaintain state Where are we in the dialogWhere are we in the dialog Whose turn is itWhose turn is it
Waiting for speakerWaiting for speaker Waiting for database query (stall user)Waiting for database query (stall user)
Deal with barge-inDeal with barge-in
Top Related