SpeechBuilder: Facilitating Spoken Dialogue System Creation


Eugene Weinstein

Project Oxygen Core Team

MIT Laboratory for Computer Science

[email protected]

Eugene Weinstein – MIT Lab for Computer Science Oxygen Alliance 2003 Workshop – February 24-28, 2003

• Developing robust, mixed-initiative spoken dialogue systems is difficult

– Complex systems can be created by human-language technology experts

[Figure: SpeechBuilder and the GALAXY architecture. A central Hub is connected to the Speech Recognition, Language Processing, Context Resolution, Dialogue Management, Language Generation, Speech Synthesis, Database, and Audio servers.]

Bridging the Experience Gap

• SpeechBuilder aims to help novices rapidly create speech-based systems

– Uses intuitive methods for specifying domain-specific constraints

– Automatically configures HLT components using MIT GALAXY architecture

* Leverages future technical advances

* Encourages research on portability

– Novice developers must overcome a considerable technical challenge


[Figure: baseline configuration. The Hub connects the Audio, Speech Recognition, Language Processing, and Speech Synthesis servers to the SpeechBuilder Server, which performs CGI parameter generation and talks to the developer application over HTTP.]

• Gives developer total control over application functionality


• Communication with Galaxy via simple HTTP protocol

“Turn on the lights in the kitchen”

action=set&frame=(object=lights, room=kitchen, value=on)

“Show me the banks on Main Street”

action=identify&frame=(object=(type=bank, on=(street=Main, ext=Street)))

Baseline Configuration
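On the application side, the HTTP parameter strings in the Baseline Configuration examples could be unpacked with a small recursive parser. The following is a minimal sketch in Python, assuming the frame value arrives exactly as printed on the slide (nested, parenthesised key=value lists, with no URL encoding). The names `parse_frame` and `parse_request` are illustrative and are not part of SpeechBuilder.

```python
def parse_frame(text):
    """Parse '(key=value, key=(nested=value))' into a nested dict."""
    pos = 0

    def skip_spaces():
        nonlocal pos
        while pos < len(text) and text[pos].isspace():
            pos += 1

    def parse_group():
        nonlocal pos
        skip_spaces()
        if pos >= len(text) or text[pos] != "(":
            raise ValueError("expected '(' at position %d" % pos)
        pos += 1
        group = {}
        while True:
            skip_spaces()
            if pos >= len(text):
                raise ValueError("unbalanced parentheses")
            if text[pos] == ")":
                pos += 1
                return group
            if text[pos] == ",":
                pos += 1
                continue
            # Read the key up to '='.
            start = pos
            while pos < len(text) and text[pos] not in "=(),":
                pos += 1
            key = text[start:pos].strip()
            if pos >= len(text) or text[pos] != "=":
                raise ValueError("expected '=' after key %r" % key)
            pos += 1
            skip_spaces()
            if pos < len(text) and text[pos] == "(":
                # The value is itself a nested group.
                group[key] = parse_group()
            else:
                start = pos
                while pos < len(text) and text[pos] not in ",()":
                    pos += 1
                group[key] = text[start:pos].strip()

    frame = parse_group()
    skip_spaces()
    if pos != len(text):
        raise ValueError("unexpected text after frame")
    return frame


def parse_request(query):
    """Split 'action=...&frame=(...)' into (action, frame dict)."""
    action_part, sep, frame_part = query.partition("&frame=")
    if not sep or not action_part.startswith("action="):
        raise ValueError("expected 'action=...&frame=(...)'")
    return action_part[len("action="):], parse_frame(frame_part)
```

Calling `parse_request` on the first example string yields the action `set` and a flat dictionary of `object`, `room`, and `value`; the second example yields a nested dictionary under `object`. A real CGI script would normally URL-decode the parameters before this step.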


• Still gives developer total control over application functionality

• Frame Relay server exposes Galaxy meaning representation to app



“Turn on the lights in the kitchen”

{c turn_management
   :parse_frame {c turn
                 :object “lights” :room “kitchen”
                 :value “on”}}

“Show me the banks on Main Street”

{c turn_management
   :parse_frame {c identify “type” bank
                 :pred {p :on {:street “Main”
                               :ext “Street”}}}}

Modified Baseline Configuration (this class)

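[Figure: modified baseline configuration. The Hub connects the Speech Recognition, Language Processing, and Speech Synthesis servers to the Frame Relay Server, which passes the semantic frame to the application over a TCP socket.]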


• For a speech-based interface to structured data

• No programming required; specify table(s) and constraints

[Figure: database access configuration. The Hub connects the Audio, I/O, Speech Recognition, Language Processing, Discourse Resolution, Dialogue Management, Language Generation, Speech Synthesis, and Database servers; the Database Server is attached to a database labeled INFO.]

Database Access Configuration **


Step 1: Off-line creation and compilation

Step 2: On-line deployment

[Figure: two-step workflow. Labels: Upload, Compile, SB, Hub, Audio, ASR, NLU, Discourse, Dialog, NLG, TTS, INFO, Query, Response.]

Creating a Speech-Based Application


Audio Server

• Telephone or lightweight audio server

Database Server

• Accesses back-end database


Language Processing

• N-best interface with ASR

• Grammar from attributes & actions

• Backs off to concept spotting
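The back-off behaviour in the last bullet can be illustrated with a toy example. This is only a sketch: regular expressions stand in for the real parsing grammar, and the vocabulary, the pattern, and the function names (`full_parse`, `spot_concepts`, `understand`) are all hypothetical. It tries a whole-sentence pattern first and, if that fails, falls back to spotting known concept values anywhere in the utterance.

```python
import re

# Hypothetical concept vocabulary: concept name -> known values.
CONCEPTS = {
    "object": ["lights", "television"],
    "room": ["kitchen", "living room"],
    "value": ["on", "off"],
}

# Hypothetical sentence-level pattern standing in for the full grammar.
ACTION_PATTERNS = {
    "set": re.compile(
        r"^turn (?P<value>on|off) the (?P<object>lights|television)"
        r" in the (?P<room>kitchen|living room)$"
    ),
}


def full_parse(utterance):
    """Return (action, concepts) if a whole-sentence pattern matches."""
    for action, pattern in ACTION_PATTERNS.items():
        match = pattern.match(utterance)
        if match:
            return action, match.groupdict()
    return None


def spot_concepts(utterance):
    """Back-off: collect any known concept values found in the words."""
    found = {}
    for concept, values in CONCEPTS.items():
        for value in values:
            if re.search(r"\b%s\b" % re.escape(value), utterance):
                found[concept] = value
                break
    return found


def understand(utterance):
    """Try the full parse first, then back off to concept spotting."""
    text = utterance.lower().strip()
    parsed = full_parse(text)
    if parsed is not None:
        return parsed
    return None, spot_concepts(text)
```

With this sketch, a well-formed command returns both an action and its concepts, while a fragment such as "um the kitchen lights please" returns no action but still recovers the `room` and `object` concepts.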

Context Resolution

• New component performs concept inheritance & masking

• Processes ‘E-form’

Dialogue Management

• Generic server handles interaction

Speech Synthesis

• Commercial product

Language Generation

• Generates ‘E-form’, SQL, & responses

• Default entries made

• Galaxy programmable hub controls interactions between all components

Hub

Human Language Technologies

Speech Recognition

• Generic acoustic models

• Unknown word model

• Class or hierarchical n-gram


• Some columns are used to access entries (e.g., Name)

– Column entries must be incorporated into ASR & NLU

• Some columns are only used in responses (e.g., Phone)

– Column names must be incorporated into ASR & NLU

Name              Phone    Email              Office
Jim Glass         x3-1640  [email protected]  603
Stephanie Seneff  x3-0451  [email protected]  643
Victor Zue        x3-8513  [email protected]  601a

“What is the phone number for Victor Zue?”

Extracting Database Information **
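The distinction between access columns and response-only columns can be made concrete with a short sketch. It assumes the table is available as a list of dictionaries; the rows are copied from the example table (the Email column is omitted), while `extract_vocabulary` and `lookup` are hypothetical helpers and not SpeechBuilder functions. Access columns contribute their entries, response columns contribute only their names.

```python
# Rows copied from the example table (Email column omitted).
ROWS = [
    {"Name": "Jim Glass", "Phone": "x3-1640", "Office": "603"},
    {"Name": "Stephanie Seneff", "Phone": "x3-0451", "Office": "643"},
    {"Name": "Victor Zue", "Phone": "x3-8513", "Office": "601a"},
]


def extract_vocabulary(rows, access_columns, response_columns):
    """Collect what the recognizer and parser need to know about the table.

    Access columns contribute their entries (the values users will say);
    response columns contribute only their names.
    """
    entries = {}
    for column in access_columns:
        entries[column] = sorted({row[column] for row in rows})
    properties = {column.lower(): column for column in response_columns}
    return {"entries": entries, "properties": properties}


def lookup(rows, vocab, question):
    """Answer a question by spotting one property word and one entry."""
    text = question.lower()
    wanted = next(
        (col for word, col in vocab["properties"].items() if word in text),
        None,
    )
    if wanted is None:
        return None
    for column, values in vocab["entries"].items():
        for value in values:
            if value.lower() in text:
                for row in rows:
                    if row[column] == value:
                        return row[wanted]
    return None
```

For example, building the vocabulary with `Name` as the access column and `Phone` and `Office` as response columns lets `lookup` answer the question on the slide by returning Victor Zue's phone extension from the table.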


Knowledge Representation

• Concepts and actions form basis for understanding

– Concepts become key/value entries in meaning representation

* city: Boston, New York…

* day: Monday, Tuesday

– Actions provide sentence-level patterns of specific queries

* “I want to fly from Boston to Taipei…” action=lookup_flight

– Action text can be bracketed to define hierarchical concepts **

* “I want to fly source=(from Boston) destination=(to Taipei)”

* source=Boston destination=Taipei

– Concepts and actions used to configure the following components

* Speech Recognition

* Natural Language Understanding

* Discourse

• Database columns define basic concepts

– Column names can be grouped into concepts

* property: phone, email… weather: snow, rain…
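The bracketed action text in the flight example can be read as both a sentence pattern and a set of hierarchical concepts. The sketch below shows one way to separate the two; it is an assumption about how such an example might be processed, not SpeechBuilder's actual implementation, and `expand_action`, `to_meaning`, and the `CITY` list are hypothetical.

```python
import re

# Matches name=(words) with no nested parentheses.
BRACKET = re.compile(r"(\w+)=\(([^()]*)\)")

# Hypothetical values for a "city" concept.
CITY = ["Boston", "New York", "Taipei"]


def expand_action(example):
    """Return (plain sentence, {hierarchical concept: bracketed words})."""
    groups = {name: words for name, words in BRACKET.findall(example)}
    sentence = BRACKET.sub(lambda m: m.group(2), example)
    return sentence, groups


def to_meaning(groups, values):
    """Reduce each bracketed phrase to the concept value it contains."""
    meaning = {}
    for name, words in groups.items():
        for value in values:
            if value in words:
                meaning[name] = value
                break
    return meaning
```

Applied to the slide's example, `expand_action` recovers the plain sentence "I want to fly from Boston to Taipei" plus the `source` and `destination` phrases, and `to_meaning` reduces those phrases to `source=Boston` and `destination=Taipei`.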


• Concept usage can be fine-tuned to improve performance:**

• By default, concepts are used for language modeling, parsing grammar, and meaning representation

– For language modeling and parsing grammar only (i.e., no meaning)

– For keyword spotting only (i.e., no role in language modeling)

– For fine-grained language modeling with coarser meaning representation

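[Figure: fine-grained weather vocabulary (rain, hail, snow, sprinkles, flurries, showers, breezy, rainy, snowy, snowfall, accumulation, rainfall, snowstorm, thunderstorm, blizzard) used for language modeling, with the query “Will it snow?” mapped to the coarser meaning weather: snow.]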

Language Modeling and Understanding
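The last option, fine-grained language modeling with a coarser meaning representation, amounts to keeping many surface words in the recognizer vocabulary while collapsing them to a few concept values. A minimal sketch follows; the assignment of individual words to `snow` or `rain` is an assumption made for illustration, as are the function names.

```python
import re

# Hypothetical assignment of fine-grained words to coarse meaning values.
WEATHER_WORDS = {
    "snow": "snow", "snowy": "snow", "snowfall": "snow",
    "snowstorm": "snow", "flurries": "snow", "blizzard": "snow",
    "rain": "rain", "rainy": "rain", "rainfall": "rain",
    "showers": "rain", "sprinkles": "rain",
}


def language_model_vocabulary():
    """Every surface word stays distinct for language modeling."""
    return sorted(WEATHER_WORDS)


def meaning(utterance):
    """Map the first matching surface word to its coarse weather value."""
    for word in re.findall(r"[a-z]+", utterance.lower()):
        if word in WEATHER_WORDS:
            return {"weather": WEATHER_WORDS[word]}
    return {}
```

Here "Will it snow?" and "Any flurries tonight?" both produce `weather: snow`, even though the language model still treats "snow" and "flurries" as separate words.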


Current Status

• SpeechBuilder has been operational for over two years

– Used by over 50 developers from MIT and elsewhere

– Used in undergraduate classes at MIT and Georgetown University

• ASR capabilities benchmarked against main systems

– Achieves same ASR performance as MIT Jupiter weather information system (6.8% word error rate on clean data) (phone #)

• Several prototype systems have been developed

– Information about faculty, staff and students at LCS and AI Labs (phone, email, room, voice messages, transfer, etc.)

– Application to control the various physical items in a typical office (lights, curtains, TV, VCR, projector, etc.)

– Others include TV schedules, real-time weather forecasts, and hotel and restaurant information

• SpeechBuilder used for initial design of many more complex domains


• Increase sophistication of discourse and dialogue manager to handle more complex dialogues

– Enable finer specification of discourse capabilities

– Add generic capabilities for times, dates, etc.

• Incorporate confidence scoring and implement unsupervised training of acoustic and language models

• Create functionality to allow developers to create domain-specific concatenative speech synthesis

• Create alternative methods of domain specifications to streamline development

– Advanced developers don’t necessarily use web interface

– Allow for more efficient automatic generation of SpeechBuilder domains

Ongoing and Future Work


Issam Bazzi

Scott Cyphers

Ed Filisko

Jim Glass

TJ Hazen

Lee Hetherington

Joe Polifroni

Stephanie Seneff

Michelle Spina

Eugene Weinstein

Jon Yi

Misha Zitser

Acknowledgements


SpeechBuilder Hands-on Activity





SpeechBuilder API

Galaxy Frame Relay

• Galaxy meaning representation provided through frame relay

• Applications connect via TCP sockets

• API provided in Perl, Python, and Java

– This class: Python API

Python class galaxy.server.Server methods:

– Constructor(machine, port, ID)

– connect()

– processMessage(blocking)

– disconnect()

Python class galaxy.frame.Frame methods:

– getAction()

– getAttribute(attr_name)

– getText()

– toString()
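To show how an application might use these methods together, here is a hedged sketch of a message loop. The real `galaxy` package is not reproduced here, so the `Server` and `Frame` classes below are stand-ins that only mimic the method names listed on the slide. Several details are assumptions: that `processMessage` returns a frame object (or `None` when nothing is pending), the attribute names `object`, `room`, and `value`, and the helper functions `respond` and `run`.

```python
class Frame:
    """Stand-in for galaxy.frame.Frame (method names from the slide)."""

    def __init__(self, action, attributes, text):
        self._action = action
        self._attributes = dict(attributes)
        self._text = text

    def getAction(self):
        return self._action

    def getAttribute(self, attr_name):
        return self._attributes.get(attr_name)

    def getText(self):
        return self._text

    def toString(self):
        return "%s %r" % (self._action, self._attributes)


class Server:
    """Stand-in for galaxy.server.Server.

    It replays canned frames instead of opening a TCP socket.
    """

    def __init__(self, machine, port, ID, canned_frames=()):
        self.machine, self.port, self.ID = machine, port, ID
        self._pending = list(canned_frames)
        self.connected = False

    def connect(self):
        self.connected = True

    def processMessage(self, blocking):
        # Assumption: returns the next Frame, or None when none is pending.
        return self._pending.pop(0) if self._pending else None

    def disconnect(self):
        self.connected = False


def respond(frame):
    """Turn one semantic frame into an application-level reply."""
    if frame.getAction() == "set":
        return "Turning %s the %s in the %s" % (
            frame.getAttribute("value"),
            frame.getAttribute("object"),
            frame.getAttribute("room"),
        )
    return "Sorry, I did not understand: %s" % frame.getText()


def run(server):
    """Connect, handle frames until none remain, then disconnect."""
    server.connect()
    replies = []
    try:
        while True:
            frame = server.processMessage(True)
            if frame is None:
                break
            replies.append(respond(frame))
    finally:
        server.disconnect()
    return replies
```

With the real API, the stand-in classes would be replaced by imports from the `galaxy` package, and a blocking `processMessage` call would wait for the next utterance rather than return `None`.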
