W3C Multimodal Interaction Activity Dave Raggett W3C & Openwave Systems November 2002.
W3C Multimodal Interaction Activity
Dave Raggett <[email protected]>
W3C & Openwave Systems
November 2002
Multimodal Interaction
● The Web started with purely textual content
● Evolved to include graphics, forms, tables, richer layout, and client and server side scripting
● W3C Voice Browser Activity launched in 1999 to enable voice interaction via any telephone
● New challenge is to combine speech with other modes of interaction
● W3C Multimodal Interaction Activity launched in February 2002
Web pages you can speak to and gesture at
Multimodal Devices
● We have the hardware
● We have the networks (GPRS, W-CDMA, 1xRTT, 802.11)
● We now need to develop the standards to combine different modes of interaction!
[Photos: Bluetooth headset, SonyEricsson P800]
Multimodal Devices
● Emerging market for car-based systems
— 3x4 inch color display plus control buttons
— Speech input and output
— GPS satellite location, maps, and navigation planning
— Wireless access to traffic information
— Entertainment (radio, CD, DVD, television)
— Control and monitoring of many aspects of car's systems
— Telephone, mobile information and communication services
What Modes and Why?
● Display
– Persistent
– Enliven with graphics/animation (branding)
– Limited on small screens
– Color displays hard to read in sunlight
● Speech
– Transient
– Good for small devices
– Enliven with personalities and audio effects
– Enables Speaker verification
– Environmental/cultural issues
What Modes and Why?
● Keys
– Full size keyboards vs phone keypads
● Pen/Stylus
– Needs two hands vs thumbing on keypad
– Handwriting, drawings and gestures
– Special notations (mathematics, music, chemistry)
● Others
– Cameras (photos and videos)
– Audio recording and playback
– System sensors (location and movement)
– Haptic devices (BMW iDrive)
– Head and arm movements, facial gestures
When to use a given Mode?
● Social context
– Impolite
– Forbidden
– Culture specific
● Environmental Context
– Disabilities
– Too noisy to hear or speak
– Too bright to easily read screen
– Need to keep hands/eyes free (driving)
● Personal Preferences
– Static vs dynamic preferences
Device Capabilities
● Thick Clients (desktops)
— High powered device capable of large vocabulary speech recognition, full sized keyboard and large high resolution display
— Application runs locally
● Thin Clients (mass market mobile phones)
— Long battery life, limited processing power and memory, with small display, keypad and audio.
— Application largely runs in the network.
● Medium Clients (PDAs and high end phones)
— Moderate processing power and memory. Intermediate sized display, stylus/keypad and audio
— Limited local recognition for speech and ink
— Application distributed between device and network
Use Cases
● Airline reservation
— Making a reservation on your way to work: a form filling exercise where you get to choose the best modality according to whether you are on foot, in the car, on a train or have reached the office.
● Driving directions
— Keep your mind on the road while getting directions: combines speech and vision, for selections, directions and detours
● Name dialling
— Friends and colleagues call you by name, with customized call handling depending on time of day, what you are doing, and who is calling you. See caller before accepting the call, leave messages for exes to never call you again, ...
● Finding a hotel
— You've just arrived in town and need a hotel for the night. In this example you speak to a human assistant who guides you through the available choices according to your preferences: combines speech with visual feedback .... Demo ....
MMI Framework
[Diagram: User ↔ Input/Output ↔ Interaction Manager ↔ Application Functions]
Interaction Management
● Interprets user input
— Taking into account internal state
— Integrating inputs from different modalities
— Anaphora, deixis and ellipsis
— Anne asked Edward to pass her the salt.
— I want that one! (pointing to a toy in a display cabinet)
— Detecting inconsistencies
● Determines how to respond
— Update internal state and take appropriate actions
— May involve an explicit model of tasks and dialog history
● System vs User vs Mixed Initiative
— System driven dialog leads user by the hand ...
— User driven dialog: user commands via speech, key strokes, mouse clicks, ...
— Mixed initiative: both parties can take control from each other
— Current Web applications can be considered as examples of mixed initiative
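To make the initiative distinction above concrete, here is a minimal sketch (not from the slides; all names invented) of a mixed-initiative form-filling dialog manager: the system takes initiative by prompting for the first unfilled slot, while the user may take initiative by supplying any slots, in any order, in a single turn.

```python
# Hypothetical mixed-initiative form-filling sketch.
class FormDialog:
    def __init__(self, fields):
        # One slot per form field, initially unfilled.
        self.slots = {f: None for f in fields}

    def next_prompt(self):
        """System initiative: ask for the first unfilled slot."""
        for name, value in self.slots.items():
            if value is None:
                return f"What is your {name}?"
        return None  # form complete

    def user_input(self, **values):
        """User initiative: fill any slots, in any order."""
        for name, value in values.items():
            if name in self.slots:
                self.slots[name] = value

dialog = FormDialog(["origin", "destination", "date"])
print(dialog.next_prompt())                      # asks for origin first
dialog.user_input(destination="Boston", date="Friday")
print(dialog.next_prompt())                      # still needs origin
dialog.user_input(origin="London")
print(dialog.next_prompt())                      # None: form complete
```

A real interaction manager would also handle the inconsistency detection and dialog history mentioned above; this sketch only shows who holds the initiative at each turn.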
Speech is different!
● Speech is transient
— You hear it and it's gone!
● Speech is uncertain
— Recognizers make mistakes
● Knowing what to say
— Carefully chosen prompts guide you to respond appropriately
— Grammars guide recognition based upon expected responses
— Critical for robust speaker independent recognition
● Spoken dialog patterns
— e.g. navigation commands, tapered prompts, traffic lights model
● Visual feedback simplifies some aspects of dialog
— You see at a glance whether what you said was recognized correctly
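The "tapered prompts" pattern named above can be sketched in a few lines: each time the same field is revisited, the prompt gets progressively terser, since an experienced caller needs less guidance. The prompt texts below are invented for illustration.

```python
# Hedged sketch of tapered prompts; the wording is hypothetical.
TAPERED = [
    "Welcome. Please say the city you want to fly to, for example 'Boston'.",
    "Which city do you want to fly to?",
    "Destination?",
]

def prompt(attempt):
    """Return the prompt for the given visit, sticking at the tersest form."""
    return TAPERED[min(attempt, len(TAPERED) - 1)]

for i in range(4):
    print(prompt(i))   # the last two visits both get "Destination?"
```

The inverse pattern (escalating detail after recognition errors) works the same way with the list ordered from terse to verbose.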
Why speech requires a fundamentally different approach from graphical UIs
Single Document Model
● Combine XHTML with markup for speech, ink etc.
— Building upon XHTML Modularization
— XHTML + SRGS + SSML + XML Events etc.
— XHTML events trigger prompts and activate speech grammars
— onload, onunload, onmouseover, onclick, onfocus, onblur, ...
— Simple markup for common cases:
— Spoken commands to set focus, follow link, click button
— Break out to scripting when extra flexibility is needed
— Single document model is likely to be attractive to authors
● But there are problems ...
— How to execute in a distributed environment?
— Lack of declarative support for common dialog behaviors
— Mixes up presentation and logic
Dual Document Model
● XHTML in the device, VoiceXML in the network
— Coupled via events that trigger actions or notify changes
— Application state is duplicated in both systems
— Events are passed as messages from one system to the other
— Users can choose whether to speak, or to use keypad or stylus
— If the user types a value into a form field, the onchange event is intercepted to pass the new value to the VoiceXML interpreter to update the voice version of the form.
— If the user uses her voice to fill out the field, the VoiceXML interpreter sends a message to the XHTML interpreter to update the visual version of the form.
— Simpler to implement than single document approach
— Builds upon existing implementations of XHTML and VoiceXML
● May be able to exploit a shared data model
● But there are problems ...
— Having XHTML and VoiceXML in separate documents could make this harder for authors
— Mixes up presentation and logic
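The dual-document coupling above can be illustrated with a small simulation (all names invented; real systems would pass the events over a network transport). Two interpreters each hold a copy of the form state and notify the other of local changes, so a value entered by keypad and a value spoken into the VoiceXML side both end up in both copies.

```python
# Hypothetical sketch of dual-document state synchronization via events.
class Interpreter:
    def __init__(self, name):
        self.name = name
        self.fields = {}   # this interpreter's copy of the form state
        self.peer = None   # the interpreter in the other document

    def local_change(self, field, value):
        """User filled a field via this modality; notify the peer."""
        self.fields[field] = value
        self.peer.remote_change(field, value)

    def remote_change(self, field, value):
        """Peer reported a change; update our copy without echoing back."""
        self.fields[field] = value

xhtml = Interpreter("xhtml")        # GUI on the device
voicexml = Interpreter("voicexml")  # voice dialog in the network
xhtml.peer, voicexml.peer = voicexml, xhtml

xhtml.local_change("destination", "Boston")  # user typed on the keypad
voicexml.local_change("date", "Friday")      # user spoke the date
print(xhtml.fields == voicexml.fields)       # True: the copies stay in step
```

Note the asymmetry between `local_change` and `remote_change`: remote updates are applied without re-notifying, which is what prevents an event echo loop between the two documents.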
Separating Presentation and Logic
● Web server scripts play an important role in dialog
— XHTML and VoiceXML are commonly generated by server-side scripts
— User is led through a sequence of dynamically constructed pages
— Application logic is represented within server-side scripts
● Declarative Dialogs
— Declarative replacement for server-side scripts
— Decompose applications into smaller tasks
— Driven by goals and current application state
— Form filling metaphor (XForms)
— Plan generation and repair
— Presentation generated through template instantiation
— Gracefully copes with wide variety of device capabilities
● But ...
— Perhaps too ambitious, and best left until we have more experience
The Key Role of Events
● Events as the means for coupling components of distributed Multimodal Systems
— Functional specification as XML messages
— Agnostic wrt transport protocol (e.g. SIP Events)
— Support for Subscribe/Notify model
— Timestamps for synchronization (temporal/logical)
— Addressing (host/user/process/document/element)
— Efficient transmission via event chunking
● Author's perspective
— Scripting vs declarative event handlers
— High level declarations for implicit handlers
— Default handlers for specific events
— High level modality independent vs low level modality dependent
— Events which cause actions vs events that inform of changes
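The subscribe/notify model named above can be sketched as a tiny event bus. The transport is deliberately abstracted away (the slide is agnostic about it; SIP Events is one candidate), and each event carries a timestamp for synchronization. All names here are invented for illustration.

```python
# Hypothetical subscribe/notify event bus sketch.
import time

class EventBus:
    def __init__(self):
        self.subscribers = {}   # event type -> list of handlers

    def subscribe(self, event_type, handler):
        """Register a handler to be notified of events of this type."""
        self.subscribers.setdefault(event_type, []).append(handler)

    def notify(self, event_type, **payload):
        """Deliver a timestamped event to every subscriber of its type."""
        event = {"type": event_type, "timestamp": time.time(), **payload}
        for handler in self.subscribers.get(event_type, []):
            handler(event)

bus = EventBus()
received = []
bus.subscribe("onchange", received.append)
bus.notify("onchange", field="destination", value="Boston")
print(received[0]["value"])   # Boston
```

In a distributed system the handler would serialize the event as an XML message to a remote component instead of appending to a local list, but the subscribe/notify shape is the same.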
Event Groups
● XHTML Events
— onfocus, onblur, onchange, onload, onunload, onmouseover, onmouseout, onclick, onsubmit
● Document mutation events
— Load page, set field value, set focus, XUpdate
● Environment, System status, Personal events
— enable/disable modes, vehicle in motion/parked, location updates
● Speech and ink services
— enable/disable grammars, prompts, results, recording, verification
● Multimedia services
— play, pause, stop, rewind, fast forward
● Communication services
— Sessions: start/stop, join/leave
— Services: allocate/deallocate
— Presence updates: devices/people
Using Events
● Thin Client and Network based dialog manager
— Client runs GUI (XHTML Basic + CSS + Scripting)
— Bi-directional audio channel to network based speech engine
— XHTML events passed to network based dialog system
— Dialog system sends events to GUI to set focus, set field values, change to new page, or to mutate current page
● Author's perspective
— Scripting vs declarative event handlers
— High level declarations for implicit handlers
— Default handlers for specific events
— High level modality independent vs low level modality dependent
Input
● Natural Language Input
— Speech grammars + Acoustic models for speech recognition
— Ink grammars + stroke models for handwriting recognition
— Extracting semantic results via grammar rule annotations
— Results expressed as XML in application specific markup
● EMMA (Extensible Multi Modal Annotations)
— Interface between recognizers and dialog managers
— EMMA annotates XML application data with:
• Data model (link to external definition)
• Confidence scores
• Time stamps
• Alternative recognition hypotheses
• Sequential and Partial results
• May combine results from multiple input modes
— EMMA is based upon RDF (an RDF vocabulary)
Example
● 'I want to fly to Boston'
— N-Best list of alternatives:
• Destination: Boston, confidence 0.6
• Destination: Austin, confidence 0.4
— <destination>city name</destination>
<result xmlns:emma="http://www.w3.org/2002/emma"
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:alt>
    <rdf:li emma:confidence="60">
      <destination> Boston </destination>
    </rdf:li>
    <rdf:li emma:confidence="40">
      <destination> Austin </destination>
    </rdf:li>
  </rdf:alt>
</result>
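A sketch of how a dialog manager might consume an EMMA result like the one on this slide: parse the RDF alternatives and keep the hypothesis with the highest confidence score. The XML below is the slide's own example; the parsing code is illustrative, not part of any EMMA specification.

```python
# Hedged sketch: selecting the best hypothesis from an EMMA N-best result.
import xml.etree.ElementTree as ET

EMMA = """<result xmlns:emma="http://www.w3.org/2002/emma"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:alt>
    <rdf:li emma:confidence="60"><destination> Boston </destination></rdf:li>
    <rdf:li emma:confidence="40"><destination> Austin </destination></rdf:li>
  </rdf:alt>
</result>"""

NS = {"emma": "http://www.w3.org/2002/emma",
      "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#"}

def best_hypothesis(xml_text):
    """Return (slot name, value) for the highest-confidence alternative."""
    root = ET.fromstring(xml_text)
    alternatives = root.findall(".//rdf:li", NS)
    best = max(alternatives, key=lambda li: int(
        li.get("{http://www.w3.org/2002/emma}confidence")))
    slot = best[0]                    # the application-specific child element
    return slot.tag, slot.text.strip()

print(best_hypothesis(EMMA))   # ('destination', 'Boston')
```

A fuller consumer would also look at timestamps and partial results, and might hand the whole N-best list to the dialog manager rather than discarding the alternatives.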
Pen Input
● Pen input has been with us for a long time
— Apple Newton
— Took off with success of Palm Organizer and Pocket CE
— Cell phones like Motorola Accompli A6188, Alcatel One Touch COM, Kyocera pdQ, SonyEricsson R380, P800, Siemens Multimobile, Mitsubishi Trium, PC-ePhone
[Photos: Motorola, Alcatel, Kyocera, SonyEricsson, Siemens, Mitsubishi, PC-ePhone]
Pen Input
● Uses
— Gestures
— Drawings
— Note taking
— Handwriting recognition
— Signature verification
— Specialized notations: Math, Music, Chemistry, ...
● Why develop a standard ink format?
— Use of ink together with speech
— Server-side interpretation brings extra flexibility
— Passing ink to other people (drawings and notes)
● XML based format
— InkXML contribution (IBM, Intel, Motorola, International Unipen Foundation)
Driving Directions Use Case
Name Dialing Use Case
Form Filling Use Case
Multimodal Requirements
● Selecting/Constraining available modes
— Allowing user to select which modes are used by user/system
— Allowing system to constrain modes (road safety)
● Flexible use of output modes
— Using output modes in a complementary fashion
— Using output modes for redundancy
— Using modes in a sequential fashion
● Flexible use of input modes
— Allow input in choice of modes (voice or stylus/keypad)
— Allow for inputs combining more than one mode
— 'I want this one' while drawing circle on picture
— Opportunity for inconsistent inputs in different modes
— Saying yes while typing no
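The inconsistent-input case above ("saying yes while typing no") suggests a simple fusion rule, sketched here with invented names: inputs from different modalities for the same field are merged only when they agree, and a disagreement is flagged for the dialog manager to resolve, for example by re-prompting.

```python
# Hypothetical cross-modal fusion sketch with conflict detection.
def fuse(field, inputs):
    """inputs: list of (modality, value) pairs for the same field."""
    values = {value for _, value in inputs}
    if len(values) == 1:
        # All modalities agree: accept the value.
        return {"field": field, "value": values.pop(), "conflict": False}
    # Modalities disagree: surface the conflict instead of guessing.
    return {"field": field, "value": None, "conflict": True,
            "inputs": inputs}

print(fuse("confirm", [("voice", "yes"), ("keypad", "yes")]))
print(fuse("confirm", [("voice", "yes"), ("keypad", "no")]))
```

A real fusion component would weigh recognition confidence scores and timestamps (as carried by EMMA) rather than treating all modalities as equally reliable.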
Multimodal Requirements
● Support a wide range of device capabilities
● Allow for multiple users/devices/servers
— Using a telephone in combination with a desktop
— Assisting communication between human users
— Local or Network based speech and ink engines
— Local or Network based dialog management
● Allow integrated and distributed architectures
— Event handlers which are agnostic wrt source of event
— Modular markup supporting multiple architectures
● Increasing importance of SIP
— SIP is to communications applications as HTTP is to the Web
— SIP allows scripted multi-device/multi-server sessions
— SIP Events: as basis for coupling multimodal systems
Is Multimodal of Genuine Value?
● Won't offering a choice of modes confuse people?
— Convenience, e.g. ease of speaking versus thumbing input
— Changing situational context makes different modes attractive
— Legal requirements e.g. when driving
● Isn't this just technology for geeks?
— No, it will be used by everyone
— Well designed, it will make services more accessible
— Multimodal is especially suited to mobile
— Long term it could radically change how we interact with computers
● Are the markets ready for multimodal?
— Telecoms downturn has hurt rate of adoption of new technology
— Companies involved in W3C Multimodal working group: Avaya, Canon, Cisco, Comverse, France Telecom, Hewlett-Packard, IBM, Intel, Kirusa, Loquendo, Microsoft, Mitsubishi Electric, Motorola, NEC, Nokia, Nortel Networks, Nuance Communications, OnMobile Systems, Openwave, Opera Software, Philips Electronics, PipeBeach, Scansoft, Siemens, Snowshore Networks, Speechworks International, Sun Microsystems, T-Online International, Toyohashi University of Technology, V-Enable, VoiceGenie, and Voxeo
Research Challenges
● More robust speech recognition
— Picking nuances of speech out of the background
— Understanding the sound field
— Modeling the microphone and acoustic environment
— Richer models for later stages of speech recognition
— Combining stochastic and linguistic knowledge
— Something in between speaker dependent and independent ASR
● Declarative approaches for separating presentation and behavior
— The Web is just starting to separate presentation and data
— Plan based dialogs with tasks as objects? (TalkML2)
● Reducing the burden on the author
— Applications as a constellation of tasks
— Combining task specific with general knowledge
— Natural language understanding and common sense skills
— Role of self awareness (dialog memory + long term memory)
— Soft human knowledge versus crisp symbolic knowledge (Semantic Web)