Transcript of Tutorial: Developing and Deploying Multimodal Applications
James A. Larson, Larson Technical Services

Page 1

Tutorial

Developing and Deploying Multimodal Applications

James A. Larson, Larson Technical Services
jim@larson-tech.com

SpeechTEK West, February 23, 2007

Page 2

Developing and Deploying Multimodal Applications

What applications should be multimodal?

What is the multimodal application development process?

What standard languages can be used to develop multimodal applications?

What standard platforms are available for multimodal applications?

Page 3

Capturing Input from the User

Medium     Input Device            Mode
Acoustic   Microphone              Speech
Tactile    Keypad, Keyboard        Key
Tactile    Pen                     Ink
Tactile    Joystick, Mouse         GUI
Visual     Scanner, Still camera   Photograph
Visual     Video camera            Movie

Page 4

Capturing Input from the User: Multimodal

Medium      Input Device            Mode
Acoustic    Microphone              Speech
Tactile     Keypad, Keyboard        Key
Tactile     Pen                     Ink
Tactile     Joystick, Mouse         GUI
Visual      Scanner, Still camera   Photograph
Visual      Video camera            Gaze tracking, Gesture recognition, Biometric
Electronic  RFID, GPS               Digital data

Page 5

Presenting Output to the User

Medium    Output Device   Mode
Acoustic  Speaker         Speech
Visual    Display         Text, Photograph, Movie
Tactile   Joystick        Pressure

Page 6

Presenting Output to the User

Medium    Output Device   Mode
Acoustic  Speaker         Speech
Visual    Display         Text, Photograph, Movie
Tactile   Joystick        Pressure

Multimedia: combinations of the above output modes

Page 7

Multimodal and Multimedia Application Benefits

Provide a natural user interface by using multiple channels for user interactions

Simplify interaction on small portable devices that have limited keyboards and displays

Leverage advantages of different modes in different contexts

Decrease error rates and time required to perform tasks

Increase accessibility of applications for users with special needs

Enable new kinds of applications

Page 8

Exercise 1

What new multimodal applications would be useful for your work?

What new multimodal applications would be entertaining to you, your family, or friends?

Page 9

Voice as a “Third Hand”

Game Commander 3

• http://www.gamecommander.com/

Page 10

Voice-Enabled Games

ScanSoft’s VoCon Games Speech SDK

• http://www.scansoft.com/games/

• PlayStation® 2

• Nintendo® GameCube™

• http://www.omnipage.com/games/poweredby/

Page 11

Education

Tucker Maxon School of Oral Education
http://www.tmos.org/

Page 12

Education

Reading Tutor Project
http://cslr.colorado.edu/beginweb/reading/reading.html

Page 13

Multimodal Applications Developed by PSU and OHSU Students

Hands-busy

Troubleshooting a car’s motor

Repairing a leaky faucet

Tuning musical instruments

Construction

Complex origami artifact
Project book for children

Cooking—Talking recipe book

Entertainment

Child’s fairy tale book
Audio-controlled juke box
Games (Battleship, Go)

Page 14

Multimodal Applications Developed by PSU and OHSU Students (continued)

Data collection

Buy a car
Collect health data
Buy movie tickets
Order meals from a restaurant
Conduct banking business
Locate a business
Order a computer
Choose homeless pets from an animal shelter

Authoring

Photo album tour

Education

Flash cards—Addition tables

Download Opera and the speech plug-in
Go to www.larson-tech.com/mm-Projects/Demos.htm

Page 15

New Application Classes

Active listening

Verbal VCR controls: start, stop, fast forward, rewind, etc.

Virtual assistants

Listen for requests and immediately perform them

- Violin tuner
- TV controller
- Environmental controller
- Family-activity coordinator

Synthetic experiences

Synthetic interviews
Speech-enabled games
Education and training

Authoring content

Page 16

Two General Uses of Multiple Modes of Input

Redundancy—One mode acts as backup for another mode

In noisy environments, use keypad instead of speech input.

In cold environments, use speech instead of keypad.

Complementary—One mode supplements another mode

Voice as a third hand

“Move that (point) to there (point)” (late fusion)

Lip reading = video + speech (early fusion)

Page 17

Potential Problems with Multimodal Applications

Voice may make an application “noisy.”

• Privacy and security concerns

• Noise pollution

Sometimes speech and handwriting recognition systems fail.

False expectations of users wanting to use natural language.

Page 18


Full natural language processing requires:
• Knowledge of the outside world
• History of the user-computer interaction
• Sophisticated understanding of language structure

“Natural language-like” processing simulates natural language for a small domain, a short history, and specialized language structures.

Page 19


Full natural language processing is possible only on Star Trek; “natural language-like” processing is what is often incorrectly called “NLP.”

Page 20

Adding a New Mode to an Application

Only if…

The new mode enables new features not previously possible.

The new mode dramatically improves usability.

Always…

Redesign the application to take advantage of the new mode.

Provide backup for the new mode.

Test, test, and test some more.

Page 21

Exercise 2

Where will multimodal applications be used?

A. At home

B. At work

C. “On the road”

D. Other?

Page 22

Developing and Deploying Multimodal Applications

What applications should be multimodal?

What is the multimodal application development process?

What standard languages can be used to develop multimodal applications?

What standard platforms are available for multimodal applications?

Page 23

The Playbill—Who’s Who on the Team 

Users—Their lives will be improved by using the multimodal application

Interaction designer—Designs the dialog—when and how the user and system interchange requests and information

Multimodal programmer—Implements the VUI (voice user interface)

Voice talent—Records spoken prompts and messages

Grammar writer—Specifies words and phrases the user may speak in response to a prompt

TTS specialist—Specifies verbal and audio sounds and inflections

Quality assurance specialist—Performs tests to validate the application is both useful and usable

Customer—Pays the bills

Program manager—Organizes the work and makes sure it is completed on schedule and within budget

Page 24

Development Process

Investigation Stage

Design Stage

Development Stage

Testing Stage

Sustaining Stage

Each stage involves users

Iterative refinement

Page 25

Development Process

Investigation Stage

Design Stage

Development Stage

Testing Stage

Sustaining Stage

Identify the Application
• Conduct ethnography studies
• Identify candidate applications
• Conduct focus groups
• Select the application

Page 26

Page 27

Exercise 3

What will be the “killer” consumer multimodal applications?

Page 28

Development Process

Investigation Stage

Design Stage

Development Stage

Testing Stage

Sustaining Stage

Specify the Application
• Construct the conceptual model
• Construct scenarios
• Specify performance and preference requirements

Page 29

Specify Performance and Preference Requirements

Performance: Is the application useful?
• Measure what the users actually accomplished.
• Validate that the users achieved success.

Preference: Is the application enjoyable?
• Measure users’ likes and dislikes.
• Validate that the users enjoyed the application and will use it again.

Page 30

Performance Metrics

User Task                  Measure                                              Typical Criteria
Speak a command            Word error rate                                      Less than 3%
Supply values into a form  The caller enters valid values into each field       Less than 5 seconds per value
Navigate a list            The user successfully selects the specified option   Greater than 95%
Purchase a product         The user successfully completes the purchase         Greater than 93%

Page 31

Exercise 4

Specify performance metrics for the multimodal email application.

(Fill in a table with columns: User Task, Measure, Typical Criteria.)

Page 32

Preference Metrics

Question                                                               Typical Criteria
On a scale from 1 to 10, rate the help facility.                       The average caller score is greater than 8.
On a scale from 1 to 10, rate the ease of use of this application.     The average caller score is greater than 8.
Would you recommend using this voice portal to a friend?               Over 80% of callers respond by saying “yes.”
What would you be willing to pay each time you use this application?   Over 80% of callers are willing to pay $1.00 or more per use.

Page 33

Exercise 5

Specify preference metrics for the multimodal email application.

(Fill in a table with columns: Question, Typical Criteria.)

Page 34

Preference Metrics (Open-ended Questions)

What did you like the best about this voice-enabled application? (Do not change these features.)

What did you like the least about this voice-enabled application? (Consider changing these features.)

What new features would you like to have added? (Consider adding these features in this or a later release.)

What features do you think you will never use? (Consider deleting these features.)

Do you have any other comments and suggestions? (Pay attention to these responses. Callers frequently suggest very useful ideas.)

Page 35

Development Process

Investigation Stage

Design Stage

Development Stage

Testing Stage

Sustaining Stage

Develop the Application
• Specify the persona
• Specify the modes and modalities
• Specify the dialog script

Page 36

UI Design Guidelines

Guidelines for Voice User Interfaces

• Bruce Balentine and David P. Morgan. How to Build a Speech Recognition Application, Second Edition. http://www.eiginc.com

Guidelines for Graphical User Interfaces

• Research-Based Web Design and Usability Guidelines. U.S. Department of Health and Human Services. http://www.usability.gov/pdfs/guidelines.html

Guidelines for Multimodal User Interfaces

• Common Sense Guidelines for Developing Multimodal User Interfaces. W3C Working Group Note, 19 April 2006. http://www.w3.org/2002/mmi/Group/2006/Guidelines/

Page 37

Common-sense Suggestions
1. Satisfy Real-World Constraints

Task-oriented Guidelines

1.1. Guideline: For each task, use the easiest mode available on the device.

Physical Guidelines

1.2. Guideline: If the user’s hands are busy, then use speech.

1.3. Guideline: If the user’s eyes are busy, then use speech.

1.4. Guideline: If the user may be walking, use speech for input.

Environmental Guidelines

1.5. Guideline: If the user may be in a noisy environment, then use a pen, keys or mouse.

1.6. Guideline: If the user’s manual dexterity may be impaired, then use speech.

Page 38

Exercise 6

What input mode(s) should be used for each of the following tasks?

A. Selecting objects

B. Entering text

C. Entering symbols

D. Entering sketches or illustrations

Page 39

Common-sense Suggestions
2. Communicate Clearly, Concisely, and Consistently with Users

Consistency Guidelines

2.1. Phrase all prompts consistently.

2.2. Enable the user to speak keyword utterances rather than natural language sentences.

2.3. Switch presentation modes only when the information is not easily presented in the current mode.

2.4. Make commands consistent.

2.5. Make the focus consistent across modes.

Organizational Guidelines

2.6. Use audio to indicate the verbal structure.

2.7. Use pauses to divide information into natural “chunks.”

2.8. Use animation and sound to show transitions.

2.9. Use voice navigation to reduce the number of screens.

2.10. Synchronize multiple modalities appropriately.

2.11. Keep the user interface as simple as possible.

Page 40

Common-sense Suggestions
3. Help Users Recover Quickly and Efficiently from Errors

Conversational Guidelines

3.1. Users tend to use the same mode that was used to prompt them.

3.2. If privacy is not a concern, use speech as output to provide commentary or help.

3.3. Use directed user interfaces, unless the user is always knowledgeable and experienced in the domain.

3.4. Always provide context-sensitive help for every field and command.

Page 41

Common-sense Suggestions
3. Help Users Recover Quickly and Efficiently from Errors (continued)

Reliability Guidelines

Operational status

3.5. The user always should be able to determine easily if the device is listening to the user.

3.6. For devices with batteries, users always should be able to determine easily how much longer the device will be operational.

3.7. Support at least two input modes so one input mode can be used when the other cannot.

Visual feedback

3.8. Present words recognized by the speech recognition system on the display, so the user can verify they are correct.

3.9. Display the n-best list to enable easy speech recognition error correction

3.10. Try to keep response times less than 5 seconds. Inform the user of longer response times.

Page 42

Common-sense Suggestions
4. Make Users Comfortable

Listening mode

4.1. Speak after pressing a speak key, which automatically releases after the user finishes speaking.

System Status

4.2. Always present the current system status to the user.

Human-memory Constraints

4.3. Use the screen to ease stress on the user’s short-term memory.

Page 43

Common-sense Suggestions
4. Make Users Comfortable (continued)

Social Guidelines

4.4. If the user may need privacy, use a display rather than rendering speech.

4.5. If the user may need privacy, use a pen or keys.

4.6. If the device may be used during a business meeting, then use a pen or keys (with the keyboard sounds turned off).

Advertising Guidelines

4.7. Use animation and sound to attract the user’s attention.

4.8. Use landmarks to help the user know where he or she is.

Page 44

Common-sense Suggestions
4. Make Users Comfortable (continued)

Ambience

4.9. Use audio and graphic design to set the mood and convey emotion in games and entertainment applications.

Accessibility

4.10. For each traditional output technique, provide an alternative output technique.

4.11. Enable users to adjust the output presentation.

Page 45

Books

Ramon Lopez-Cozar Delgado and Masahiro Araki. Spoken, Multilingual and Multimodal Dialog Systems—Development and Assessment. West Sussex, England: Wiley, 2005.

Julie A. Jacko and Andrew Sears (Editors). The Human-Computer Interaction Handbook—Fundamentals, Evolving Technologies, and Emerging Applications. Mahwah, New Jersey: Lawrence Erlbaum Associates, 2003.

Page 46

Development Process

Investigation Stage

Design Stage

Development Stage

Testing Stage

Sustaining Stage

Test the Application
• Component test
• Usability test
• Stress test
• Field test

Page 47

Testing Resources

Jeffrey Rubin. Handbook of Usability Testing. New York: Wiley Technical Communication Library, 1994.

Peter and David Leppik. Gourmet Customer Service. Eden Prairie, MN: VocalLabs, 2005. [email protected]

Page 48

Development Process

Investigation Stage

Design Stage

Development Stage

Testing Stage

Sustaining Stage

Deploy and Monitor the Application
• User survey
• Usage reports from log files
• User feedback and comments

Page 49

Developing and Deploying Multimodal Applications

What applications should be multimodal?

What is the multimodal application development process?

What standard languages can be used to develop multimodal applications?

What standard platforms are available for multimodal applications?

Page 50

W3C Multimodal Interaction Framework

A general description of speech application components and how they relate, with a standard language for each:

• Recognition Grammar (SRGS)
• Semantic Interpretation (SISR)
• Extensible MultiModal Annotation (EMMA)
• Speech Synthesis (SSML)
• Interaction Managers

Page 51

W3C Multimodal Interaction Framework

[Diagram: input and output components connect the user to the Interaction Manager, which communicates with Application Functions and Telephony Properties.]

Page 52

W3C Multimodal Interaction Framework

[Diagram: user input flows through ASR and Ink recognition to Semantic Interpretation and Information Integration, then to the Interaction Manager, which invokes Application Functions. Output flows through Language Generation and Media Planning to TTS (audio) and the display; audio telephony functions handle the audio channel.]

Page 53

W3C Multimodal Interaction Framework

[Diagram: the framework diagram, highlighting the ASR component.]

SRGS: Describes what the user may say at each point in the dialog.

Page 54

Speech Recognition Engines

                                Low-end               High-end              Other
Speaking mode                   Isolated (discrete)   Continuous            Keywords
Enrollment                      Speaker dependent     Speaker independent   Adaptive
Vocabulary size                 Small                 Large                 Switch vocabularies
Speaking style                  Read                  Spontaneous
Number of simultaneous callers  Single-threaded       Multi-threaded


Page 56

Grammars

Describe what the user may say or handwrite at a point in the dialog

Enable the recognition engine to work faster and more accurately

Two types of grammars:
– Structured Grammar
– Statistical Grammar (N-grams)

Page 57

Structured Grammars

Specifies words that a user may speak or write

Two representation formats:

1. Augmented Backus-Naur Form (ABNF) production rules:

   Single_digit ::= zero | one | two | … | nine
   Zero_thru_ten ::= Single_digit | ten

2. XML format, which can be processed by an XML validator

Page 58

Example XML Grammar

<grammar mode = "voice" type = "application/srgs+xml" root = "zero_to_ten“>

<rule id = "zero_to_ten">       <one-of>              <ruleref uri = "#single_digit"/>              <item> ten </item>        </one-of></rule>

     <rule id = "single_digit">          <one-of>               <item> zero </item>               <item> one </item>               <item> two </item>               <item> three </item>               <item> four </item>               <item> five </item>               <item> six </item>               <item> seven </item>               <item> eight </item>              <item> nine </item>          </one-of>     </rule></grammar>

Page 59

Exercise 7

Write a grammar that recognizes the digits zero through nineteen

(Hint: Modify the previous page)
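One possible answer (a sketch only; “teens” is an illustrative rule name, and other factorings work equally well):

<grammar mode="voice" type="application/srgs+xml" root="zero_to_nineteen">
  <rule id="zero_to_nineteen">
    <one-of>
      <ruleref uri="#single_digit"/>
      <ruleref uri="#teens"/>
    </one-of>
  </rule>
  <rule id="teens">
    <one-of>
      <item> ten </item>
      <item> eleven </item>
      <item> twelve </item>
      <item> thirteen </item>
      <item> fourteen </item>
      <item> fifteen </item>
      <item> sixteen </item>
      <item> seventeen </item>
      <item> eighteen </item>
      <item> nineteen </item>
    </one-of>
  </rule>
  <rule id="single_digit">
    <one-of>
      <item> zero </item>
      <item> one </item>
      <item> two </item>
      <item> three </item>
      <item> four </item>
      <item> five </item>
      <item> six </item>
      <item> seven </item>
      <item> eight </item>
      <item> nine </item>
    </one-of>
  </rule>
</grammar>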

Page 60

Reusing Existing Grammars

<grammar type="application/srgs+xml"
         root="size"
         src="http://www.example.com/size.grxml"/>
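Such an external grammar is typically referenced from a dialog, for example inside a VoiceXML field (a sketch; the field name and prompt text are illustrative):

<field name="size">
  <prompt> What size would you like? </prompt>
  <grammar type="application/srgs+xml"
           src="http://www.example.com/size.grxml"/>
</field>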

Page 61

Exercise 8

Write a grammar for positive responses to a yes/no question (i.e., “yes,” “sure,” “affirmative,” and so forth)
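One possible answer (a sketch; the synonym list is illustrative and should be tuned to the application):

<grammar mode="voice" type="application/srgs+xml" root="positive">
  <rule id="positive">
    <one-of>
      <item> yes </item>
      <item> sure </item>
      <item> affirmative </item>
      <item> yeah </item>
      <item> of course </item>
    </one-of>
  </rule>
</grammar>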

Page 62

When Is a Grammar Too Large?

[Graph: the trade-off between word coverage and recognizer response as the grammar grows.]

Page 63

W3C Multimodal Interaction Framework

[Diagram: the framework diagram, highlighting the Semantic Interpretation component.]

SISR: A procedural JavaScript-like language for interpreting the text strings returned by the speech recognition engine.

Page 64

Semantic Interpretation

Semantic scripts employ ECMAScript

Advantages:
– Translates aliases to vocabulary words
– Performs calculations
– Produces a rich structure rather than a text string

Page 65

Semantic Interpretation

[Diagram: the user says “Big white t-shirt”; the recognizer, constrained by the grammar, returns the literal text to the conversation manager, which must still work out that “big” means the same as “large.”]

Page 66

Semantic Interpretation

[Diagram: the recognizer output passes through a semantic interpretation processor, driven by a grammar with semantic interpretation scripts, before reaching the conversation manager.]

<rule id="action">
  <one-of>
    <item> small <tag> out.size = "small"; </tag> </item>
    <item> medium <tag> out.size = "medium"; </tag> </item>
    <item> large <tag> out.size = "large"; </tag> </item>
    <item> big <tag> out.size = "large"; </tag> </item>
  </one-of>
  <one-of>
    <item> green <tag> out.color = "green"; </tag> </item>
    <item> blue <tag> out.color = "blue"; </tag> </item>
    <item> white <tag> out.color = "white"; </tag> </item>
  </one-of>
</rule>

Input: Big white t-shirt
Output: { size: "large", color: "white" }

Page 67

Exercise 9: Modify this rule to return only “yes”

<grammar type="application/srgs+xml" root="yes" mode="voice">
  <rule id="yes">
    <one-of>
      <item> yes </item>
      <item> sure </item>
      <item> affirmative </item>
      …
    </one-of>
  </rule>
</grammar>
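One possible answer (a sketch, using the same <tag> mechanism shown on the previous page): attach a semantic interpretation script to each item so that every positive phrase returns the single value “yes”:

<grammar type="application/srgs+xml" root="yes" mode="voice">
  <rule id="yes">
    <one-of>
      <item> yes <tag> out = "yes"; </tag> </item>
      <item> sure <tag> out = "yes"; </tag> </item>
      <item> affirmative <tag> out = "yes"; </tag> </item>
    </one-of>
  </rule>
</grammar>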

Page 68

W3C Multimodal Interaction Framework

[Diagram: the framework diagram, highlighting the Information Integration component.]

EMMA: A language for representing the semantic content from speech recognizers, handwriting recognizers, and other input devices.

Page 69

EMMA

Extensible MultiModal Annotation markup language

A canonical structure for semantic interpretations of a variety of inputs, including:

• Speech

• Natural language text

• GUI

• Ink

Page 70

EMMA

[Diagram: speech input passes through speech recognition (driven by a grammar with semantic interpretation instructions) and keyboard input through keyboard interpretation; each produces EMMA, which merging/unification combines into a single EMMA result for applications.]

Page 71

EMMA

[Diagram: the same merging pipeline, with the speech input interpreted as:]

<interpretation mode="speech">
  <travel>
    <to hook="ink"/>
    <from hook="ink"/>
    <day> Tuesday </day>
  </travel>
</interpretation>

Page 72

EMMA

[Diagram: the same merging pipeline, with the speech and ink inputs interpreted as:]

<interpretation mode="speech">
  <travel>
    <to hook="ink"/>
    <from hook="ink"/>
    <day> Tuesday </day>
  </travel>
</interpretation>

<interpretation mode="ink">
  <travel>
    <to> Las Vegas </to>
    <from> Portland </from>
  </travel>
</interpretation>

Page 73

<interpretation mode = "speech"> <travel> <to hook="ink"/> <from hook="ink"/> <day> Tuesday </day> </travel></interpretation>

<interpretation mode = "ink"> <travel> <to>Las Vegas </to> <from>Portland </from> </travel></interpretation>

EMMA

Keyboard Interpretation

SpeechRecognition

Merging/Unification

Speech Keyboard

EMMA EMMA

EMMA

Grammar+ Semantic

InterpretationInstructions

InterpretationInstructions

Applications

<interpretation mode = "interp1"> <travel> <to> Las Vegas </to> <from> Portland </from> <day> Tuesday </day> </travel></interpretation>

Page 74

Exercise 10

<interpretation mode = "speech"> <moneyTransfer> <sourceAcct hook="ink"/> <targetAcct hook="ink"/> <amount> 300 </amount> </moneyTransfer></interpretation>

<interpretation mode = "ink"> <moneyTransfer> <sourceAcct> savings </sourceAcct> <targetAcct> checking</targetAcct> </moneyTransfer></interpretation>

Given the following two EMMA specifications, what is the unified EMMA specification?

<interpretation mode ="intp1"> <moneyTransfer> <sourceAcct> ______ </sourceAcct> <targetAcct> _______</targetAcct> <amount> ______ </amount> </moneyTransfer></interpretation>

Unified EMMA specification:
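For reference, unifying the two inputs above yields (one possible answer):

<interpretation mode="intp1">
  <moneyTransfer>
    <sourceAcct> savings </sourceAcct>
    <targetAcct> checking </targetAcct>
    <amount> 300 </amount>
  </moneyTransfer>
</interpretation>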

Page 75

W3C Multimodal Interaction Framework

[Diagram: the framework diagram, highlighting the TTS component.]

SSML: A language for rendering text as synthesized speech.

Page 76

Speech Synthesis Markup Language

Processing pipeline: Structure Analysis → Text Normalization → Text-to-Phoneme Conversion → Prosody Analysis → Waveform Production

Structure analysis
  Markup support: paragraph, sentence
  Non-markup behavior: infer structure by automated text analysis

Text normalization
  Markup support: say-as for dates, times, etc.
  Non-markup behavior: automatically identify and convert constructs

Text-to-phoneme conversion
  Markup support: phoneme, say-as
  Non-markup behavior: look up in pronunciation dictionary

Prosody analysis
  Markup support: emphasis, break, prosody
  Non-markup behavior: automatically generate prosody through analysis of document structure and sentence syntax

Page 77

Speech Synthesis Markup Language: Examples

<phoneme alphabet="ipa" ph="wɪnɛfɛks"> WinFX </phoneme>
is a great platform

<prosody pitch="x-low"> Who’s been sleeping in my bed? </prosody> said papa bear.
<prosody pitch="medium"> Who’s been sleeping in my bed? </prosody> said momma bear.
<prosody pitch="x-high"> Who’s been sleeping in my bed? </prosody> said baby bear.

Page 78

Popular Strategy

Develop dialogs using SSML

Usability test dialogs

Extract prompts

Hire voice talent to record prompts

Replace <prompt> with <audio>
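A sketch of this final step (the file name is illustrative): in VoiceXML the recording replaces the synthesized prompt, and the original text can be kept inside <audio> as a fallback in case the audio file cannot be fetched:

Before:  <prompt> Say a name </prompt>
After:   <prompt> <audio src="say_a_name.wav"> Say a name </audio> </prompt>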

Page 79

W3C Multimodal Interaction Framework

[Diagram: the framework diagram, highlighting the Interaction Manager component.]

VXML: A language for controlling the exchange of information and commands between the user and the system.

Page 80

Developing and Deploying Multimodal Applications

What applications should be multimodal?

What is the multimodal application development process?

What standard languages can be used to develop multimodal applications?

What standard platforms are available for multimodal applications?

Page 81

Speech APIs and SDKs

• JSAPI—Java Speech Application Program Interface
  – http://java.sun.com/products/java-media/speech/
  – http://developer.mozilla.org/en/docs/JSAPI_Reference

• Nuance Mobile Speech Platform
  – http://www.nuance.com/speechplatform/components.asp

• VSAPI—Voice Signal API
  – http://www.voicesignal.com/news/articles/2006-06-21-SymbianOne.htm

• SALT
  – http://www.saltforum.org/

Page 82

Interaction Manager Approaches

• X+V: Interaction Manager (XHTML) with VoiceXML 2.0 modules

• Object-oriented: Interaction Manager (C#) with SAPI 5.3

• W3C: Interaction Manager (SCXML) coordinating XHTML, VoiceXML 3.0, and InkML

Page 83

Interaction Manager Approaches

• X+V: Interaction Manager (XHTML) with VoiceXML 2.0 modules

• Object-oriented: Interaction Manager (C#) with SAPI 5.3

• W3C: Interaction Manager (SCXML) coordinating XHTML, VoiceXML 3.0, and InkML

Page 84

SAPI 5.3 & Windows Vista™: Speech Synthesis

W3C Speech Synthesis Markup Language 1.0:

<speak>
  <phoneme alphabet="ipa" ph="wɪnɛfɛks"> WinFX </phoneme>
  is a great platform
</speak>

Microsoft proprietary PromptBuilder:

myPrompt.AppendTextWithPronunciation("WinFX", "wɪnɛfɛks");
myPrompt.AppendText("is a great platform.");

Page 85

SAPI 5.3 & Windows Vista™: Speech Recognition

W3C Speech Recognition Grammar Specification 1.0:

<grammar type="application/srgs+xml" root="city" mode="voice">
  <rule id="city">
    <one-of>
      <item> New York City </item>
      <item> New York </item>
      <item> Boston </item>
    </one-of>
  </rule>
</grammar>

Microsoft proprietary GrammarBuilder:

Choices cityChoices = new Choices();
cityChoices.AddPhrase("New York City");
cityChoices.AddPhrase("New York");
cityChoices.AddPhrase("Boston");
Grammar cityGrammar = new Grammar(new GrammarBuilder(cityChoices));

Page 86

SAPI 5.3 & Windows Vista™: Semantic Interpretation

Augment an SRGS grammar with JScript® for semantic interpretation:

<grammar type="application/srgs+xml" root="city" mode="voice">
  <rule id="city">
    <one-of>
      <item> New York City <tag> city = "JFK" </tag> </item>
      <item> New York <tag> city = "JFK" </tag> </item>
      <item> Portland <tag> city = "PDX" </tag> </item>
    </one-of>
  </rule>
</grammar>

User-specified “shortcuts”: the recognizer replaces a shortcut word with an expanded string.

User says: my address
System enters: 1033 Smith Street, Apt. 7C, Bloggsville 00000

Page 87

SAPI 5.3 & Windows Vista™: Dialog

1. Import the System.Speech.Recognition namespace

2. Instantiate a SpeechRecognizer object

3. Build a grammar

4. Attach an event handler

5. Load the grammar into the recognizer

6. When the recognizer hears something that fits the grammar, the SpeechRecognized event handler is invoked, which accesses the Result object and works with the recognized text

Page 88

SAPI 5.3 & Windows Vista™: Dialog

using System;
using System.Windows.Forms;
using System.ComponentModel;
using System.Collections.Generic;
using System.Speech.Recognition;

namespace Reco_Sample_1
{
    public partial class Form1 : Form
    {
        // Create a recognizer
        SpeechRecognizer _recognizer = new SpeechRecognizer();

        public Form1() { InitializeComponent(); }

        private void Form1_Load(object sender, EventArgs e)
        {
            // Create a pizza grammar
            Choices pizzaChoices = new Choices();
            pizzaChoices.AddPhrase("I'd like a cheese pizza");
            pizzaChoices.AddPhrase("I'd like a pepperoni pizza");
            pizzaChoices.AddPhrase("I'd like a large pepperoni pizza");
            pizzaChoices.AddPhrase("I'd like a small thin crust vegetarian pizza");

            Grammar pizzaGrammar =
                new Grammar(new GrammarBuilder(pizzaChoices));

            // Attach an event handler
            pizzaGrammar.SpeechRecognized +=
                new EventHandler<RecognitionEventArgs>(
                    PizzaGrammar_SpeechRecognized);

            _recognizer.LoadGrammar(pizzaGrammar);
        }

        void PizzaGrammar_SpeechRecognized(
            object sender, RecognitionEventArgs e)
        {
            MessageBox.Show(e.Result.Text);
        }
    }
}

Page 89

SAPI 5.3 & Windows Vista™: References

Speech API Overview

http://msdn2.microsoft.com/en-us/library/ms720151.aspx#API_Speech_Recognition

Microsoft Speech API (SAPI) 5.3

http://msdn2.microsoft.com/en-us/library/ms723627.aspx

“Exploring New Speech Recognition And Synthesis APIs In Windows Vista” by Robert Brown

http://msdn.microsoft.com/msdnmag/issues/06/01/speechinWindowsVista/default.aspx#Resources

Page 90

Interaction Manager Approaches

• X+V: Interaction Manager (XHTML) with VoiceXML 2.0 modules

• Object-oriented: Interaction Manager (C#) with SAPI 5.3

• W3C: Interaction Manager (SCXML) coordinating XHTML, VoiceXML 3.0, and InkML

Page 91

Step 1: Start with Standard VoiceXML and Standard XHTML

VoiceXML (using the W3C grammar language):

<form id="topform">
  <field name="city">
    <prompt>Say a name</prompt>
    <grammar src="city.grxml"/>
  </field>
</form>

XHTML:

<form>
  Result: <input type="text" name="in1"/>
</form>

Page 92

Step 2: Combine

<html xmlns="http://www.w3.org/1999/xhtml">

<head>
  <form id="topform">
    <field name="city">
      <prompt>Say a name</prompt>
      <grammar src="city.grxml"/>
    </field>
  </form>
</head>

<body>
  <form>
    Result: <input type="text" name="in1"/>
  </form>
</body>

</html>

Page 93

Step 3: Insert vxml Namespace

<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml">

<head>
  <vxml:form id="topform">
    <vxml:field name="city">
      <vxml:prompt>Say a name</vxml:prompt>
      <vxml:grammar src="city.grxml"/>
    </vxml:field>
  </vxml:form>
</head>

<body>
  <form>
    Result: <input type="text" name="in1"/>
  </form>
</body>

</html>

Page 94

Step 4: Insert Event

<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events">

<head>
  <vxml:form id="topform">
    <vxml:field name="city">
      <vxml:prompt>Say a name</vxml:prompt>
      <vxml:grammar src="city.grxml"/>
    </vxml:field>
  </vxml:form>
</head>

<body>
  <form ev:event="load" ev:handler="#topform">
    Result: <input type="text" name="in1"/>
  </form>
</body>

</html>

Page 95

Step 5: Insert <sync>

<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events"
      xmlns:xv="http://www.w3.org/2002/xhtml+voice">

<head>
  <xv:sync xv:input="in1" xv:field="#result"/>
  <vxml:form id="topform">
    <vxml:field name="city" xv:id="result">
      <vxml:prompt>Say a name</vxml:prompt>
      <vxml:grammar src="city.grxml"/>
    </vxml:field>
  </vxml:form>
</head>

<body>
  <form ev:event="load" ev:handler="#topform">
    Result: <input type="text" name="in1"/>
  </form>
</body>

</html>

Page 96

XHTML plus Voice (X+V) References

• Available on:
  – ACCESS Systems’ NetFront Multimodal Browser for PocketPC 2003
    http://www-306.ibm.com/software/pervasive/multimodal/?Open&ca=daw-prod-mmb
  – Opera Software Multimodal Browser for Sharp Zaurus
    http://www-306.ibm.com/software/pervasive/multimodal/?Open&ca=daw-prod-mmb
  – Opera 9 for Windows
    http://www.opera.com/

• Programmers Guide:
  ftp://ftp.software.ibm.com/software/pervasive/info/multimodal/XHTML_voice_programmers_guide.pdf

• A variety of small illustrative applications:
  http://www.larson-tech.com/MM-Projects/Demos.htm

Page 97

Exercise 11

Specify the X+V notation for integrating the following VoiceXML and XHTML code by completing the code on the next page.

VoiceXML:

<form id="stateForm">
  <field name="state">
    <prompt>Say a state name</prompt>
    <grammar src="state.grxml"/>
  </field>
</form>

XHTML:

<form>
  Result: <input type="text" name="in1"/>
</form>

Page 98

Exercise 11 (continued)

<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events"
      xmlns:xv="http://www.w3.org/2002/xhtml+voice">

<head>
  <xv:sync xv:input="_______" xv:field="________"/>
  <vxml:form id="________">
    <vxml:field name="state" xv:id="________">
      <vxml:prompt>Say a state name</vxml:prompt>
      <vxml:grammar src="state.grxml"/>
    </vxml:field>
  </vxml:form>
</head>

<body>
  <form ev:event="load" ev:handler="#________">
    Result: <input type="text" name="_______"/>
  </form>
</body>

</html>
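For reference, one way to fill in the blanks, following the same pattern as Step 5:

<head>
  <xv:sync xv:input="in1" xv:field="#result"/>
  <vxml:form id="stateForm">
    <vxml:field name="state" xv:id="result">
      <vxml:prompt>Say a state name</vxml:prompt>
      <vxml:grammar src="state.grxml"/>
    </vxml:field>
  </vxml:form>
</head>

<body>
  <form ev:event="load" ev:handler="#stateForm">
    Result: <input type="text" name="in1"/>
  </form>
</body>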

Page 99

Interaction Manager Approaches

• X+V: Interaction Manager (XHTML) with VoiceXML 2.0 modules

• Object-oriented: Interaction Manager (C#) with SAPI 5.3

• W3C: Interaction Manager (SCXML) coordinating XHTML, VoiceXML 3.0, and InkML

Page 100

MMI Architecture—4 Basic Components

• Runtime Framework or Browser—initializes the application and interprets the markup

• Interaction Manager—coordinates modality components and provides application flow

• Modality Components—provide modality capabilities such as speech, pen, keyboard, mouse

• Data Model—handles shared data

[Diagram: the Interaction Manager (SCXML) coordinates the XHTML, VoiceXML 3.0, and InkML modality components and a shared Data Model.]

Page 101

Multimodal Architecture and Interfaces

• A loosely-coupled, event-based architecture for integrating multiple modalities into applications

• All communication is event-based

• Based on a set of standard life-cycle events

• Components can also expose other events as required

• Encapsulation protects component data

• Encapsulation enhances extensibility to new modalities

• Can be used outside a Web environment

[Diagram: XHTML, VoiceXML 3.0, and InkML modality components coordinated by an Interaction Manager (SCXML) over a shared Data Model]
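To make the event-based communication concrete, here is a minimal sketch of what a lifecycle event might look like as XML. The attribute names (context, source, target, requestID, contentURL) follow the general shape of the MMI working draft's lifecycle events, but the exact markup and the namespace are illustrative assumptions, not the normative syntax.

<!-- Illustrative only: a "prepare" lifecycle event from the Interaction
     Manager asking the voice modality to load a VoiceXML document.
     The namespace and attribute names are assumptions, not the
     normative MMI syntax. -->
<mmi:prepare xmlns:mmi="http://example.org/mmi"
             mmi:context="ctx-1"
             mmi:source="IM"
             mmi:target="voiceModality"
             mmi:requestID="req-42"
             mmi:contentURL="hello.vxml"/>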

Page 102

Specify Interaction Manager Using Harel State Charts

Extension of state transition systems

• States

• Transitions

• Nested state-transition systems

• Parallel state-transition systems

• History

[State chart: PrepareState → StartState on PrepareResponse(success), StartState → WaitState on StartResponse, WaitState → EndState on DoneSuccess; PrepareResponse(fail), StartFail, and DoneFail lead to FailState]

Page 103

Example State Transition System

State Chart XML (SCXML)

<state id="PrepareState">

<send event="prepare" contentURL="hello.vxml"/>

<transition event="prepareResponse" cond="status='success'" target="StartState"/>

<transition event="prepareResponse" cond="status='failure'" target="FailState"/>

</state>
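Extending the fragment above, a minimal sketch of the complete state machine from the diagram might read as follows. It assumes the diagram's state and event names, and it places <send> inside <onentry>, following the SCXML working draft; the remaining attribute details are illustrative.

<!-- A minimal sketch, assuming the state/event names from the diagram.
     EndState and FailState are terminal here; attribute details are
     assumptions against the SCXML working draft. -->
<scxml xmlns="http://www.w3.org/2005/07/scxml" initialstate="PrepareState">

  <state id="PrepareState">
    <onentry>
      <send event="prepare" contentURL="hello.vxml"/>
    </onentry>
    <transition event="prepareResponse" cond="status='success'" target="StartState"/>
    <transition event="prepareResponse" cond="status='failure'" target="FailState"/>
  </state>

  <state id="StartState">
    <onentry>
      <send event="start"/>
    </onentry>
    <transition event="startResponse" target="WaitState"/>
    <transition event="startFail" target="FailState"/>
  </state>

  <state id="WaitState">
    <transition event="doneSuccess" target="EndState"/>
    <transition event="doneFail" target="FailState"/>
  </state>

  <state id="EndState"/>
  <state id="FailState"/>

</scxml>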

[Same state chart as the previous slide]

Page 104

Example State Chart with Parallel States

[State chart with two parallel regions: a Voice region (PrepareVoice → StartVoice → WaitVoice → EndVoice, with FailVoice reached on PrepareResponseFail, StartFail, or DoneFail) and a GUI region (PrepareGUI → StartGUI → WaitGUI → EndGUI, with FailGUI reached the same way); each region advances on PrepareResponse Success, StartResponse, and DoneSuccess events]
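A minimal SCXML sketch of the two parallel regions, using the <parallel> element, might look like the following. The state names come from the diagram; the event names (with .voice and .gui suffixes) and other details are assumptions.

<!-- Sketch only: parallel Voice and GUI regions using <parallel>.
     State names follow the diagram; event suffixes and attribute
     details are illustrative assumptions. -->
<scxml xmlns="http://www.w3.org/2005/07/scxml" initialstate="Run">
  <parallel id="Run">

    <state id="Voice" initialstate="PrepareVoice">
      <state id="PrepareVoice">
        <transition event="prepareResponseSuccess.voice" target="StartVoice"/>
        <transition event="prepareResponseFail.voice" target="FailVoice"/>
      </state>
      <state id="StartVoice">
        <transition event="startResponse.voice" target="WaitVoice"/>
        <transition event="startFail.voice" target="FailVoice"/>
      </state>
      <state id="WaitVoice">
        <transition event="doneSuccess.voice" target="EndVoice"/>
        <transition event="doneFail.voice" target="FailVoice"/>
      </state>
      <state id="EndVoice"/>
      <state id="FailVoice"/>
    </state>

    <state id="GUI" initialstate="PrepareGUI">
      <state id="PrepareGUI">
        <transition event="prepareResponseSuccess.gui" target="StartGUI"/>
        <transition event="prepareResponseFail.gui" target="FailGUI"/>
      </state>
      <state id="StartGUI">
        <transition event="startResponse.gui" target="WaitGUI"/>
        <transition event="startFail.gui" target="FailGUI"/>
      </state>
      <state id="WaitGUI">
        <transition event="doneSuccess.gui" target="EndGUI"/>
        <transition event="doneFail.gui" target="FailGUI"/>
      </state>
      <state id="EndGUI"/>
      <state id="FailGUI"/>
    </state>

  </parallel>
</scxml>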

Page 105

The Life Cycle Events

[Diagram: the Interaction Manager sends prepare, start, cancel, pause, and resume events to the GUI and VUI modality components; each component replies with the matching prepareResponse, startResponse, cancelResponse, pauseResponse, or resumeResponse event]

Page 106

More Life Cycle Events

[Diagram: newContextRequest / newContextResponse exchanged between the Interaction Manager and the GUI and VUI; data events flow in both directions; the GUI signals done when finished; the Interaction Manager issues clearContext to both modalities]

Page 107

Synchronization Using the Lifecycle Data Event

• Intent-based events
  – Capture the underlying intent rather than the physical manifestation of user-interaction events
  – Independent of the physical characteristics of particular devices

• Data/reset – Reset one or more field values to null

• Data/focus – Focus on another field

• Data/change – A field value has changed

[Diagram: data events exchanged between the Interaction Manager and the GUI and VUI modality components]
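As an illustration, a data/change notification from the GUI to the Interaction Manager might carry a payload like the sketch below. The wrapper element and attribute names are hypothetical, since the slides do not fix a concrete syntax for the data event.

<!-- Hypothetical payload for a data/change event: the GUI tells the
     Interaction Manager that the "city" field now holds "Portland".
     Element and attribute names are illustrative, not from the spec. -->
<data type="change" source="GUI" target="IM">
  <field name="city" value="Portland"/>
</data>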

Page 108

Lifecycle Events between the Interaction Manager and a Modality

[Diagram: the Interaction Manager's state chart (PrepareState → StartState → WaitState → EndState, with FailState) annotated with the lifecycle events exchanged with the modality: prepare and prepare response (success or failure), start and start response (success or failure), data events while waiting, and done at the end]

Page 109

MMI Architecture Principles

• Runtime Framework communicates with Modality Components through asynchronous events

• Modality Components don’t communicate directly with each other, but indirectly through the Runtime Framework

• Components must implement the basic life cycle events and may expose other events

• Modality components can be nested (e.g. a Voice Dialog component like a VoiceXML <form>)

• Components need not be markup-based

• EMMA communicates users’ inputs to the Interaction Manager

Page 110

Modalities

• GUI Modality (XHTML)
  – An adapter converts lifecycle events to XHTML events
  – XHTML events are converted back to lifecycle events

• Voice Modality (VoiceXML 3.0)
  – Lifecycle events are embedded directly into VoiceXML 3.0

[Diagram: XHTML and VoiceXML 3.0 modality components coordinated by an Interaction Manager (SCXML) over a shared Data Model]

Page 111

Exercise 12

What should VoiceXML do when it receives each of the following events?

A. Reset

B. Change

C. Focus

Page 112

Modalities: VoiceXML 3.0 will support lifecycle events.

<form>
  <catch name="change">
    <assign name="city" value="data"/>
  </catch>

  <field name="city">
    <prompt> Blah </prompt>
    <grammar src="city.grxml"/>
    <filled>
      <send event="data.change" data="city"/>
    </filled>
  </field>
</form>

[Diagram: XHTML and VoiceXML 3.0 modality components coordinated by an Interaction Manager (SCXML) over a shared Data Model]
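For the other half of this exchange, the sketch below shows one way the interaction manager's SCXML might route the data.change event on to the GUI modality. The <send> attributes beyond event are assumptions layered on the SCXML working draft.

<!-- Sketch: the Interaction Manager forwards a data.change event coming
     from the voice modality on to the GUI. The target and namelist
     attributes are illustrative assumptions. -->
<state id="WaitState">
  <transition event="data.change">
    <send event="data" target="GUI" namelist="city"/>
  </transition>
</state>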

Page 113

Exercise 13

What should HTML do when it receives each of the following events?

A. Reset

B. Change

C. Focus

Page 114

Modalities: XHTML is extended to send lifecycle events to the interaction manager.

<head>
  ...
  <ev:listener ev:event="change" ev:observer="app1" ev:handler="#onChangeHandler"/>
  ...
  <script type="text/javascript">
    // Pseudo-code: "post" stands for sending a lifecycle data event to the interaction manager
    function onChangeHandler() { post("data", { change: "city" }); }
  </script>
</head>

<body id="app1">
  <input type="text" id="city" value=""/>
</body>

[Diagram: XHTML and VoiceXML 3.0 modality components coordinated by an Interaction Manager (SCXML) over a shared Data Model]

Page 115

Modalities: XHTML is extended to handle lifecycle events sent to a modality.

<head>
  ...
  <handler type="text/javascript" ev:event="data">
    // Pseudo-code: on a data/change event, update the city field
    if (event.type == "change") { document.app1.city.value = data.city; }
  </handler>
  ...
</head>

<body id="app1">
  <input type="text" id="city" value=""/>
</body>

[Diagram: XHTML and VoiceXML 3.0 modality components coordinated by an Interaction Manager (SCXML) over a shared Data Model]

Page 116

References

• SCXML
  – Second working draft available at http://www.w3.org/TR/2006/WD-scxml-20060124/
  – Open source available from http://jakarta.apache.org/commons/sandbox/scxml/

• Multimodal Architecture and Interfaces
  – Working draft available at http://www.w3.org/TR/2006/WD-mmi-arch-20060414/

• Voice Modality
  – First working draft of VoiceXML 3.0 scheduled for November 2007

• XHTML
  – Full Recommendation
  – Adapters must be hand-coded

• Other modalities
  – TBD

Page 117

Comparison

Object-oriented
  – Standard languages: SRGS, SISR, SSML
  – Interaction manager: C#
  – Modes: GUI, speech

X+V
  – Standard languages: VoiceXML, SRGS, SSML, SISR, XHTML
  – Interaction manager: XHTML
  – Modes: GUI, speech

W3C
  – Standard languages: SCXML, SRGS, VoiceXML, SSML, SISR, XHTML, EMMA, CCXML
  – Interaction manager: SCXML
  – Modes: GUI, speech, ink, …

Page 118

Availability

SAPI 5.3
  – Microsoft Windows Vista®

X+V
  – ACCESS Systems' NetFront Multimodal Browser for PocketPC 2003
    http://www-306.ibm.com/software/pervasive/multimodal/?Open&ca=daw-prod-mmb
  – Opera Software Multimodal Browser for Sharp Zaurus
    http://www-306.ibm.com/software/pervasive/multimodal/?Open&ca=daw-prod-mmb
  – Opera 9 for Windows
    http://www.opera.com/

W3C
  – First working draft of VoiceXML 3.0 not yet available
  – Working drafts of SCXML are available; some open-source implementations are available

Proprietary APIs
  – Available from the vendor

Page 119

Discussion Question

Should a developer insert SALT tags or X+V modules into an existing Web page without redesigning the Web page?

Page 120

Conclusion

• Multimodal applications offer benefits over today's traditional GUIs.

• Use multiple modes only where there is a clear benefit.

• Standard languages are available today for developing multimodal applications.

• Don't reinvent the wheel.

• Creativity and lots of usability testing are necessary to create world-class multimodal applications.

Page 121

Web Resources

http://www.w3.org/voice
  – Specifications of the grammar, semantic interpretation, and speech synthesis languages

http://www.w3.org/2002/mmi
  – Specifications of the EMMA and InkML languages

http://www.microsoft.com (and query SALT)
  – SALT specification and download instructions for adding SALT to Internet Explorer

http://www-306.ibm.com/software/pervasive/multimodal/
  – X+V specification; download the Opera and ACCESS browsers

http://www.larson-tech.com/SALT/ReadMeFirst.html
  – Student projects using SALT to develop multimodal applications

http://www.larson-tech.com/MMGuide.html or http://www.w3.org/2002/mmi/Group/2006/Guidelines/
  – User interface guidelines for multimodal applications

Page 122

Status of W3C Multimodal Interface Languages

[Chart: languages plotted along the W3C standardization track, from Requirements and Working Draft through Last Call Working Draft, Candidate Recommendation, and Proposed Recommendation to Recommendation. Languages shown: VoiceXML 2.0, VoiceXML 2.1, Speech Recognition Grammar Format (SRGS) 1.0, Speech Synthesis Markup Language (SSML) 1.0, Semantic Interpretation of Speech Recognition (SISR) 1.0, Extensible MultiModal Annotation (EMMA) 1.0, State Chart XML (SCXML) 1.0, and InkML 1.0]

Page 123

Questions?

Page 124

Answer to Exercise 5

Rankings (1 = best) of each input modality for each content-manipulation task:

Select objects
  (1) Pen: point to or circle the object
  (2) Mouse/joystick: point to and click on the object, or drag to select text
  (3) Voice: speak the name of the object
  (4) Keyboard/keypad: press keys to position the cursor on the object and press the select key

Enter text
  (1) Keyboard/keypad: press keys to spell the words in the text
  (2) Voice: speak the words in the text
  (3) Pen: write the text
  (4) Mouse/joystick: spell the text by selecting letters from a soft keyboard

Enter symbols
  (1) Pen: draw the symbol where it should be placed
  (2) Mouse/joystick: select the symbol from a menu and indicate where it should be placed
  (3) Voice: say the name of the symbol and where it should be placed
  (4) Keyboard/keypad: enter one or more characters that together represent the symbol

Enter sketches or illustrations
  (1) Pen: draw the sketch or illustration
  (2) Voice: verbally describe the sketch or illustration
  (3) Mouse/joystick: create the sketch by moving the mouse so it leaves a trail (similar to an Etch-a-Sketch™)
  (4) Keyboard/keypad: impossible

Page 125

Answer to Exercise 7: Write a grammar for zero to nineteen

<grammar type="application/srgs+xml" root="zero_to_19" mode="voice">

  <rule id="zero_to_19">
    <one-of>
      <ruleref uri="#single_digit"/>
      <ruleref uri="#teens"/>
    </one-of>
  </rule>

  <rule id="single_digit">
    <one-of>
      <item> zero </item>
      <item> one </item>
      <item> two </item>
      <item> three </item>
      <item> four </item>
      <item> five </item>
      <item> six </item>
      <item> seven </item>
      <item> eight </item>
      <item> nine </item>
    </one-of>
  </rule>

  <rule id="teens">
    <one-of>
      <item> ten </item>
      <item> eleven </item>
      <item> twelve </item>
      <item> thirteen </item>
      <item> fourteen </item>
      <item> fifteen </item>
      <item> sixteen </item>
      <item> seventeen </item>
      <item> eighteen </item>
      <item> nineteen </item>
    </one-of>
  </rule>

</grammar>

Page 126

Answer to Exercise 8

<grammar type="application/srgs+xml" root="yes" mode="voice">
  <rule id="yes">
    <one-of>
      <item> yes </item>
      <item> sure </item>
      <item> affirmative </item>
      …
    </one-of>
  </rule>
</grammar>

Page 127

Answer to Exercise 9

<grammar type="application/srgs+xml" root="yes" mode="voice">
  <rule id="yes">
    <one-of>
      <item> yes </item>
      <item> sure <tag> out = "yes" </tag> </item>
      <item> affirmative <tag> out = "yes" </tag> </item>
    </one-of>
  </rule>
</grammar>

Page 128

Answer to Exercise 10

Given the following two EMMA specifications, what is the unified EMMA specification?

<interpretation mode="speech">
  <moneyTransfer>
    <sourceAcct hook="ink"/>
    <targetAcct hook="ink"/>
    <amount> 300 </amount>
  </moneyTransfer>
</interpretation>

<interpretation mode="ink">
  <moneyTransfer>
    <sourceAcct> savings </sourceAcct>
    <targetAcct> checking </targetAcct>
  </moneyTransfer>
</interpretation>

Unified interpretation:

<interpretation id="intp1">
  <moneyTransfer>
    <sourceAcct> savings </sourceAcct>
    <targetAcct> checking </targetAcct>
    <amount> 300 </amount>
  </moneyTransfer>
</interpretation>

Page 129

Answer to Exercise 11

<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events"
      xmlns:xv="http://www.w3.org/2002/xhtml+voice">

<head>
  <xv:sync xv:input="in1" xv:field="#answer"/>
  <vxml:form id="stateForm">
    <vxml:field name="state" xv:id="answer">
      <vxml:prompt>Say a state name</vxml:prompt>
      <vxml:grammar src="state.grxml"/>
    </vxml:field>
  </vxml:form>
</head>

<body>
  <form ev:event="load" ev:handler="#stateForm">
    Result: <input type="text" name="in1"/>
  </form>
</body>

</html>

Page 130

Answer to Exercise 12

What should VoiceXML do when it receives each of the following events?

• Reset – Reset the value

• Change – Change the value

• Focus – Prompt for the value now in focus

Page 131

Answer to Exercise 13

What should HTML do when it receives each of the following events?

• Reset – Reset the value; the author decides whether the cursor should be moved to the reset value

• Change – Change the value; the author decides whether the cursor should be moved to the changed value

• Focus – Move the cursor to the item in focus
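As a closing sketch, the three behaviors above could be implemented in an XHTML handler along the lines below, reusing the <handler> construct from the earlier Modalities slides. The event payload fields (event.type, event.field, event.value) are hypothetical, since the slides do not fix a concrete syntax for the data event.

<!-- Sketch only: an XHTML handler implementing reset, change, and focus
     for a hypothetical lifecycle data event. The payload fields
     event.type, event.field, and event.value are illustrative. -->
<handler type="text/javascript" ev:event="data">
  var field = document.getElementById(event.field);   // e.g. "city"
  if (event.type == "reset")  { field.value = ""; }
  if (event.type == "change") { field.value = event.value; }
  if (event.type == "focus")  { field.focus(); }
</handler>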