CS 160: Lecture 24 - people.eecs.berkeley.edujfc/cs160/F04/lectures/lec24/lec24… · 11/24/2004 1...
11/24/2004 1
CS 160: Lecture 24
Professor John Canny
Fall 2004
Speech: the Ultimate Interface?
In the early days of HCI, people assumed that speech/natural language would be the ultimate UI (Licklider’s OLIVER).
There have been sophisticated attempts to duplicate such behavior (e.g. Extemposystems, Verbot), but text seems to be the preferred communication medium.
MS Agents are an open architecture (you can write new ones). They can do speech I/O.
Speech: the Ultimate Interface?
Critique that assertion…
Advantages of GUIs
Support menus (recognition over recall).
Support scanning for keyword/icon.
Faster information acquisition (cursory readings).
Fewer affective cues.
Quiet!
Advantages of speech?
Less effort and faster for output (vs. writing).
Allows a natural repair process for error recovery (if computers knew how to deal with that).
Richer channel: conveys the speaker’s disposition and emotional state (if computers knew how to deal with that).
Multimodal Interfaces
Multi-modal refers to interfaces that support non-GUI interaction.
Speech and pen input are two common examples - and are complementary.
Speech+pen Interfaces
Speech is the preferred medium for subject, verb, object expression.
Writing or gesture provides locative information (pointing, etc.).
Speech+pen Interfaces
Speech+pen for visual-spatial tasks (compared to speech only):
* 10% faster.
* 36% fewer task-critical errors.
* Shorter and simpler linguistic constructions.
* 90-100% user preference to interact this way.
Put-That-There
User points at object, and says “put that” (grab), then points to destination and says “there” (drop).
* Very good for deictic actions (speak and point), but these are only 20% of actions. For the rest, need complex gestures.
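The deictic binding described above can be sketched as matching each spoken "that" or "there" to the pointing event nearest to it in time. This is only an illustration, not the original system's implementation: the event classes, the `resolve_deictics` helper, and the one-second `max_gap` are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class PointEvent:
    t: float       # time of the pointing gesture, in seconds
    target: str    # object or location under the pointer

@dataclass
class WordEvent:
    t: float       # time the word was spoken, in seconds
    word: str

DEICTIC_WORDS = {"that", "there"}

def resolve_deictics(words, points, max_gap=1.0):
    """Bind each deictic word to the pointing event closest to it in time."""
    bindings = []
    for w in words:
        if w.word not in DEICTIC_WORDS:
            continue
        nearest = min(points, key=lambda p: abs(p.t - w.t), default=None)
        if nearest is not None and abs(nearest.t - w.t) <= max_gap:
            bindings.append((w.word, nearest.target))
        else:
            # No gesture close enough: leave the reference unresolved.
            bindings.append((w.word, None))
    return bindings

# Hypothetical input for "put that ... there".
words = [WordEvent(0.2, "put"), WordEvent(0.5, "that"), WordEvent(1.9, "there")]
points = [PointEvent(0.6, "blue square"), PointEvent(2.0, "upper left")]
bindings = resolve_deictics(words, points)
print(bindings)
```

Here "that" binds to the first pointing event and "there" to the second. A word with no gesture inside the time window stays unresolved rather than being guessed.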
Multimodal advantages
Advantages for error recovery:
* Users intuitively pick the mode that is less error-prone.
* Language is often simplified.
* Users intuitively switch modes after an error, so the same problem is not repeated.
Multimodal advantages
Other situations where mode choice helps:
* Users with disability.
* People with a strong accent or a cold.
* People with RSI.
* Young children or non-literate users.
Multimodal advantages
For collaborative work, multimodal interfaces can communicate a lot more than text:
* Speech contains prosodic information.
* Gesture communicates emotion.
* Writing has several expressive dimensions.
Multimodal challenges
Using multimodal input generally requires advanced recognition methods:
* For each mode.
* For combining redundant information.
* For combining non-redundant information: “open this file (pointing)”
Information is combined at two levels:
* Feature level (early fusion).
* Semantic level (late fusion).
Break
Administrative
Final project presentations on Dec 6 and 8.
Presentations go by group number. Groups 6-10 on Monday 6, groups 1-5 on Wednesday 8.
Presentations are due on the Swiki on Weds Dec 8. Final reports due Friday Dec 3rd. Posters are due Mon Dec 13.
Early fusion
[Diagram: vision data, speech data, and other sensor data each feed a feature recognizer; the resulting features are combined (fusion data) and passed to a single action recognizer.]
Early fusion
Early fusion applies to combinations like speech+lip movement. It is difficult because of:
* The need for multimodal (MM) training data.
* The need for closely synchronized data.
* Computational and training costs.
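A minimal sketch of feature-level fusion, assuming two streams that are already synchronized frame by frame. The per-frame feature vectors are concatenated and one recognizer runs over the joint vector. The nearest-centroid classifier, the feature values, and the "ba"/"ga" labels are invented for illustration; a real system would train a joint model on multimodal data.

```python
import math

def fuse_frames(audio_frames, lip_frames):
    """Early fusion: concatenate per-frame feature vectors from both modes."""
    if len(audio_frames) != len(lip_frames):
        raise ValueError("streams must be synchronized frame by frame")
    return [a + l for a, l in zip(audio_frames, lip_frames)]

def mean_vector(frames):
    """Average the fused frames into one joint feature vector."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

def classify(fused_frames, centroids):
    """Single recognizer over the joint features (nearest centroid)."""
    v = mean_vector(fused_frames)
    return min(centroids, key=lambda label: math.dist(v, centroids[label]))

# Hypothetical centroids: the audio features (first two values) are identical,
# so only the lip-opening feature (last value) separates the two syllables.
centroids = {"ba": [0.5, 0.5, 0.0], "ga": [0.5, 0.5, 1.0]}

audio = [[0.5, 0.4], [0.5, 0.6]]   # two audio frames
lips = [[0.1], [0.0]]              # two lip frames, same timestamps
fused = fuse_frames(audio, lips)
label = classify(fused, centroids)
print(label)
```

The length check stands in for the synchronization requirement: if the two streams drift apart, concatenating frames pairs up features that do not belong together.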
Late fusion
[Diagram: vision data, speech data, and other sensor data each feed their own feature recognizer and then their own action recognizer; the recognizer outputs are combined (fusion data) to produce the recognized actions.]
Late fusion
Late fusion is appropriate for combinations of complementary information, like pen+speech.
* Recognizers are trained and used separately.
* Unimodal recognizers are available off-the-shelf.
* It’s still important to accurately time-stamp all inputs: typical delays are known between e.g. gesture and speech.
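A rough sketch of semantic-level fusion for an utterance like "open this file" plus a pointing gesture. Each unimodal recognizer is assumed to return an n-best list of partial interpretations with a probability and a timestamp. Pairs are merged only if they fall within a time window and their slots do not conflict. The `Hypothesis` class, the slot names, and the 1.5 s `max_lag` are assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import product

@dataclass
class Hypothesis:
    meaning: dict   # partial semantic frame from one recognizer
    prob: float     # recognizer's probability for this interpretation
    t: float        # timestamp of the input, in seconds

def late_fuse(speech_nbest, pen_nbest, max_lag=1.5):
    """Merge compatible speech and pen hypotheses, best joint score first."""
    results = []
    for s, p in product(speech_nbest, pen_nbest):
        if abs(s.t - p.t) > max_lag:
            continue  # too far apart in time to belong together
        if any(k in p.meaning and p.meaning[k] != v for k, v in s.meaning.items()):
            continue  # the two modes disagree on a shared slot
        results.append(({**s.meaning, **p.meaning}, s.prob * p.prob))
    return sorted(results, key=lambda r: r[1], reverse=True)

speech = [
    Hypothesis({"action": "open", "type": "file"}, 0.7, 1.0),
    Hypothesis({"action": "close", "type": "file"}, 0.3, 1.0),
]
pen = [
    Hypothesis({"type": "file", "object": "report.txt"}, 0.8, 1.3),
    Hypothesis({"type": "folder", "object": "Docs"}, 0.2, 1.3),
]
fused = late_fuse(speech, pen)
best_meaning, best_prob = fused[0]
print(best_meaning)
```

Multiplying the two probabilities treats the recognizers as independent, which is a simplification. The folder hypothesis is dropped because speech said "file", showing how one mode can prune the other's errors.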
Examples
Speech understanding:
* Feature recognizers = Phoneme, “Moveme”
* Action recognizer = word recognizer
Gesture recognition:
* Feature recognizers = Movemes (from different cameras)
* Action recognizers = gesture (like stop, start, raise, lower)
Exercise
What method would be more appropriate for:
* Pen gesture recognition using a combination of pen motion and pen tip pressure?
* Destination selection from a map, where the user points at the map and says the name of the destination?
Contrast between MM and GUIs
GUI interfaces often restrict input to single non-overlapping events, while MM interfaces handle all inputs at once.
GUI events are unambiguous, whereas MM inputs are (usually) based on recognition and require a probabilistic approach.
MM interfaces are often distributed on a network.
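One way to picture the probabilistic side of this contrast: a GUI click handler can act immediately, while a recognizer returns ranked guesses that the interface must decide how to treat. The sketch below assumes an n-best list of (interpretation, probability) pairs. The `decide` helper and its two thresholds are hypothetical, not taken from any particular system.

```python
def decide(nbest, accept=0.8, confirm=0.5):
    """Choose how to respond to a ranked list of (interpretation, probability)."""
    best, p = max(nbest, key=lambda h: h[1])
    if p >= accept:
        return ("execute", best)    # confident enough to act directly
    if p >= confirm:
        return ("confirm", best)    # ask the user "did you mean ...?"
    return ("reprompt", None)       # too uncertain; ask the user to repeat

# Hypothetical recognizer output for one utterance.
outcome = decide([("open file", 0.62), ("open pile", 0.30)])
print(outcome)
```

The confirmation branch is where a multimodal interface can invite the user to switch modes, for example by pointing instead of repeating the phrase.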
Agent architectures
Allow parts of an MM system to be written separately, in the most appropriate language, and integrated easily.
OAA: Open-Agent Architecture (Cohen et al.) supports MM interfaces.
Blackboards and message queues are often used to simplify inter-agent communication.
* Jini, Javaspaces, Tspaces, JXTA, JMS, MSMQ...
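The blackboard idea can be sketched in a few lines: agents never call each other directly, they only post to and subscribe to topics on a shared board. This is a toy in-process version for illustration. It does not reproduce the API of OAA or of any of the toolkits listed above, and the `Blackboard` class and topic names are made up.

```python
from collections import defaultdict

class Blackboard:
    """Shared store where agents post entries and react to topics."""

    def __init__(self):
        self._subscribers = defaultdict(list)
        self.entries = []   # full history of (topic, data) postings

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def post(self, topic, data):
        self.entries.append((topic, data))
        for callback in list(self._subscribers[topic]):
            callback(data)

board = Blackboard()
commands = []

def interpreter(words):
    """Hypothetical agent: turns recognized words into a command entry."""
    if words[:1] == ["open"]:
        board.post("command", {"action": "open", "target": words[-1]})

board.subscribe("speech.words", interpreter)
board.subscribe("command", commands.append)

# A speech agent (not shown) would post its recognition result like this.
board.post("speech.words", ["open", "report"])
print(commands)
```

Because the agents share only topic names and data formats, each one could in principle run in a separate process or language, with the board replaced by a networked message queue.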
Symbolic/statistical approaches
Allow symbolic operations like unification (binding of terms like “this”) + probabilistic reasoning (possible interpretations of “this”).
The MTC system is an example:
* Members are recognizers.
* Teams cluster data from recognizers.
* The committee weights results from various teams.
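As a loose illustration of the member/team/committee layering, the sketch below combines recognizer outputs in two weighted stages. It is not the published MTC algorithm: the weighted-average rule, the team groupings, and all the numbers are assumptions chosen only to show the structure.

```python
def weighted_average(distributions, weights):
    """Combine label->probability dicts using the given weights."""
    total = sum(weights)
    labels = set().union(*distributions)
    return {
        label: sum(w * d.get(label, 0.0) for d, w in zip(distributions, weights)) / total
        for label in labels
    }

# Members: individual recognizers, each scoring the candidate commands.
speech = {"pan": 0.6, "zoom": 0.4}
gesture = {"pan": 0.3, "zoom": 0.7}
gaze = {"pan": 0.5, "zoom": 0.5}

# Teams: each groups some members (hypothetical groupings and weights).
team_a = weighted_average([speech, gesture], [1, 1])
team_b = weighted_average([gesture, gaze], [1, 1])

# Committee: weights the team results to reach a final interpretation.
final = weighted_average([team_a, team_b], [0.5, 0.5])
decision = max(final, key=final.get)
print(decision)
```

In a trained system the weights at both levels would be learned from data rather than set by hand.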
MTC architecture
Probabilistic Toolkits
The “graphical models toolkit” from U. Washington (Bilmes and Zweig).
* Good for speech and time-series data.
MSBNx Bayes Net toolkit from Microsoft (Kadie et al.)
UCLA MUSE: middleware for sensor fusion (also using Bayes nets).
MM systems
Designers Outpost (Berkeley)
MM systems: Quickset (OGI)
Crossweaver (Berkeley)
Crossweaver is a prototyping system for multi-modal (primarily pen and speech) UIs.
It also allows cross-platform development (for PDAs, Tablet-PCs, desktops).
Summary
Multi-modal systems provide several advantages.
Speech and pointing are complementary.
Challenges for multi-modal.
Early vs. late fusion.
MM architectures, fusion approaches.
Examples of MM systems.