The Universal Speech Interface (USI) PDG Progress Report
Thomas Harris, Stefanie Tomko, Arthur Toth, James Sanders,
Alex Rudnicky, Roni Rosenfeld
School of Computer Science
Carnegie Mellon University
4 June 2003
Outline
• USI Project Summary
• USI Device Control
• USI User Studies
• Tech Transfer Initiative
– USI Application Generator
Program Goals and Plan
• Overall program goal:
– Design a universal (i.e. device-independent) interface for speech-based interaction with wearable and home devices
• Program plan & milestones:
– Q1: analysis, interaction principles
– Q2: build device-simulation environment
– Q3: build first device prototype
– Q4: initial user studies; development tools
Program Deliverables
• A novel universal design for speech-based interaction with wearable- and home-devices
• At least one demonstration system exemplifying the new interface
• A set of tools for rapid prototyping of compliant applications
The Universal Speech Interface (USI) In a Nutshell
• Unifying approach to human-machine speech communication
• Unified “look and feel” across all applications
– analogous to the Xerox/Macintosh/Windows GUI look-and-feel
• Stylized, semi-natural interaction
– analogous to the “Graffiti” alphabet for the Palm PDA
Existing Speech Paradigm 1: Command-and-control Systems
• Specialized language, optimized for a given application
– each application has its own interface
• Intensive training of each user
• Daily use helps retain knowledge
Existing Speech Paradigm 2: Unconstrained Dialog Systems
• “Off-the-street” users, no training required
• System models existing human behavior
• But this comes at a cost:
– each application requires a great deal of data, labor, human expertise
– Speech Recognition technology is pushed to the limit
– user does not easily grasp the application’s functional limits
  • Out-Of-Vocabulary words (OOV)
  • Out-Of-Domain concepts, requests
Is a Third Paradigm Needed?
• In practice, people are likely to use:
– a handful of apps daily:
  • scheduler, contact manager, email, ...
– many apps occasionally:
  • weather, restaurants, ...
• To exploit this, we need:
– flexible, powerful interface for familiar applications
– immediate engagement with occasional or new applications
Our Approach
• Identify application-independent universals:
– user-side
– machine-side
• Find suitable, general solutions
– Human and machine meeting halfway
• Design a stylized, universal “look and feel”
• Teach it in 5 minutes
Universal Semantic primitives
• Help primitives
– what can the machine do? how do I do X? what can I say?
• Speech channel primitives
– detect & correct ASR errors; finished talking?
• Interaction primitives
– turn taking; question answering; session management; undo
• Application primitives
– environment variables: query, set
– objects (e.g. lists): describe, navigate, create, modify, delete
USI Systems Developed
• Information Access
– MovieLine
– FlightLine
– ApartmentLine
• Device Control
– Stereo system
– X-10 control (e.g., lights)
– Alarm Clock applet
– Digital Video Camera
– Windows Media Player
Device Interaction Analysis
• Analysis was done on multiple devices
– alarm clock / radio
– VCR
– cell phone
– MP3 player
– memo pad / email / vmail
– copier/fax
USI/Device Design Issues
• Confirmation strategy
• Error handling strategy
• Exploration
• Navigation
• Disambiguation / context mgmt
• Orientation
• Querying state variables
USI/Device Design Issues
• Confirmation strategy: restate-&-execute
• Error handling strategy: ignore
• Exploration: “OPTIONS”
• Navigation: use concept of ‘focus’
• Disambiguation / context mgmt: implicit
• Orientation: “STATUS”
• Querying state variables: “WHAT IS THE...?”
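The design choices listed above can be illustrated with a toy controller. This is only a sketch of one possible reading of the slide, not the James implementation: the class `ToyStereo`, its method `handle`, and the response wording are all hypothetical, and "restate-&-execute" is assumed here to mean that the system repeats what it understood and carries it out without asking for confirmation. The state variable names (`mode`, `radioband`) are borrowed from the stereo example.

```python
class ToyStereo:
    """Hypothetical toy device; not the James implementation."""

    def __init__(self):
        # Allowed values per state variable (names borrowed from the stereo example).
        self.allowed = {
            "mode": ["tuner", "auxiliary", "cd"],
            "radioband": ["am", "fm"],
        }
        self.state = {"mode": "tuner", "radioband": "fm"}
        self.focus = None  # state variable currently in focus, if any

    def handle(self, utterance):
        """Return the spoken response, or None when the input is ignored."""
        text = utterance.strip().lower().rstrip("?")
        if text == "options":  # exploration
            choices = self.allowed if self.focus is None else self.allowed[self.focus]
            return "options: " + ", ".join(choices)
        if text == "status":  # orientation
            return ", ".join(f"{var} {val}" for var, val in self.state.items())
        if text.startswith("what is the "):  # querying a state variable
            var = text[len("what is the "):]
            return f"{var} is {self.state[var]}" if var in self.state else None
        if text in self.allowed:  # navigation: naming a variable moves the focus
            self.focus = text
            return text
        words = text.split()
        if len(words) == 2 and words[1] in self.allowed.get(words[0], []):
            var, value = words  # full form, e.g. "mode cd"
        elif len(words) == 1:
            # Shortcut form, e.g. "cd": resolve the variable implicitly,
            # preferring the one in focus when several would match.
            matches = [v for v, vals in self.allowed.items() if words[0] in vals]
            if self.focus in matches:
                matches = [self.focus]
            if len(matches) != 1:
                return None
            var, value = matches[0], words[0]
        else:
            return None  # error handling: ignore
        self.state[var] = value
        self.focus = var
        return f"{var} {value}"  # restate what was executed
```

With this sketch, `ToyStereo().handle("MODE CD")` returns `"mode cd"` and moves the focus to `mode`, so a following `"OPTIONS"` lists the three modes; an utterance the toy cannot interpret returns `None`, which stands in for the "ignore" error strategy.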
Hooking up with the PUC project
• Fits within the PUC project’s vision of automatically generated interfaces with different modalities and form factors
• But, can also be used as a standalone speech interface
• Compatibility with visual design is desirable, but not always natural:
– nameless states (speech interface must have name for everything!)
– speech interface can have shortcuts (“MODE: CD” vs. “CD”)
Meshing with the PUC project
• Device capabilities specified by XML doc
• States vs. Action dichotomy of the visual interface does not always conform to speech interface intuition.
• For now, creating our own interface specification document
• Ultimately, will augment XML DTD, so both interfaces can co-exist
USI Device control (a.k.a. James the Butler)
[Diagram: command tree rooted at "James", with branches "Stereo" and "digital camera...". Stereo commands: on, off, x-bass, volume up, volume down, volume off, and (mode) <turns stereo on> with values tuner, auxiliary, cd. Tuner: (radioband) am / fm, each with frequency... and station...; seek forward / backward. CD: (status) play, pause, stop; disc #; next track, last track; random..., repeat...]
Hardware hacking courtesy of the PUC project
User study
• Compared Speech Graffiti (SG) & natural language MovieLines
• How does Speech Graffiti compare to a natural language interface?
– Subjective user satisfaction
– Task completion rates
– Word error rates
• How well do users "get" Speech Graffiti?
– How often do they speak within the grammar?
– In what ways do they deviate from the grammar?
Subjective user satisfaction
• 17 of 23 preferred Speech Graffiti (SG)
[Figure: mean user satisfaction rating (scale 1 to 7) for NL-ML and SG-ML, by category: system response accuracy, likeability, cognitive demand, annoyance, habitability, speed, overall]
• SG user satisfaction ratings higher than NL in all categories
• SG ratings positive except in annoyance & habitability
Computer experience & training
• Computer Science / Engineering backgrounds and / or programming experience
– Higher user satisfaction ratings
– Better task completion rates
• Training in-domain vs. out-of-domain
– No differences in user satisfaction or task completion rates
Task completion
• Overall
– 67.9% SG tasks
– 67.4% NL tasks
• Individual means
– 5.43 of 8 SG tasks
– 5.30 of 8 NL tasks
[Figure: mean task completion rate (0 to 8 tasks) for SG-ML and NL-ML]
Time-to-completion
• Completed tasks
– 67.9 seconds SG
– 73.4 seconds NL
• Incomplete tasks:
[Figure: time, in seconds (0 to 600); panels “best case” and “real world”, each labeled “59 incompletes”; series SG-ML and NL-ML]
Turns-to-completion
• Completed tasks
– 8.2 turns SG
– 3.9 turns NL
• Incomplete tasks:
[Figure: # of turns; panels “best case” and “real world”, each labeled “59 incompletes”; series SG-ML and NL-ML]
Word error rates
• Very high for both systems
– On "cleaned" set (on-task, non-noisy utts)
• Concept error is lower for USI
– SG: –29.2% from WER
– NL: +0.8% from WER
• Low error rate is key to acceptance
– 6 who preferred NL-ML had highest SG WER

            WER      # of utts   subj mean   subj median
SG Movie    35.1%    3626        35.0%       30.0%
NL Movie    51.2%    1854        50.3%       48.9%
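The slides do not say how WER was computed. As a point of reference, the conventional definition is the word-level edit distance (substitutions + deletions + insertions) between the reference transcript and the recognizer output, divided by the number of reference words. The sketch below implements that standard definition; the function name and the example utterances are made up for illustration and are not taken from the study.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical utterances: one deletion ("the") and one substitution
# ("time" -> "times") against a five-word reference.
print(word_error_rate("what is the show time", "what is show times"))  # 0.4
```

Because insertions count as errors, WER computed this way can exceed 100% when the recognizer outputs many more words than were spoken.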
WER & user satisfaction
• Good correlation for SG
[Figure: user satisfaction rating (1 to 6) vs. % word-error rate (0 to 80); panels SG-ML and NL-ML]
How often do users speak within the Speech Graffiti grammar?
• Actually, pretty often!
– mean 80.5%, median 87.4%
… and
• grammaticality leads to user satisfaction
[Figure: user satisfaction rating (1 to 7) vs. % grammatical (0% to 100%)]
How do users deviate from the grammar?
[Pie chart: share of ungrammatical utterances by deviation type]
slot only                   14.6%
time syntax                  1.3%
subject-verb agreement       5.7%
more syntax                  4%
plural+options               2%
disfluency                   4.3%
keyword problem              8.1%
value+options                1%
missing is/are              11%
endpoint                     1.6%
value only                   6.7%
out-of-vocabulary concept    5.1%
out-of-vocabulary word      14.0%
general syntax              20.6%
Future Interface Design Work
• Redesign Help facility
– SG works best for those who "get it"
– Current system provides no assistance to "clueless user"
• Error analysis
– Compare failure cases in SG and NL interfaces
– Compare user recovery attempts in SG and NL
• Address issues of generalizability
– Promoting transparency of slot set and response sets
– Accessing information sets rather than single items
• Adjust grammar components
Future Architecture Work
• Integrate current USI environments
– Information Access
– Device Control
• Improve interface between PUC and USI components
• Identify USI-specific techniques to achieve lower WER
• Improved documentation and distribution packaging
Tech Transfer Initiative
• Tools for creating new USI apps
– 3 days to create a new application
– prior exposure to speech technology highly beneficial
– decided to further reduce the barrier: create an application generator
From 3 Days to a Few Hours
• A USI Application Generator
• New USI applications w/out programming!
• XML document fully specifies the application
– slot names
– accepted inputs
– data types
– slot properties
– ...
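The slides do not show the XML format itself. The sketch below illustrates the general idea of an application defined entirely by a document listing slot names, accepted inputs, data types, and slot properties. Every element and attribute name (`application`, `slot`, `input`, `type`, `required`) and the slot names are invented for illustration and should not be taken as the actual USI schema; only the application name MovieLine comes from the source.

```python
import xml.etree.ElementTree as ET

# Hypothetical specification; the real USI schema is not given in the slides.
SPEC = """
<application name="MovieLine">
  <slot name="title" type="string" required="true"/>
  <slot name="genre" type="enum">
    <input>comedy</input>
    <input>drama</input>
  </slot>
  <slot name="show time" type="time"/>
</application>
"""


def load_spec(xml_text):
    """Parse the specification into (application name, {slot name: details})."""
    root = ET.fromstring(xml_text.strip())
    slots = {}
    for slot in root.findall("slot"):
        slots[slot.get("name")] = {
            "type": slot.get("type"),                              # data type
            "required": slot.get("required", "false") == "true",   # slot property
            "inputs": [item.text for item in slot.findall("input")],  # accepted inputs
        }
    return root.get("name"), slots


name, slots = load_spec(SPEC)
print(name, sorted(slots))
```

A generic runtime could then drive any application from such a table, which is what would make new applications possible without programming.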
From a Few Hours to 15 minutes?
• Created a Web interface for generating the XML document
• Form filling, pulldown menus
• Strong effort to further simplify the process, minimize complexity of form
– many defaults
– for less common choices, edit the XML doc.
• More importantly, no computer savvy needed
Web Application Generator
• Repository and tool for creating USI database applications
• Abundant online help to guide users through process
• Accessible to anyone with an Internet connection
Web Application Generator
• Two step process:
– General specification
– Slot-by-slot specification
  • choose datatype from built-in list, or create own
• Fully featured system with save, copy, delete functionality
• Hides intricacies of XML document writing
• Advanced users have ability to further alter the final XML document
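As a rough illustration of how a form-based tool can hide XML writing, the sketch below turns the two steps (a general specification, then one entry per slot) into an XML document, filling in a default data type when the form leaves it blank. The function `build_spec`, the form field names, the element names, and the example slots are all assumptions made for this sketch; the actual generator's output format is not shown in the slides.

```python
import xml.etree.ElementTree as ET


def build_spec(general, slots):
    """Build a hypothetical application document from form-style input."""
    root = ET.Element("application", name=general["name"])
    for slot in slots:
        element = ET.SubElement(
            root,
            "slot",
            name=slot["name"],
            type=slot.get("type", "string"),  # default when the form omits a type
        )
        for value in slot.get("inputs", []):
            ET.SubElement(element, "input").text = value
    return ET.tostring(root, encoding="unicode")


# Step 1: general specification. Step 2: slot-by-slot specification.
document = build_spec(
    {"name": "ApartmentLine"},
    [
        {"name": "rent", "type": "number"},
        {"name": "neighborhood", "inputs": ["shadyside", "oakland"]},
    ],
)
print(document)
```

The returned string is ordinary XML, so an advanced user could still edit it by hand afterwards.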
Web Application Generator
• Built-in generic voice; can record own voice
• DB backend
– Postgres
– Oracle
– ODBC (including ASCII files)
– Ultimately: web tables
• Platform:
– originally: mixed Unix/Windows, telephone based
– converted to: pure Windows, telephone or laptop
Transferring USI to PDG members
• We do house calls!
– Carnegie Mellon will install USI developer environment for each interested member and will train member staff in the use of the developer environment
– Provide a short tutorial on USI principles and interface design