Multimodal Interaction: an integrated speech and gaze approach


Politecnico di Torino

Multimodal Interaction: an integrated speech and gaze approach

Supervisor (Relatore): Prof. Fulvio Corno
Co-supervisor (Correlatore): Ing. Laura Farinetti
Candidate (Candidata): Alessandra Pireddu

April 2007


Index

Introduction.......................................................................................................................................... 1

Context and Motivations..................................................................................................................4

Related Works..................................................................................................................................6

PART I................................................................................................................................................. 8

Multimodal Interactions....................................................................................................................... 9

Evaluation of Multimodal Systems................................................................................................11

Eye-Gaze Tracking ............................................................................................................................ 12

History of Eye Tracking in HCI.....................................................................................................13

Eye-gaze tracking methods .........................................................................................14

Remote Eye-Gaze Tracker .............................................................................................................15

An example: Infrared eye-gaze tracking....................................................................................15

Commercial Solutions....................................................................................................................17

Future Developments .....................................................................................................................18

Voice .................................................................................................................................................. 20

Automatic Speech Recognition......................................................................................................21

History............................................................................................................................................22

ASR Structure ................................................................................................................................23

Pronunciation Model..................................................................................................................23

Continuity of Speech..................................................................................................................24

Applications of Speech Recognition..........................................................................................24

Command recognition................................................................................................................24

Dictation.....................................................................................................................................24

Speech Synthesis............................................................................................................................26

Vocal Platforms .............................................................................................................................27

Loquendo VoxNauta......................................................................................................................28

VoiceXML .....................................................................................................................................32

Commercial Solutions....................................................................................................................37

Screen Readers...............................................................................................................................38

Assistive Technology.................................................................................................................39

Accessibility...............................................................................................................................39

Use of DOM and MSAA ...........................................................................................................40

MSAA interfaces and methods ..................................................................................................42

Commercial Solutions................................................................................................................46

PART II.............................................................................................................................................. 47

Integrating eye-gaze tracking and voice recognition ......................................................................... 48

Requirement analysis ......................................................................................................................... 49

Proposed Solution ..........................................................................................................................50

A. Eye tracker simulator ........................................................................................................52

B. Screen Reader....................................................................................................................52

C. Vocal Platform ..................................................................................................................53

D. Link eye tracking - voice ..................................................................................................54

E. Command Execution .........................................................................................................55

Unit Development.............................................................................................................................. 56

Eye-Gaze Tracker Unit ..................................................................................................................56

Objects Position .........................................................................................................................58

Screen Reader Unit ........................................................................................................................58


Link eye tracking – voice...............................................................................................................62

HTTP Message Interpreter.........................................................................................................63

Receiver .....................................................................................................................................63

VoiceXML Unit .............................................................................................................................64

The voicexml application and grammar ....................................................................................64

Testing................................................................................................................................................ 66

Test Description .............................................................................................................................67

Test Goals ......................................................................................................................................68

Test 1: Context...............................................................................................................................69

Test 1: Result .................................................................................................................................72

Test 2..............................................................................................................................................79

Test Results ....................................................................................................................................81

1. Quality and usability requirements fulfillment ......................................................................81

2. Performances requirements fulfillment..................................................................................81

Conclusions........................................................................................................................................ 82

References.......................................................................................................................................... 84

Index of Figures

Figure 1: Human-computer interactions _______________________________________________ 9

Figure 2: infrared tracking system mounted on a monitor________________________________ 15

Figure 3: physical model of the eye_________________________________________________ 16

Figure 4: ASR _________________________________________________________________ 23

Figure 5: Speech synthesis process _________________________________________________ 26

Figure 6: vocal platform scenario ___________________________________________________ 27

Figure 7: Typical usage scenario ___________________________________________________ 28

Figure 8: VoxNauta Lite Platform __________________________________________________ 29

Figure 9: VXML interpreter_______________________________________________________ 32

Figure 10: MSAA & AT _________________________________________________________ 40

Figure 11: scenario______________________________________________________________ 49

Figure 12: System Overview (1) ___________________________________________________ 51

Figure 13: System Overview (2) ___________________________________________________ 51

Figure 14: eye tracker simulator ___________________________________________________ 52

Figure 15: screen reader module ___________________________________________________ 53

Figure 16: vocal platform_________________________________________________________ 53

Figure 17: Socket Structure _______________________________________________________ 54

Figure 18: Gazing Block _________________________________________________________ 57

Figure 19: Visual Attributes_______________________________________________________ 58

Figure 20: the test ______________________________________________________ 67

Figure 21 _____________________________________________________________________ 70

Figure 22 _____________________________________________________________________ 72

Figure 23 _____________________________________________________________________ 73

Figure 24 _____________________________________________________________________ 75

Figure 25 _____________________________________________________________________ 76

Figure 26 _____________________________________________________________________ 77

Figure 27 _____________________________________________________________________ 78

Figure 28 _____________________________________________________________________ 79


INTRODUCTION

Context and Motivations

In recent years, studies on Human-Computer Interaction (HCI) have pursued human-machine communication that is as natural and realistic as possible. In 1968, the movie "2001: A Space Odyssey" presented a form of human-computer interaction that was impossible to achieve at the time. Today many research efforts move in that direction, driven by new software and by advances in science and technology.

To develop a multimodal system, we first have to analyze the parties involved in the interaction. Computers communicate with the outside world through few channels and limited interfaces; human communication, on the other hand, draws on a large variety of elements and means. Just think of a dialogue between people in which, in addition to voice, gestures and sight are used. The ultimate goal of multimodal systems that integrate speech and gaze is to approach the ease and robustness of human communication by combining automatic speech recognition with non-verbal methods, such as eye-gaze tracking, to improve the output of a multimodal application.

Human-Computer Interaction addresses five interaction methods: menu selection, form fill-in, command line, natural language and direct manipulation. In this study we focus on natural language and direct manipulation. Direct manipulation allows users to select and easily manipulate one or more objects in a defined context; in an environment with a graphical interface, for example, it is obtained through pointing devices such as the mouse, a joystick or the keyboard. Direct manipulation requires that developers create a usable, simple, predictable and verifiable working environment: pointing devices and interaction channels similar to the human ones are used to interact with, manipulate and move objects. This sort of interaction requires that all the available visual elements be used to give the user a clear mental picture of what he or she wants to achieve; therefore, the context and the operations being performed have to be clear to the user.

Eye-gaze tracking systems follow the direct manipulation model, and they make it possible to use devices other than the traditional keyboard or mouse in a Human-Computer Interaction interface. The eyes move fast and provide faster input than pointing devices, and they are easy to use: users normally need no particular coordination to look at objects. Furthermore, the eyes reveal where the user is focusing attention, and the computer can translate every change of focus into a pointer movement. The problems in using the eyes as a computer input device are related to "unconscious" eye movements, because a precise and deliberate control of eye position is impossible. Traditional pointing devices use an event (such as a button press) to generate other events; eye-gaze tracking, which has to emulate traditional pointing devices, instead depends on a continuous data stream that has to be manipulated, processed and interpreted while other tasks are executed. The data obtained from eye-gaze tracking carry information about the screen coordinates of a point, the fixation point and its duration, and they have to be interpreted and processed on a spatial-temporal basis to simulate the events expected from traditional pointing devices. The remaining problems of eye-gaze tracking systems are essentially due to the fact that the tracking equipment is less stable and accurate than other pointing devices, and that calibration has to be performed manually.

The use of voice in Human-Computer Interaction aims to obtain a communication channel similar to the one used between humans, reproducing the robustness, ease and naturalness of human-to-human communication. In human vocal interaction the listener waits for the speaker to complete a thought; even when the listener has no knowledge of the environment in which the conversation takes place (for example during a phone call), the message is perceived and understood with no additional effort. Human-computer vocal interaction is obtained using speech synthesis and speech recognition, and it is composed of several activities that require computing time. Speech synthesis, which can be realized economically with dedicated hardware and software, receives an input text and produces a vocal output. Speech recognition implies a match between the input signal and a set of model signals: words are only recognized, not understood.

The main goal in developing multimodal systems is to combine the benefits of different systems while trying to overcome their limitations. As said before, eye-tracking systems emulate traditional pointing devices and offer higher speed; they should also be robust and inexpensive, which is quite a challenge in the computer vision field. Today very accurate eye trackers based on external devices are available. Most modern eye trackers use contrast to locate the centre of the pupil, use infrared cameras to create a corneal reflection, and triangulate the two to determine the fixation point. Eye-tracking setups nevertheless vary greatly: some are head-mounted, some require the head to be stable, and some automatically track the head as well. The contribution of speech recognition and synthesis allows the user's intentions and needs to be identified with great accuracy, but it makes the architecture complex. The quality of the synthesized speech is also a cost factor: Loquendo estimated that producing a voice costs about 50,000 euros. The performance of a speech recognition system is speaker dependent and depends on vocabulary size and signal quality; recognizing natural speech adds further complexity to the system.

The goal of this study is to analyze multimodal interaction using speech recognition, speech synthesis and eye-gaze tracking methods. Motivated by the considerations above, this thesis proposes an easy-to-use, open-source multimodal architecture and develops a prototype that integrates an eye-gaze tracking system and a vocal platform. The tracking system is obtained through simulation, so that the prototype can later be integrated with universal drivers for real eye-gaze tracking systems. From eye motion we retrieve the fixation point on the screen, which is the point on which the user is focusing his or her gaze. The eye-gaze tracker selects an area of the screen, called the gazing block, whose boundaries are defined by a parameter set by the operator who configures the tracking system. In a graphical interface, such as a computer screen, the WIMP (Windows, Icons, Menus and Pointer) paradigm represents the objects belonging to the environment. The user interacts with our application by naming the desired object with spoken words, specifying a command that is either a specific object name or an action. The vocal platform manages the spoken words through the VXML interpreter, which is guided by the VoiceXML unit to interpret the result completely and accurately; the unit interprets the messages sent by the vocal platform, processes them and sends the result to the main application unit. The VoiceXML unit is developed in VXML, following the W3C standard for interactive voice dialogues between a human and a computer, which allows voice applications to be developed and deployed in a way analogous to Web applications. After receiving the recognition result, the application performs a match between the received command and the objects selected by the eye tracker; the result of the match is an action performed automatically by the system.
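To make the matching step concrete, the following Python sketch shows one possible way to combine the gazing block with the recognized command. It is an illustration only, not the prototype's actual code: the names (ScreenObject, GAZING_BLOCK_SIZE, objects_in_gazing_block, match_command) and the square shape of the block are assumptions made here for clarity.

from dataclasses import dataclass
from typing import List, Optional

GAZING_BLOCK_SIZE = 200  # side of the square gazing block, in pixels (assumed configurable)

@dataclass
class ScreenObject:
    name: str   # name of the widget as exposed by the screen reader (e.g. "Open")
    x: int      # screen coordinates of the object's centre
    y: int

def objects_in_gazing_block(fx: int, fy: int, objects: List[ScreenObject]) -> List[ScreenObject]:
    # Keep only the objects whose centre lies inside the block around the fixation point.
    half = GAZING_BLOCK_SIZE // 2
    return [o for o in objects if abs(o.x - fx) <= half and abs(o.y - fy) <= half]

def match_command(command: str, candidates: List[ScreenObject]) -> Optional[ScreenObject]:
    # Match the word returned by the vocal platform against the candidate objects.
    for obj in candidates:
        if obj.name.lower() == command.lower():
            return obj
    return None

# Example: the user fixates near (400, 300) and says "open".
screen = [ScreenObject("Open", 390, 310), ScreenObject("Save", 900, 40)]
target = match_command("open", objects_in_gazing_block(400, 300, screen))
if target is not None:
    print(f"Executing action on '{target.name}' at ({target.x}, {target.y})")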

Since the application depends largely on the user's quickness in performing actions, we evaluated performance by focusing on the time needed to build the VXML grammar and the time needed to search for an object in the gazing block. We obtained values between 0.015 and 0.406 seconds for insertion, and search times of effectively zero seconds, which is an excellent result. The thesis was developed using VoxNauta Lite 6.0 by Loquendo, an Italian company that is a leader in the field of vocal applications and platforms.

Related Works

Several speech- and gaze-driven multimodal systems have been developed in recent years.

In "Speech-augmented eye gaze interaction with small closely spaced targets", Miniotas, Špakov, Tugoy and MacKenzie developed a multimodal pointing technique combining eye gaze and speech input. The technique was tested in a user study on pointing at multiple targets. The results suggest that, in terms of a footprint-accuracy tradeoff, pointing performance is best (~93%) for targets subtending 0.85 degrees with 0.3-degree gaps between them.

Zhang, Q., Imamiya, A., Go, K. and Gao, X., in "Overriding errors in a speech and gaze multimodal architecture" (Proc. 9th International Conference on Intelligent User Interfaces, 2004, 346-348), consider gaze as a disambiguation channel for speech or multimodal input. Eye-gaze pattern-based interaction systems, like any other recognition-based system (e.g. voice), can produce both false alarms and misses. Some of these limitations can be overcome with more advanced techniques such as statistical learning, but more importantly ambiguity is dramatically reduced when multiple modalities are combined, thanks to mutual disambiguation effects. If eye-gaze patterns alone can be expected to succeed most of the time, their role becomes even more powerful when combined with other modalities such as speech recognition.

The COVIRDS (COnceptual VIRtual Design) system, for example, provides a 3D environment with a multimodal interface for Virtual Reality-based CAD. Speech and gesture input were subsequently used to develop an intuitive interface for concept shape creation (Chu et al., 1998). A series of tasks was implemented using different modalities (zoom-in, viewpoint translation/rotation, selection, resizing, translation, etc.), and the evaluation of the interface was based on user questionnaires. Voice was intuitive to use for abstract commands such as viewpoint zooming and object creation/deletion, while hand gestures were effective in spatial tasks (resizing, moving). Some tasks (resizing, zooming in a particular direction) were performed better when voice and hand input were combined. The command language was very simple and the integration of modalities was implemented at the syntax level; as a result, in some cases users preferred a simple input device (a wand with 5 buttons) to multimodal input.

A multimodal framework for object manipulation in Virtual Environments is presented in (Sharma et al., 1996). Speech, gesture and gaze input were integrated in a multimodal architecture aimed at improving virtual object manipulation. The speech input uses a Hidden Markov Model (HMM) recognizer, while the hand gesture input module uses a pair of cameras and HMM-based recognition software. Speech and gesture are integrated using a fixed syntax, <action> <object> <modifier>; the user command language is kept rigid to allow easier synchronization of the input modalities. The synchronization process assumes modality overlapping: the lag between the speech and gesture input is considered to be at most one word. The functionality of the gaze input is reduced to providing complementary information for gesture recognition; the direction of gaze, for example, can be exploited to disambiguate object selection. A test bench using speech and hand gesture input was implemented for visualization and interactive manipulation of complex molecular structures. The multimodal interface allows much better interactivity and user control compared with the unimodal, joystick-based input.


PART I

In this part we describe the elements that allow us to define the problem.

In the first section we describe multimodal interaction applications and their use.

The second section refers to eye-gaze tracking systems.

The third section details vocal applications and their use.


Multimodal Interactions

Research on multimodal systems is based on the expectation that Human-Computer Interaction can benefit from modeling several communication modalities in ways analogous to those used by humans. In a conversation between two people, both are aware of the spoken language, its content, prosody and pitch, but they are also aware of body position, facial expressions and the environment. The ease and robustness of human-to-human communication are due to high recognition accuracy obtained by using multiple input channels and redundant, complementary modalities. In human-computer communication, both the user's intentions and the computer's output have to be translated, by means of several algorithms, into a form that both parties can easily understand. The figure below shows an example of human-computer interaction:

Figure 1: Human-computer interactions

The user communicates with the computer using physical devices such as the keyboard or the mouse; to do so, the user activates motor muscles, such as those of the hands or the vocal apparatus. The information sent to the computer through input devices can be translated at different abstraction levels, providing different levels of understanding of the user's intentions. The computer communicates with the user through several output devices, such as the screen or loudspeakers. On these output devices the computer can present static data (static images or prerecorded audio) or dynamic data (text, images, speech synthesis).

In the following lines we give some definitions concerning multimodal systems.

- Multimodal systems represent and process information coming from different human communication channels at multiple abstraction levels. They can automatically extract meaning from multimodal input data and, conversely, produce perceivable information from symbolic abstract representations.


- Multimodal interface: an interface that combines several input and output modalities. The goal is to facilitate human-computer interaction by using the same communication channels that people naturally employ when they communicate, chiefly the voice.

As we said, using different modalities to provide complementary information can facilitate interaction: for instance, it is easier to refer to graphic objects by speaking than by choosing them from a menu with a pointing device. Combining speech recognition and eye tracking can therefore increase interpretation accuracy.

Here we examine the strengths and weaknesses of each modality across tasks:

- Speech input: preferable to traditional input modalities such as keyboard or mouse in tasks where the hands are occupied (car navigation), where mobility is necessary (equipment check-ups) or simply where speech input is more convenient (automated call centers). Speech input is not preferable in intrinsically visual tasks (navigation, drawing tasks, ...).

- Gesture input: preferable for resolving object references and indicating the scope of commands.

- Handwriting input: can be more efficient for numerical data and is preferable for note-taking and form-filling, but is difficult to recognize.

- Eye-gaze tracking: efficient in situations where the use of the hands is limited and where people cannot use their voice, but produces ambiguity.


Evaluation of Multimodal Systems

Multimodal systems are particularly challenging to evaluate. This section introduces the challenges and methodologies of evaluating multimodal systems. Depending on the goal an evaluation pursues, we can distinguish three broad categories of evaluations.

1. Adequacy evaluations determine the fitness of a system for a purpose: does it meet the requirements? How well? And at what cost? The requirements are mainly determined by user needs; therefore, user needs have to be identified.

2. Diagnostic evaluations obtain a profile of system performance with respect to the possible uses of a system. They require the specification of an appropriate test.

3. Performance evaluations measure system performance in specific areas.

Three basic components of a performance evaluation have to be defined:

a) Principle: what characteristic or quality are we interested in evaluating (accuracy, learning)?

b) Measure: through which specific system property do we report system performance for the chosen principle?

c) Method: how do we determine the appropriate value for a given measure and a given system?

Some criteria have emerged as standards for the performance evaluation of the different components of multimodal systems:

1. Interactive services and data input applications: task completion time and success rate, rate of unimodal versus multimodal interactions, complexity of interactions.

2. Eye-gaze tracking: tracking accuracy (percent deviation from the true position) and tracking success (ratio of time during which the feature is tracked versus lost); a minimal computation sketch follows below.
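As an illustration of the second criterion, here is a minimal way to compute both measures from a log of gaze samples. The log format and the field names are assumed for the example, and the accuracy figure is reported as an angular deviation, a common equivalent of the percent deviation mentioned above.

import math

def tracking_success(samples):
    # Ratio of time (here: of samples) in which the eye feature was tracked rather than lost.
    tracked = sum(1 for s in samples if s["tracked"])
    return tracked / len(samples)

def mean_deviation_degrees(samples):
    # Mean angular deviation between estimated and true gaze point, over tracked samples only.
    # Screen coordinates and viewing distance must use the same unit (here pixels).
    errors = [math.degrees(math.atan2(math.dist(s["estimated"], s["true"]), s["viewing_distance"]))
              for s in samples if s["tracked"]]
    return sum(errors) / len(errors) if errors else float("nan")

log = [
    {"tracked": True,  "estimated": (402, 298), "true": (400, 300), "viewing_distance": 2200},
    {"tracked": False, "estimated": None,       "true": (400, 300), "viewing_distance": 2200},
]
print(tracking_success(log), mean_deviation_degrees(log))   # 0.5 and roughly 0.07 degrees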


Eye-Gaze Tracking

With the term "eye-gaze tracking" we generally indicate a system that uses different techniques to measure both the gaze point and eye movements with respect to a measuring system. If the measuring system is head-mounted, we measure eye-in-head angles and we speak of "eye tracking". If the measuring system is table-mounted, as with table-mounted camera systems, we measure gaze angles and we speak of "gaze tracking". In this work we consider a generic eye-gaze system; we will therefore use the term "eye-gaze tracking" when referring to the system considered in our application.


History of Eye Tracking in HCI

The study of eye movements pre-dates the widespread use of computers by almost 100 years (for example, Javal, 1878/1879). Beyond mere visual observation, initial methods for tracking the location of eye fixations were quite invasive, involving direct mechanical contact with the cornea. Dodge and Cline (1901) developed the first precise, non-invasive eye tracking technique, using light reflected from the cornea. In the 1930s, Miles Tinker began to apply photographic techniques to study eye movements in reading. In 1947 Paul Fitts began using motion picture cameras to study the movements of pilots' eyes as they used cockpit controls and instruments to land an airplane; the Fitts et al. study represents the earliest application of eye tracking to what is now known as usability engineering. Around that time Hartridge and Thompson (1948) invented the first head-mounted eye tracker. Eye movement research and eye tracking flourished in the 1970s, with great advances in both eye tracking technology and the psychological theory linking eye tracking data to cognitive processes. Work in several human factors / usability laboratories (particularly those linked to military aviation) focused on solving the shortcomings of eye tracking technology and data analysis during this timeframe. Two joint military / industry teams (U.S. Air Force / Honeywell Corporation and U.S. Army / EG&G Corporation) developed remote eye tracking systems that dramatically reduced tracker obtrusiveness and its constraints on the participant. At this point, eye tracking had not yet been applied to the study of human-computer interaction. In the 1980s, as personal computers proliferated, researchers began to investigate how eye tracking could be applied to issues of human-computer interaction (such as searching for commands in computer menus). Early work in this area focused primarily on disabled users, and then again on military aviation, developing flight simulators that simulated an ultra-high-resolution display by providing high resolution wherever the observer was fixating and lower resolution in the periphery. More recently, eye tracking in human-computer interaction has shown wide growth, both as a means of studying the usability of computer interfaces and as a means of interacting with the computer.


Eye-gaze tracking methods

One of the most significant obstacles to employing eye tracking in usability studies is the choice between intrusive and non-intrusive eye tracking systems. Intrusive eye trackers require equipment mounted on the user's head; they constrain movement and are uncomfortable. The user has to wear a band or a helmet, or the system can be mounted on glasses. A better solution is represented by tracking systems that need no devices attached to the user and that guarantee more freedom of movement: non-intrusive eye trackers. Non-intrusive, or "remote", eye tracking systems are mounted in the space in front of the user (i.e., on the monitor or on the desktop). They use a camera to capture infrared light reflections from both the cornea and the retina. In this way the user's movements are not constrained, but tracking can be lost, and it then has to be re-acquired manually. Remote eye tracking systems operate with a camera and sometimes with an LED emitting infrared light, depending on the method used to predict the fixation point. Vision Key integrates an infrared camera on glasses. Other systems use low-cost cameras, such as common webcams, which are affordable for many users but provide less precise tracking.

The first remote eye tracking systems used multiple cameras, typically a double-camera setup. Morimoto describes an eye tracking system with a single camera and an accuracy of 3 degrees; more recently, several academic groups have built similar single-camera systems. The main additional difficulty in the single-camera setting is determining the distance of the user from the camera, since the triangulation possible in the multi-camera case cannot be carried out. The advantages of a single-camera system are of course its reduced cost and smaller size.


Remote Eye-Gaze Tracker

An example: Infrared eye-gaze tracking

Here we briefly describe a method used for remote eye tracking, as explained in "Remote Eye Tracking: State of the Art and Directions for Future Development" by Böhme, Meyer, Martinetz and Barth (COGAIN 2006).

Figure 2: infrared tracking system mounted on a monitor

The system consists of a high-resolution camera (1280x1024 pixels) and two infrared LEDs mounted on either side of the camera. The LEDs provide general illumination and generate reflections on the surface of the cornea. These corneal reflexes (CRs) are used to find the eye in the camera image and to determine the location of the centre of corneal curvature in space. The system is mounted below an LCD display. The software consists of two main components: the image processing algorithms, which determine the position of the CRs and of the pupils in the image, and the gaze estimation algorithm, which uses these data to compute the direction of gaze.


Figure 3: physical model of the eye

Gaze estimation relies on a physical model of the eye, which models the optical properties of the cornea (reflection and refraction), the locations of the pupil centre (PC) and of the centre of corneal curvature (CC), and the offset of the fovea (and hence of the line of sight) from the optical axis. Given the observed positions of the CRs and of the pupil centre in the camera image, there is only one possible position and orientation of the eyeball that could have given rise to these observations. The gaze estimation algorithm deduces this position and orientation, and then intersects the direction of gaze with the display plane to determine the location the user is fixating.
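The last step, intersecting the estimated line of sight with the display plane, can be written in a few lines. The following sketch is illustrative only and assumes the eye-model fitting has already produced an eye centre and a gaze direction in the monitor's coordinate frame; it is not taken from the cited system.

import numpy as np

def intersect_gaze_with_screen(eye_centre, gaze_dir, plane_point, plane_normal):
    # Return the 3D point where the gaze ray hits the display plane, or None if it never does.
    gaze_dir = gaze_dir / np.linalg.norm(gaze_dir)
    denom = float(np.dot(plane_normal, gaze_dir))
    if abs(denom) < 1e-9:          # gaze parallel to the screen plane
        return None
    t = float(np.dot(plane_normal, plane_point - eye_centre)) / denom
    if t < 0:                      # the screen lies behind the eye
        return None
    return eye_centre + t * gaze_dir

# Example: eye 600 mm in front of a screen lying in the z = 0 plane.
eye = np.array([0.0, 0.0, 600.0])
direction = np.array([0.05, -0.02, -1.0])   # slightly off the straight-ahead direction
print(intersect_gaze_with_screen(eye, direction,
                                 np.array([0.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])))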


Commercial Solutions

IKey and GazeTalk try to improve text typing, the former with an on-screen keyboard and the latter with a 4x3 button matrix. Tobii, ECKey and IKey require additional tools to perform selection and use low-cost cameras. Tobii was the first remote eye tracking system with high precision (0.5 - 1 degree) and good tolerance of user movement. In the EyeGaze System by LC Technologies Inc., image processing and computation of the fixation point are performed by software installed on the user's PC; the sampling rate is about 60 Hz. EyeGaze has to be used far from windows and with a limited number of infrared light sources. The Quick Glance Eye-tracking System by EyeTech Digital Systems is a device that emulates the mouse using eye movements. A webcam mounted under the monitor is focused on the user's eye, and two infrared LEDs are mounted at the sides of the monitor. Quick Glance determines the fixation point by analyzing the corneal reflections and the pupil centre; its sampling rate is about 30 samples per second. The ERICA system (Eyegaze Response Interface Computer Aid) by ERICA Inc. allows tracking and recording eye movements and pupil dilation. ERICA is a non-intrusive system: it uses a camera and an LED mounted under the monitor, and its sampling rate is 60 Hz.


Future Developments

The large number of systems proposed to deal with the technical challenges shows that no solution is completely satisfactory. The following points summarize these challenges:

- Monocular (single-camera) versus binocular (two-camera) approach. A monocular approach limits the difficulties associated with multiple cameras, which revolve around the computational complexity of handling several views and around controlling the pan-tilt cameras and the related hardware. More generally, limiting the number of cameras reduces both these difficulties and the cost; however, reducing the number of cameras also means reducing the quality of tracking.

- Automatic calibration. Current state-of-the-art techniques, mainly non-wearable eye tracking systems, typically require some form of user cooperation and calibration, and they generally require multiple, high-resolution cameras. The challenge is to make calibration automatic, or at least to reduce manual calibration as much as possible.

- Automatic and non-intrusive systems. The primary obstacle to integrating these techniques into everyday applications is that they have been either too invasive or too expensive for routine use. Recently, the invasiveness of eye-gaze and eye-fixation tracking has been significantly reduced by advances in the miniaturization of head-mounted video-based eye trackers. Remote video-based eye tracking techniques also minimize intrusiveness; however, they suffer from reduced accuracy with respect to head-mounted systems. The challenge is therefore to bring remote video-based eye tracking to accuracies similar to those of head-mounted video-based eye tracking.

- Cost reduction. An important obstacle to the general use of such techniques is their cost. Currently, a number of eye trackers are available on the market and their prices range from approximately 5,000 to 40,000 euros. Notably, the bulk of this cost is not due to the hardware, as the price of high-quality digital cameras has dropped precipitously over the last ten years; rather, the costs are associated with custom software implementations, sometimes integrated with specialized digital processors, to obtain high-speed performance.

This analysis clearly indicates that, in order to integrate eye tracking into everyday human-computer interfaces, widely available, reliable and high-speed eye-tracking algorithms that run on general-purpose computing hardware need to be developed. Towards this goal, investigating eye-tracking methods released as open-source packages may be an efficient way to reduce costs. In combination with remote video eye trackers and low-cost head-mounted eye-tracking systems, there is significant potential for eye tracking to be successfully incorporated into the next generation of multimodal human-computer interfaces. Another way to reduce cost is to investigate the use of web cameras or other low-cost cameras.


Voice

In this chapter we analyze vocal applications and their use. Vocal applications rely on Automatic Speech Recognition (ASR) and speech synthesis methods.


Automatic Speech Recognition

Automatic Speech Recognition (ASR) is a technology that allows a computer to identify the words that a person speaks into a microphone or telephone. The goal of ASR research is to allow a computer to recognize, in real time and with 100% accuracy, all words that are intelligibly spoken by any person, independently of vocabulary size, noise, speaker characteristics and accent, or channel conditions. Despite several decades of research in this area, accuracy greater than 90% is only attained when the task is constrained in some way. Depending on how the task is constrained, different levels of performance can be attained; for example, recognition of continuous digits over a microphone channel (small vocabulary, no noise) can exceed 99%. If the system is trained on an individual speaker's voice, much larger vocabularies are possible, although accuracy drops to somewhere between 90% and 95% for commercially available systems. For large-vocabulary speech recognition of different speakers over different channels, accuracy is no greater than 87%, and processing can take hundreds of times real time (from "Speech and Language Processing: An introduction to natural language processing, computational linguistics, and speech recognition", Daniel Jurafsky & James H. Martin).


History

It is said that the first speech recognition experience took place in a toy factory: Radio Rex was a toy dog able to recognize its name when called. In the 1940s the US government began research aimed at intercepting and automatically translating Soviet messages; research later continued with the DARPA-funded Speech Understanding Research (SUR) program at Carnegie Mellon University. The first speech recognition system was developed in 1952 at Bell Labs: it was an analog system that could recognize the digits from 0 to 9 with high accuracy (98%), and a few years later recognition of letters began. In the 1970s the SUR program achieved its first digital results with Harpy, which recognized complete phrases within a predefined set of grammatical structures. The first commercial systems had to deal with three main problems: computational complexity, recognition of generic speech, and recognition of natural speech. Two companies, SpeechWorks and Dragon Systems, were the first to bring speech recognition to market, developing software able to run with the computing power of ordinary computer systems. In 1997, Dragon Systems produced NaturallySpeaking, the first dictation software for continuous speech. In 2000 TellMe presented a voice-driven portal.


ASR Structure

The basic ASR structure is shown in the figure below:

Figure 4: ASR

At the front end, the signal is compressed into a data stream. An acoustic model and a pronunciation model make it possible to hypothesize the structure of a sentence. The pronunciation model and its dictionary determine the semantics and the syntax of the sentence to be recognized, while the acoustic model provides the mapping between the data stream and single words. Both models can be adapted to limit recognition errors, shaping the output on the signal.
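The text does not write out how the two models are combined, but the standard formulation behind such recognizers (see the Jurafsky & Martin reference cited in the previous section) is the following: given the observation stream O produced by the front end, the recognizer returns

    W* = argmax_W  P(O | W) · P(W)

where P(O | W) is supplied by the acoustic model and P(W) by the pronunciation model and its dictionary.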

Pronunciation Model

The pronunciation model is composed of templates for a given dictionary. Every speaker-dependent ASR system has to create an acoustic model by training the system to recognize the pronunciation of every single word. Speaker-independent systems instead use generic models, designed by combining templates from different users; some accuracy is lost because these models are not specific to any one speaker.


Continuity of Speech

The other main characteristic that an ASR system must have is continuity of speech, that is, the ability to recognize words spoken at a "natural rhythm". To improve accuracy we need information about grammatical rules.

Applications of Speech Recognition

Speech recognition has two main classes of applications:

- Command recognition
- Dictation

Command recognition covers applications in which the user interacts with a machine by supplying input instructions to obtain an action; dictation transforms voice into text.

Command recognition

Command recognition can be applied in many fields, though today its use remains rather limited, mainly to personal computing and telephony. The principal application field is Interactive Voice Response (IVR) systems, in which the user, through a phone, selects options from a voice menu and interacts with the computer phone system. IVR systems are now extended with Voice Portals, phone services through which users can access the Web to retrieve information such as weather forecasts or market quotations. Loquendo developed the Voice Portal "DimmiTutto" for Telecom Italia. The Center for Spoken Language Research at the University of Colorado developed Voice Portals that enable natural conversations between people and machines to accomplish specific tasks, such as accessing live airline, hotel and car rental availability and pricing.

Dictation

In dictation, a machine analyzes audio using speech recognition algorithms in an attempt to transcribe a document. Dictation applications of ASR require literal transcriptions of speech in order to train both speaker-independent and speaker-adapted acoustic models. Literal transcriptions may also be used to train stochastic language models that need to perform well on spontaneous speech. These applications also require building a dictionary that cannot be reused for other speakers, and a quiet environment ensuring that background noise does not contaminate the dictionary. Since human typing reaches 40 to 60 words per minute, computerized note-taking systems can be up to three times faster. IBM ViaVoice and Nuance Dragon NaturallySpeaking allow users to create documents or e-mail just by speaking, with 99% accuracy and up to 250 words per minute without training.


Speech Synthesis

Figure 5: Speech synthesis process

Speech synthesis is the artificial production of human speech. A speech synthesizer is a process that converts an input text, made of words or phrases, into a waveform using specific algorithms and predefined waveform blocks (a toy sketch of this idea is given at the end of this section). Speech synthesis systems differ in the speech units they store and in their encoding and synthesis methods.

The most important qualities of a speech synthesis system are naturalness and intelligibility. Naturalness describes how closely the output sounds like human speech, while intelligibility is the ease with which the output is understood. Speech synthesis usually tries to maximize both characteristics.

Speech synthesis applications are Text-To-Speech and Voice Response:

- A Text-To-Speech (TTS) system converts normal language text into speech.
- A Voice Response system mainly uses signal processing techniques and decodes input texts with a limited syntax and dictionary.
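The toy sketch announced above: a deliberately minimal concatenative synthesizer that maps each phoneme to a prerecorded waveform block and joins them. Real systems, including Loquendo's TTS, are far more elaborate; the lexicon, the unit inventory and the sampling rate here are invented for the example.

import numpy as np

SAMPLE_RATE = 16000                     # Hz (assumed)
UNIT_INVENTORY = {                      # phoneme -> prerecorded waveform block (fake 10 ms units)
    "h": np.zeros(160), "e": np.ones(160), "l": np.zeros(160), "o": np.ones(160),
}
LEXICON = {"hello": ["h", "e", "l", "l", "o"]}   # toy pronunciation dictionary

def synthesize(text):
    # Convert text to a waveform by concatenating the stored units of each word.
    units = []
    for word in text.lower().split():
        for phoneme in LEXICON.get(word, []):
            units.append(UNIT_INVENTORY[phoneme])
    return np.concatenate(units) if units else np.zeros(0)

waveform = synthesize("hello")
print(len(waveform) / SAMPLE_RATE, "seconds of audio")   # 0.05 seconds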


Vocal Platforms

Figure 6: vocal platform scenario

As said before, a vocal platform provides interaction between users and machines over the phone. Interactive Voice Response (IVR) systems allow the telephone caller to select options from a voice menu and interact with the computer phone system; they use DTMF signals and natural language speech recognition to interpret the caller's responses to the IVR prompts. These systems are used to create and manage services such as telephone banking, order placement, caller identification and airline ticket booking. They are usually placed at the front end of call centers to identify which service the caller wants and to extract numeric information. An automatic call center can answer phone calls and manage a conversation with the user, identifying the user, handling incoming calls, processing the user's answers and providing information based on them. Criticism of IVR systems is mostly due to poor design and a lack of appreciation of callers' needs.


Loquendo VoxNauta

Platform Overview

The VoxNauta Lite platform is a voice gateway that allows an application to be accessed by means of an audio device (e.g. a telephone set), using voice and audio as input and output formats.

Figure 7: Typical usage scenario

For simplicity, we imagine a situation where both the ASR and the TTS are present. The end user, by means of a normal telephone set, a cellular phone or a VoIP device, calls the telephone number corresponding to a vocal service that has been properly deployed and scheduled on the platform. The platform interacts on one side, through the telephone (or IP) network, with the end user (playing audio prompts, synthesizing speech and accepting vocal input), and on the other side, via HTTP, with the VoiceXML web application. The phone call placed by the user is managed by the system (first by the Front End, for example the telephone board, and then by the VoxNauta Lite modules), which executes the service logic by interacting with the external web servers and databases where the business logic and the information are located. The output supplied by the application (VoiceXML code) is then processed by the system and rendered to the user through the phone line.

The underlying paradigm is that a VoiceXML-based voice application is, technically speaking, nothing different from a "classic" web application, thereby allowing pervasive exploitation of preexisting skills in development tools, programming languages and technologies. Put simply, the web application generates VoiceXML code instead of HTML code, and instead of an HTML interpreter (a browser such as Internet Explorer) interpreting the code to present the content to the user appropriately, a VoiceXML Interpreter (that is, a VoiceXML browser) processes the VoiceXML code. The output is rendered to the user by the TTS engine (or by means of pre-recorded audio files). Instead of clicking a mouse button or typing characters on a keyboard, the user provides input by means of either DTMF tones or vocal commands.
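A minimal sketch of this paradigm is shown below: a web application that answers an HTTP request with a VoiceXML page instead of an HTML one. The page content, the URLs and the grammar file name are invented for the example and are not part of the VoxNauta documentation; any HTTP server and any VoiceXML 2.0 interpreter could play the two roles.

from http.server import BaseHTTPRequestHandler, HTTPServer

VXML_PAGE = """<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="main">
    <field name="command">
      <prompt>Say the name of the object you are looking at.</prompt>
      <grammar src="commands.grxml" type="application/srgs+xml"/>
      <filled>
        <submit next="http://localhost:8080/execute" namelist="command"/>
      </filled>
    </field>
  </form>
</vxml>"""

class VoiceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve the VoiceXML document to the platform's interpreter over HTTP.
        body = VXML_PAGE.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/voicexml+xml")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), VoiceHandler).serve_forever()

In this sketch the VoiceXML interpreter would fetch the document via HTTP, render the prompt through the TTS, and submit the recognized field back to the /execute URL, exactly mirroring the request/response cycle of a graphical web application.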

Platform architecture

The voice platform software functional architecture has been conceived in a modular form.

Figure 8: VoxNauta Lite Platform

The various modules are detailed in the following paragraphs.

A. Front End interfaces

The expression "Front End" refers to the third-party HW/SW module that interfaces on one side with the "external world" (typically the telephone or IP network) and on the other side with the core of the platform.


B. Interfaces directly supported by Loquendo

VoxNauta Lite 6.0 directly supports several kinds of front-end interfaces:

• Analog telephone (Loopstart), based on either NMS or Intel Dialogic technology. An appropriate telephone board (NMS AG2000, Intel Dialogic D/41JCT-LS or D/120JCT-LS), as well as the NMS or Intel Dialogic runtime software environment, is required.

• Digital telephone interface (EuroISDN), based on NMS technology. An appropriate telephone board (NMS AG4000 or AG4040), as well as the NMS runtime software environment, is required.

• Voice over IP, based on the commonly used SIP and RTP protocols. The Front End is provided by the PC Ethernet interface; neither dedicated boards nor third-party software are required.

The interoperability between the above-mentioned front-end interfaces and the VoxNauta Lite platform is provided by Loquendo's Audio Providers.

C. VoxNauta Lite modules

The VoxNauta Lite modules are:

• Audio Provider. The Audio Provider (AP) is the component that allows interoperability between the Speech Server (see next point) and the Front End, managing the data stream between the two entities. Moreover, the Audio Provider pilots the Front End, handling all call control (connection, disconnection, call transfer, and so on). VoxNauta Lite provides three different Audio Providers, each handling a particular type of Front End:

- NMS Audio Provider: handles an NMS-based Front End, for both digital and analog interfaces.

- Intel Audio Provider: handles an Intel Dialogic-based Front End.

- VoIP Audio Provider: handles a Voice over IP-based Front End.

All of Loquendo's Audio Providers are able to manage several input audio formats: a-law, µ-law, PCM, WAV and MP3.

• Speech Server (sometimes also called Voice Server): the "core" module that pilots the TTS and the ASR as well as the Audio Provider and the VoiceXML Interpreter, thus allowing runtime interaction between all these components. Graphically, this module offers a window for monitoring the status of the logical channels.

• VoiceXML Interpreter: this module processes the VoiceXML pages fetched from the application server and interacts with the Speech Server.

• Runtime Tools: a set of tools for operation and maintenance activities. The tools are:

- Service Scheduler: lets the user decide which vocal service has to be invoked whenever an incoming call is detected.

- Front End Monitor: allows monitoring the status of the physical lines from the front-end point of view.

- Control Bar: allows OA&M activities such as starting/stopping the platform, exporting logs and statistics, service provisioning and so on.

- Web Reporting: provides information about a session of calls.

- Service Cockpit: gives a set of information about the behavior of voice services from the recognition point of view. The Cockpit allows analyzing the recognition results in order to monitor, tune and optimize ASR operation.

- DAP Tester: designed for users who intend to implement the DAP protocol without using any of Loquendo's Audio Providers. The DAP Tester allows verifying the platform's functionality right after installation.

VoiceXML Interpreter

The VoiceXML Interpreter of the VoxNauta Lite platform enables end users to access web-based information and to perform transactions entirely by voice. The Interpreter is the browser core component that reads and processes VoiceXML "pages". Moreover, it gives service providers real-time control of multiple VoiceXML applications being accessed simultaneously by multiple users. The VoxNauta Lite VoiceXML interpreter adheres to the W3C specifications and is fully compliant with version 2.0 of the language (W3C Recommendation, 16 March 2004).


VoiceXML Interpreter interfaces

Figure 9: VXML interpreter

The VoiceXML Interpreter interacts on one side, via HTTP Requests and Responses, with the application server in order to fetch service resources like VoiceXML pages, and on the other side, by means of an internal proprietary protocol, with the Speech Server module.

VoiceXML

Just as HTML is the standard language for developing traditional web sites, VoiceXML is the open standard XML-based markup language explicitly dedicated to the development of internet-powered telephone applications. While HTML assumes a graphical web browser with display, keyboard and mouse, VoiceXML assumes a voice browser with audio output, audio input and keypad input (DTMF).

VoiceXML evolution

In the past, to integrate the telephony and Internet worlds, every company developed

its own proprietary mark-up language. This situation was a big barrier to the evolution of voice-

based services: high training costs, difficult technology integration and lack of flexibility.


For this reason AT&T, Motorola, IBM and Lucent Technologies started working together to develop a

standard language, combining the best features of the mark-up languages they had developed, and

called it VXML. Subsequently the name was changed to VoiceXML and a VoiceXML Forum was

organized to further develop the language as an open standard under the supervision of the World Wide

Web Consortium (W3C). The VoiceXML 1.0 specification was officially released in May 2000 by the

W3C. VoiceXML 2.0 reached the W3C "Recommendation" level on March 16, 2004.

(http://www.w3.org/TR/voicexml20)

The VoiceXML language allows the server side of the application to structure the dialog

as any mixture of static pages and dynamically generated ones. Even the

grammars may be located on a Web Server and may be generated dynamically during the course of

the dialogue interaction.

Main features and strengths of VoiceXML

The most interesting characteristic of VoiceXML is that it allows creating vocal applications with

the same model used for developing “classic” Internet applications, separating the contents and the

business logic from the presentation layer. The development process of a vocal application is very

similar to that of a traditional HTML-based web application, with extensive reuse

of skills and resources. Moreover, integrating HTML and VoiceXML applications is simply

a matter of presentation layer, which makes it possible to quickly add vocal access to pre-existing applications

and services, without any need to replicate data and business logic.

In summary, the main strengths of the VoiceXML language are the following:

• Standard language: VoiceXML is a widely adopted W3C standard. Therefore VoiceXML-based

applications can be deployed on any VoiceXML-compliant vocal platform.

• Web integration: VoiceXML brings the full power of web development and content

delivery to voice response applications. Developers do not need to deal with low-level

programming and resource management, and so they are free to concentrate their efforts on the

business logic.

• Skill reuse: VoiceXML, being based on the XML standard, is similar to other recent

markup languages, so any developer specialized in Web technologies will find it quite familiar.

The main features of the VoiceXML language are the following:

• Advanced menu and form-filling dialog components, with the possibility to structure the

voice application using sub-dialogs as reusable components.

• Several pieces of information related to the telephone call (session variables) available to the

application.

• The possibility for the user to record messages that can be managed by the voice application.

• Support for different kinds of call transfer functionality.

• JavaScript support, which allows the voice application to make elaborate decisions and to

manipulate data on the client side (for example, to validate user input).

• Multi-level handling of errors, help messages and other application-specific events.

• Support for Speech Synthesis Markup (SSML) elements, which fine-tune the user experience for TTS

messages, and full support for pre-recorded audio prompts.

• Support for SRGS grammars (both vocal and DTMF), the XML-based standard grammar

formalism.

• The possibility to switch among different Text-To-Speech voices and languages.

• Extended multi-level grammars, which allow advanced features such as mixed-initiative dialog

interaction and different levels of active hyperlinks.

• A rich set of built-in grammars (boolean, date, time, digits, phone, currency, etc.). The availability

of these special grammars depends on the language.

• Complete support for DTMF input in VoiceXML dialogs.

Here are two short examples of VoiceXML. The first is the venerable "Hello World":

<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.w3.org/2001/vxml
                          http://www.w3.org/TR/voicexml20/vxml.xsd"
      version="2.0">
  <form>
    <block>Hello World!</block>
  </form>
</vxml>

The top-level element is <vxml>, which is mainly a container for dialogs. There are two types of dialogs: forms and menus. Forms present information and gather input; menus offer choices of what to do next. This example has a single form, which contains a block that synthesizes and presents "Hello World!" to the user. Since the form does not specify a successor dialog, the conversation ends. Our second example asks the user for a choice of drink and then submits it to a server script:

<?xml version="1.0" encoding="UTF-8"?>


<vxml xmlns="http://www.w3.org/2001/vxml"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.w3.org/2001/vxml
                          http://www.w3.org/TR/voicexml20/vxml.xsd"
      version="2.0">
  <form>
    <field name="drink">
      <prompt>Would you like coffee, tea, milk, or nothing?</prompt>
      <grammar src="drink.grxml" type="application/srgs+xml"/>
    </field>
    <block>
      <submit next="http://www.drink.example.com/drink2.asp"/>
    </block>
  </form>

</vxml>

A field is an input field. The user must provide a value for the field before proceeding to the next element

in the form. A sample interaction is:

C (computer): Would you like coffee, tea, milk, or nothing?

H (human): Orange juice.

C: I did not understand what you said. (a platform-specific default message)

C: Would you like coffee, tea, milk, or nothing?

H: Tea

C: (continues in document drink2.asp)

GRAMMARS

A grammar identifies different words or phrases that a user might say and (optionally) specifies how

to interpret a valid expression in terms of values for input variables. Grammars can range from a

simple list of possible words to a complex set of possible phrases. Each field in a form can have a

grammar that specifies the valid user responses for that field. An entire form can have a grammar that

specifies how to fill multiple input variables from a single user utterance. Each choice in a menu has

a grammar that specifies the user input that can select the choice. A grammar contains one or more

rules that specify matching input, usually with one rule specified as the root rule of the grammar. If

the grammar has a root rule, then you can use the grammar in your VoiceXML application without

naming which rule to start from. In some grammars, there is exactly 1 top-level rule that can be used

as a starting point for that grammar. For example, a simple yes/no grammar might consist of a single

rule, allowing "yes", "no", and various synonyms, such as "yep", "nope", or "no way". In this

grammar, the root rule would, of course, be that one starting point.


A larger or more complex grammar, however, may have several rules that can be used as a starting

point. For example, consider a grammar for recognizing marine animals. It could have 1 rule that

recognizes all marine animals. That rule might itself be composed of rules that recognize smaller sets

of animals, such as one for marine mammals, another for types of coral, and a third for species of

fish. This marine animals grammar might allow you to specify one of these subrules as the starting

point, instead of always using the complete grammar. To use the subrules, you'd have to ask for the

rule by name. This grammar might still identify one rule as the root rule, so that you could ask for the

grammar without specifying a rule by name. A VoiceXML application specifies a grammar to use

with the VoiceXML <grammar> tag. It can use built-in grammars and application-defined grammars.

A built-in grammar is one that is built directly into the VoiceXML interpreter. You can use these

grammars without any coding effort. An application grammar, on the other hand, is one that a

developer defines from scratch. An application grammar may be a grammar you've defined

specifically for a particular application or it may be part of a general library of grammars you have

access to for reuse in multiple VoiceXML applications. Application grammars can either be inline or

external. The entire definition of an inline grammar appears directly in the <grammar> element of the

VoiceXML document; the definition of an external grammar appears in a separate file. If the

<grammar> element contains content, that content is the definition of an inline grammar. If the

element does not contain content, it must have a value for either the src attribute or the expr attribute.

In this case, depending on the value of that attribute, the reference is to either a built-in grammar or

an external grammar file.


Commercial Solutions

The most important companies involved in speech recognition and synthesis include SpeechWorks, Nuance,

ScanSoft, IBM and Loquendo.

SPEECHWORKS (www.speechworks.com) specializes in Voice Portal design and in applications

for service providers that use speech synthesis, recognition and verification. SpeechWorks was

among the first companies to bring to market products based on the VXML standard and to study,

alongside network technologies, embedded technologies supplying voice recognition in cars, mobile devices and

cell phones.

NUANCE (www.nuance.com) deals with voice recognition and synthesis, voice authentication and

voice browsing products (mostly over the phone).

SCANSOFT (www.scansoft.com) deals with software for imaging and voice recognition.

Dragon NaturallySpeaking by ScanSoft is now among the best applications for dictation and

command and control.

PHILIPS. The Telephony and Voice Control Business Unit of Philips is dedicated to speech

recognition (www.speech.be.philips.com); for more than forty years it has been active in research on

dictation, TTS and voice recognition. It deals with in-car voice control, mobile devices and

electronics.

IBM Voice System is an IBM business unit dedicated to speech recognition; it produces ViaVoice

and WebSphere. It has been involved in research and development for forty years, with more than 150

software patents.

LOQUENDO (www.loquendo.com), a Telecom Italia company based in Turin, was founded in 2001; it

proposes solutions for developing vocal services, call center automation, voice in Intranets and

Unified Messaging.

WAYCOM. Waycom (www.waycom.it), an Italian company, offers internetworking, messaging and

Voice Portal services.


Screen Readers

A screen reader is a software application that attempts to identify and interpret what is being

displayed on the screen. This interpretation may then be represented to the user with text-to-speech,

sound icons, or a Braille output. Screen readers are a form of assistive technology (AT) potentially

useful to people who are blind, visually impaired, or learning disabled, often in combination with

other AT such as screen magnifiers. A person's choice of screen reader is dictated by many factors,

including platform, cost (even upgrading a screen reader can cost hundreds of U.S. dollars), and the

role of organizations like charities, schools, and employers. Screen reader choice is contentious:

differing priorities and strong preferences are common. Operating system and application designers

have attempted to address these problems by providing ways for screen readers to access the

display contents without having to maintain an off-screen model. These involve the provision of

alternative and accessible representations of what is being displayed on the screen accessed through

an API.

Screen readers can query the operating system or application for what is currently being displayed

and receive updates when the display changes. For example, a screen reader can be told that the

current focus is on a button and the button caption to be communicated to the user. This approach is

considerably easier for screen readers, but fails when applications do not comply with the

accessibility API: for example, Microsoft Word does not comply with the MSAA API, so screen

readers must find another way to access its contents. One approach is to use available operating

system messages and application object models to supplement accessibility APIs: the Thunder

screen reader operates without an off-screen model in this way. Screen readers can be assumed to

be able to access all display content that is not intrinsically inaccessible. Web browsers, word

processors and email programs are just some of the applications used successfully by screen reader

users. However, using a screen reader is, according to some users, considerably more difficult than

using a GUI and many applications have specific problems resulting from the nature of the

application (e.g., animations in Macromedia Flash) or failure to comply with accessibility standards

for the platform (e.g., Microsoft Word and Active Accessibility).


Assistive Technology

Assistive technology (AT) devices are designed to improve accessibility for individuals who have

physical or cognitive difficulties, impairments, and disabilities. An AT device sits between a user

and an application and allows the user to interact more successfully with that application. The

devices translate the application data into a format that the user can access and interact with and, in

turn, render the user's input into a format that the application can interpret. To function effectively,

AT devices must be compatible with the computer operating system and programs on the particular

computer being used.

Accessibility

Accessibility refers to the ability of a user, despite disabilities or impairments, to use a resource. In

the computer industry, accessibility standards and guidelines help to ensure that any computer user,

despite impairment, can experience at least the minimum functionality that the computer resource

(hardware and software) is capable of providing. For Internet-based applications, accessibility

means that all users can perceive, understand, navigate, interact with, and contribute to the Web.

Microsoft recognizes its responsibility to develop technology that is accessible and usable to

everyone, including those with disabilities, and is committed to educating developers on how to

create accessible technology. Perhaps most importantly, Microsoft is committed to providing all

users with an equivalent software experience. Note that equivalent does not mean exactly the same.

The goal is an equivalent measure of usefulness and access to features that can achieve the desired

result. Therefore, when designing applications, we should consider the needs of all potential users,

including older people who use large fonts, users with low vision who use magnifiers, users with

color blindness, users who are sight impaired, users with mobility impairments who use specific

input devices, users who are deaf or hard of hearing, users with cognitive and reading disabilities,

and so on.

There are three basic rules to make software accessible:

1. Make the user interface work with assistive technology.

2. Make the user interface work with operating system settings. Examples in the Microsoft

Windows operating system include high contrast mode and large fonts.


3. Design the user interface so that, by default, it works for as many users as possible. For example,

use reasonably sized fonts and enough color contrast between text and backgrounds.

Use of DOM and MSAA

Fortunately, developers do not need to understand each AT device. AT devices use a standard

object model, such as the Document Object Model (DOM), or a set of interfaces, such as Microsoft

Active Accessibility (MSAA), to communicate with a client application running on Windows or

with an application running on the Web (see Figure 10). A wide variety of AT devices are built on

this common base.

Figure 10: MSAA & AT

MSAA is a set of COM interfaces and application program interfaces (APIs) that provides a reliable

way to expose and collect information about Microsoft Windows-based UI elements and Web

content. AT devices can then use this information to communicate the UI in alternative formats,


such as voice or Braille, and voice command and control applications can remotely manipulate the

interface. MSAA is a COM-based technology designed to improve the way accessibility aids work

with applications running on Microsoft Windows. Accessibility aids may include screen readers for

the visually impaired, visual indicators or captions for people with hearing loss, software to

compensate for motion disabilities, etc. Active Accessibility provides dynamic-link libraries that

are incorporated into the operating system as well as a COM interface and application programming

elements that provide reliable methods for exposing information about user interface elements.

Microsoft Active Accessibility was originally made available in April 1997 in the form of a Re-

Distributable Kit (RDK) that included updated operating system components for Microsoft

Windows 95. Since Windows 98 and Windows NT 4.0 Service Pack 4, Active Accessibility has

been built into all versions of the Windows platform, and has received periodic upgrades over time

to increase its robustness and versatility. A new managed-code accessibility API, known as

Microsoft UI Automation (UIA), is being introduced in addition to Active Accessibility.


MSAA interfaces and methods

IAccessible

The IAccessible interface is the heart of Microsoft Active Accessibility. Applications implement

this Component Object Model (COM) interface to represent their custom user interface elements,

which can include their client area as accessible objects, if necessary. Applications call IAccessible

methods and properties to obtain information about an application's user interface and data. The

IAccessible interface methods and properties are organized into the following groups:

Navigation and Hierarchy: used to move through the Active Accessibility object hierarchy and to

dynamically discover what UI elements and objects are available, through spatial navigation (up,

down, left, right) and logical navigation (next, previous, parent, first child, last child):

accNavigate ( long, VARIANT, VARIANT* );
get_accChild ( VARIANT, IDispatch** );
get_accChildCount ( long* );
get_accParent ( IDispatch** );

Descriptive Properties and Methods:

accDoDefaultAction ( VARIANT );
get_accDefaultAction ( VARIANT, BSTR* );
get_accDescription ( VARIANT, BSTR* );
get_accHelp ( VARIANT, BSTR* );
get_accHelpTopic ( BSTR*, VARIANT, long* );
get_accKeyboardShortcut ( VARIANT, BSTR* );
get_accName ( VARIANT, BSTR* );
get_accRole ( VARIANT, VARIANT* );
get_accState ( VARIANT, VARIANT* );
get_accValue ( VARIANT, BSTR* );

Selection and Focus:

accSelect ( long, VARIANT );
get_accFocus ( VARIANT* );
get_accSelection ( VARIANT* );

Spatial Mapping:

accLocation ( long*, long*, long*, long*, VARIANT );
accHitTest ( long, long, VARIANT* );

VARIANT Structure

Most of the Active Accessibility functions and the IAccessible properties and methods take a

VARIANT structure as a parameter. Essentially, the VARIANT structure is a container for a large

union that carries many types of data. The value in the first member of the structure, vt, describes

which of the union members is valid. Although the VARIANT structure supports many different

data types, Active Accessibility uses only the following types:

vt value        Corresponding value member name
VT_I4           lVal
VT_DISPATCH     pdispVal
VT_BSTR         bstrVal
VT_EMPTY        none

In a VARIANT structure, the vt member defines which member contains valid data. The structure

has to be initialized by calling the VariantInit COM function before use, and its memory has to be

freed by calling VariantClear.
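The following minimal sketch (not taken from the thesis code) illustrates this initialize/use/clear pattern around a single IAccessible property call; PrintElementName is a hypothetical helper and the printed output is purely illustrative.

#include <windows.h>
#include <oleacc.h>
#include <cstdio>

// Minimal sketch: prepare a VARIANT child ID, call one MSAA property,
// then release everything that was allocated.
void PrintElementName(IAccessible* pacc)
{
    VARIANT varChild;
    VariantInit(&varChild);          // always initialize before use
    varChild.vt = VT_I4;             // Active Accessibility child IDs use VT_I4
    varChild.lVal = CHILDID_SELF;    // refer to the object itself

    BSTR bstrName = NULL;
    if (SUCCEEDED(pacc->get_accName(varChild, &bstrName)) && bstrName)
    {
        wprintf(L"name: %ls\n", bstrName);
        SysFreeString(bstrName);     // BSTRs returned by MSAA must be freed
    }

    VariantClear(&varChild);         // release any memory owned by the VARIANT
}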

IDispatch Interface

The IDispatch interface was initially designed to support Automation. It provides a late-binding

mechanism to access and retrieve information about an object's methods and properties. Previously,

server developers had to implement both the IDispatch and IAccessible interfaces for their

accessible objects; that is, they had to provide a dual interface. With Active Accessibility 2.0,

servers can return E_NOTIMPL from IDispatch methods and Active Accessibility will implement

the IAccessible interface for them. GetTypeInfoCount returns the number of type descriptions for

the object. For objects that support IDispatch, the type information count is always one.

GetTypeInfo retrieves a description of the object's programmable interface. GetIDsOfNames maps

the name of a method or property to a DISPID, which is later used to invoke the method or

property. Invoke calls one of the object's methods, or gets or sets one of its properties.


IUnknown

The IUnknown interface lets applications get pointers to other interfaces on a given object through

the QueryInterface method, and manage the existence of the object through the IUnknown::AddRef

and IUnknown::Release methods. All other COM interfaces are inherited, directly or indirectly,

from IUnknown. Therefore, the three methods in IUnknown are the first entries in the VTable for

every interface. IUnknown methods switch between interfaces on an object, add references, and

release objects:

IUnknown Methods Description

QueryInterface Returns pointers to supported interfaces

AddRef Increments reference count

Release Decrements reference count
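As a small illustration of this pattern, the sketch below (hypothetical, not part of the prototype) obtains an IAccessible pointer from a generic IUnknown pointer and releases it afterwards.

#include <windows.h>
#include <oleacc.h>

// Minimal sketch: switch interfaces with QueryInterface and balance the
// reference count with Release. pUnknown is assumed to be a valid pointer
// obtained elsewhere.
void UseAsAccessible(IUnknown* pUnknown)
{
    IAccessible* pacc = NULL;
    if (SUCCEEDED(pUnknown->QueryInterface(IID_IAccessible, (void**)&pacc)))
    {
        // ... call IAccessible methods through pacc ...
        pacc->Release();   // release the reference added by QueryInterface
    }
}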

Implementing Active Accessibility

To collect specific information about a UI element, clients must first retrieve an IAccessible for the

element. To retrieve an element's IAccessible object, clients can use one of the

AccessibleObjectFromX APIs:

• AccessibleObjectFromPoint

• AccessibleObjectFromWindow

• AccessibleObjectFromEvent

The functions AccessibleObjectFromX return a pointer to an IAccessible interface representing an

object.

AccessibleObjectFromWindow returns an interface representing a window, a client area, or some

non-client area control, depending on the second parameter, which is the object ID. The object ID

value in this parameter can be one of the standard object identifier constants or a custom object ID.

The child ID to be used with the pointer to IAccessible that is returned by

AccessibleObjectFromWindow is always equal to CHILDID_SELF:

if(S_OK == (hr = AccessibleObjectFromWindow(hWndMainWindow,

OBJID_WINDOW,IID_IAccessible,(void**)&paccMainWindow)))

{

•••

}


To obtain information about a UI element, you need the IAccessible interface/child ID pair that

corresponds to the UI element:

hr = pacc->get_accRole(*pvarChild, &varRetVal);

if (hr == S_OK && varRetVal.vt == VT_I4)

{

GetRoleText(varRetVal.lVal, lpszRole, cchRole);

}

else

lstrcpyn(lpszRole, "unknown role", cchRole);

The first step in enumerating children is to verify whether the accessible UI element supports the

IEnumVARIANT interface:

hr = paccParent -> QueryInterface

(IID_IEnumVARIANT, (PVOID*) & pEnum);

if(hr == S_OK && pEnum) pEnum -> Reset();

The next step is to obtain the number of accessible children owned by the parent by calling the

get_accChildCount function. Then, for each child, find the child ID and verify whether this child

supports the IAccessible interface by getting a pointer to the IDispatch interface and querying it for

an IAccessible interface:

if(paccChild)

{

VariantInit(&varChild);

varChild.vt = VT_I4;

varChild.lVal = CHILDID_SELF;

}

else

{

paccChild = paccParent;

paccChild->AddRef();

}

Then we can call other methods to retrieve the object's properties:

pacc->get_accRole(*pvarChild, &varRetVal);

pacc->get_accState(*pvarChild,&varRetVal);

pacc->get_accName(*pvarChild,&bsName);


Commercial Solutions

Increasingly, screen readers are being bundled with operating system distributions. Recent versions

of Microsoft Windows come with the rather basic Narrator, while Apple Mac OS X includes

VoiceOver, a more feature-rich screen reader. The console-based Oralux Linux distribution ships with

three screen-reading environments: Emacspeak, Yasr and Speakup. The open source GNOME

desktop environment long included Gnopernicus and now includes Orca. There are also open

source screen readers, such as the Linux Screen Reader for GNOME and NonVisual Desktop

Access for Windows. The most widely used screen readers are separate commercial products:

JAWS from Freedom Scientific, Window-Eyes from GW Micro, and Hal from Dolphin Computer

Access.


PART II


Integrating eye-gaze tracking and voice recognition

In this thesis, we describe multimodal interactions between an eye-gaze tracking system and a vocal

platform, with speech recognizer and synthesizer.

Until now, we have examined the background that led to the design and implementation of this

thesis work, limiting ourselves to a general, theoretical description of all the aspects

that influenced the design decisions.

The next step is the practical implementation of the whole architecture.

The goal of this thesis is to demonstrate how the systems described in the previous chapters (eye-gaze

tracker and vocal platform) and the concept of multimodality can interact in an

architecture that exploits their benefits, suggesting an economical and efficient solution.

At the same time, this thesis intends to create a prototype that overcomes the shortcomings of both

systems, improving them in terms of precision and performance.

Eye-gaze tracker precision is limited to a specific observation area; for this reason eye-gaze

tracking systems do not offer the opportunity to refer exactly to a specific object on the

screen. The voice has the needed precision, but it does not allow selecting, at a given instant, a

collection of objects. In this multimodal application, we want to combine the facilities of the

vocal platform and of eye-gaze tracking to define the query area, obtaining a prototype that

defines a portion of the screen with a high accuracy level and allows high

reliability in terms of search and recognition.

Nowadays there are many situations in which people need to interact with a computer without being

able to use their hands, and therefore the traditional pointing devices, either because of

disabilities or because they need to perform several tasks at the same time.

As a case study, we considered the simplest situation a user can be involved in. We suppose that the

user is in front of the monitor and wants to select a specific object among various ones, and that he

wants to perform an action on it. We also suppose that the user interacts with

the machine through a microphone, an eye-pointing system and a vocal platform, as shown in the

figure below:


Figure 11: scenario

- The webcam of the eye-gaze tracking system is situated under the monitor, in front of the user.

- The microphone allows the user to establish a vocal connection with the speech recognizer.

- A local network connects the vocal platform, our prototype and the eye tracker.

The eye-gaze tracker defines a section of the screen, called the gazing block, built around a fixation point.

Through the eye-gaze tracking system the user can select several objects on the

screen, and through the vocal system he can issue a command. The prototype reacts by executing the

object's default action, and so on, producing a sequence of events and actions.

Requirement analysis

Now we will analyze the characteristics that an application must have to fulfill usability and quality

requirements as follows:

1. Usability requirements

They represent what the user needs to interact with the system using visual and voice

communication tools:

1.1. Select available objects on the fixation area

The eye pointing system determines the gazing block, an area that contains several objects. As we

said, this system cannot supply high precision at the object level, but it accurately defines the

screen area that provides the basis for the vocal platform.

1.2. Accuracy of the correspondence objects/commands


The user needs an error-free application. Since we assumed that the result provided by the

vocal platform is perfect, in testing we analyzed only the errors due to the eye-gaze tracking system.

1.3. Distinction of ambiguous cases.

In many situations, users need to select objects on the screen that have the same name but

belong to different contexts. Users need the chance to refer clearly to a specific object, and the

application has to deal correctly with ambiguous cases. In other terms, the prototype has to perform

additional manipulation on those repeated objects that create confusion in the communication

between user, vocal platform and eye tracker.

2. Quality requirements

Users need to have an application that is:

2.1. Portable, so that it can be integrated with other operating systems

2.2. Able to be integrated with any vocal and eye-gaze tracking system

2.3. Simple in its communication interface

2.4. Efficient in terms of response time

2.5. Open to future modifications and improvements

Proposed Solution

The application developed in this thesis makes it possible to:

1. Select available objects on the screen in the area defined by eye-gaze tracking system.

2. Define for every object its properties, such as name, role, position (with respect to the fixation point),

state and default action.

3. Keep in memory all the selected objects through an appropriate data structure.

4. Allow distinctions between objects with the same name that belong to different windows.

5. Retrieve what the user said.

6. Translate the pronounced word in terms of name, role, state or action, i.e. to produce a

command.

7. Search for the command in the structure and find a match between the word and an object name.

8. Execute the selected command.

The project includes several modules, so that we can analyze every element separately.

This allowed us to make the prototype robust and open to further modifications and improvements.


Figure 12: System Overview (1)

Therefore, in the next paragraphs we define and comment on the following project units:

A. Eye-gaze tracker simulator

B. Screen reader

C. Vocal Platform

D. Connection voice-eye tracking

E. Execution of the desired command

Figure 13: System Overview (2)


A. Eye tracker simulator

Figure 14: eye tracker simulator

We want to simulate a real eye-gaze tracker working on a limited area, with a range defined by

the operator in the configuration step. The simulation is achieved through the mouse and is based on

a fixation point, i.e. the point where the mouse is located at a given moment. This project unit

calculates and analyses mouse position and events, providing as output a set of points called the

gazing block. In this way we can define mouse events and distinguish between fixations and movements.

Events are determined by the following conditions:

- After fixing the point on the screen where the cursor is set, if this position does not

change during a given time interval, then a fixation is achieved.

- In case of fixation, the simulator defines the gazing block and calls the screen reader unit.

- At every mouse movement this unit computes the new position and repeats the algorithm.

B. Screen Reader


Figure 15: screen reader module

From the eye tracker simulator it receives as input the gazing block, containing a set of points,

which defines the working area for this unit. In turn, it supplies a set of objects to the

vocal module.

This unit has a double role:

- Traditional screen reader

- object localizer

Operating as a traditional screen reader, this unit has to enumerate objects and define for each of

them name, role, state, default action, position on the screen and location relative to the mouse position

and to other objects. Elements obtained in this way are inserted in a data structure that appropriately

supports object insertion and search.

Acting as an object localizer, this unit has to use the input provided by the eye tracker

module and discard objects that are not situated in the tracking area, are

nameless or are “invisible”. The set of objects represents the VoiceXML grammar for the

vocal unit.

C. Vocal Platform

Figure 16: vocal platform

It receives as input, from the Screen Reader, a collection of objects, i.e. the grammar, and supplies

as output the word pronounced by the user. We decided to install the vocal platform on another

computer, creating a local Web Server and a LAN (Local Area Network) between that computer and

our system. The platform interacts on one side, through the LAN, with the user and on the other

side, via HTTP, with a web application on the Web Server. The web application generates

VoiceXML code instead of HTML code, and instead of an HTML interpreter (a browser like

Internet Explorer) presenting the content to the user, a VoiceXML Interpreter (that is, a VoiceXML

browser) processes the VoiceXML code.

The VoiceXML application interprets and processes spoken words on the basis of the grammar, it

distinguishes between ambiguous cases and it provides the recognition result to the next unit of

the prototype.

D. Link eye tracking - voice

This is the main problem of the whole project. The algorithm needs an appropriate “communication

channel” between all the units of the system.

On one hand, the vocal unit receives the user's spoken words over the LAN; on the other hand it provides

HTTP messages, containing the speech recognition result, to the prototype. So we have to

interpret the message correctly and then send it to the prototype's main unit.

This module was divided into two parts: the first is the HTTP Message Interpreter and the

second is the Receiver. We used a Socket structure in which the HTTP Message

Interpreter holds the role of the Client and the Receiver holds the role of the Server. The client

interprets the HTTP method and sends the result through the socket; the server receives it and performs

object search and command execution.

Figure 17: Socket Structure


E. Command Execution

It receives as input the result of the recognition process through the Socket described above, and

provides as output the desired command.

In this unit, a search is performed through the data structure, matching the word received from

the previous step against every element of the structure.

As a result of the matching, the default action of the chosen object is performed.

We can imagine a further step by issuing another command to the system and repeating the entire

process.


Unit Development

Every module, except those concerning the vocal platform, was implemented using the C++

programming language.

The main reason is the Microsoft Application Programming Interfaces (APIs) involved in the

Screen Reader unit.

The library used by Accessibility’s APIs is oleacc.lib, included in the Microsoft SDK version 2.0.

The library used by the Socket Module is ws2_32.lib.

We developed the software using the framework Visual Studio 2003 .NET version 1.1.4322.

Loquendo provided VoxNauta Platform 6.0, with ASR and TTS for Italian language.

VXML files were developed under the standard VXML 2.0.

This application is open source and open to future modifications.

The software was divided into correlated modules in order to easily control complexity and to obtain

more localized alterations in the event of changes.

Eye-Gaze Tracker Unit

This unit simulates an eye-gaze tracking system.

As we said before, the simulation can adapt to eye-gaze tracking systems that work by fixation

points or by area selection. The working area is called the gazing block and it is defined through a

parameter given by the operator during system configuration. In our simulation we

considered a gazing block range equal to 80 pixels. In the first part of this unit we define the

behavior of an eye tracker in terms of mouse events. The algorithm we used emulates the eye tracker

through the mouse, considering fixation points. By “fixation point” we mean the

point on which the user's gaze is fixed. In terms of mouse events, this concept can be translated

by considering mouse movements with reference to an initial fixation point.

If the point changes, there is no fixation.

If the point does not change during a given time interval, there is a fixation.

Subsequently, this unit provides the eye tracker mask considering the fixation point.

The mask defines the limits of the eye tracker working area, and can be represented by a matrix of

pixels.


Figure 18: Gazing Block

Function ETPoints

RECT ETPoints (POINT pt, int epsilon)

{

ETRect.left=pt.x-epsilon;

ETRect.bottom= pt.y+epsilon;

ETRect.right= pt.x+epsilon;

ETRect.top= pt.y-epsilon;

return ETRect;

}

This function returns a RECT structure, ETRect, which represents the eye tracker's mask.

The central point of the matrix is the fixation point pt described previously, and the matrix limits are

defined by a parameter called “epsilon”.
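To show how these pieces could fit together, the following sketch (illustrative only, not the thesis implementation) polls the cursor, declares a fixation when the position has not changed for a given dwell time, and then builds the mask by calling the ETPoints function shown above. The 500 ms dwell time and 50 ms polling period are assumptions; the 80-pixel epsilon matches the value used in our simulation.

#include <windows.h>

const int FIXATION_MS = 500;   // assumed dwell time for a fixation
const int EPSILON     = 80;    // gazing block range used in the simulation

// Minimal sketch: wait until the cursor stays still for FIXATION_MS
// milliseconds, then return the eye tracker mask around that point.
RECT WaitForFixation()
{
    POINT last;
    GetCursorPos(&last);
    DWORD stableSince = GetTickCount();

    for (;;)
    {
        Sleep(50);                               // polling period (illustrative)
        POINT now;
        GetCursorPos(&now);

        if (now.x != last.x || now.y != last.y)
        {
            last = now;                          // the point changed: no fixation yet
            stableSince = GetTickCount();
        }
        else if (GetTickCount() - stableSince >= (DWORD)FIXATION_MS)
        {
            return ETPoints(last, EPSILON);      // fixation achieved: build the mask
        }
    }
}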

This unit also deals with object and window location. The object or window area is defined

through a rectangle of pixels. If there is an intersection between the eye tracker's gazing block

and the object's rectangle, then we can consider the object and insert it into the object tree.

Function CheckPointsW

bool CheckPointsW(RECT rc)
{
    RECT rdest;
    /* intersect the gazing block with the window rectangle */
    IntersectRect(&rdest, &ETRect, &rc);
    if(!IsRectEmpty(&rdest))
    {
        return true;
    }
    return false;
}


Objects Position

Figure 19: Visual Attributes

The CheckPoints function checks the object position and defines its exact localization with respect to

the mouse position. It is necessary for ambiguity control. The object position is defined by assigning

visual attributes to the objects: “up”, “down”, “left”, “right”.

Function CheckPoints

bool CheckPoints(RECT rc, POINT pt, node_t *tvi)
{
    RECT rdest;
    /* intersect the gazing block with the object's rectangle */
    IntersectRect(&rdest, &ETRect, &rc);
    if(!IsRectEmpty(&rdest))
    {
        /* horizontal test: the fixation point lies beyond the right
           (or the left) edge of the object's rectangle */
        if(pt.x > rc.left && pt.x > rc.right)
            strcpy(tvi->posizione, "UP");
        else if(pt.x < rc.left && pt.x < rc.right)
            strcpy(tvi->posizione, "DOWN");
        /* vertical test: the fixation point lies below the bottom
           (or above the top) edge of the object's rectangle */
        if(pt.y > rc.top && pt.y > rc.bottom)
            strcpy(tvi->posizione, "RIGHT");
        else if(pt.y < rc.top && pt.y < rc.bottom)
            strcpy(tvi->posizione, "LEFT");
        return true;
    }
    else
        return false;
}

Screen Reader Unit

This unit represents a Screen Reader and, since it receives the output provided by the Eye Tracker

Simulator Unit, holds the role of a traditional Screen Reader and an Object Locator.


As a traditional Screen Reader, it attempts to identify and interpret what is being displayed on the

screen. As an Object Locator it retrieves the objects in the gazing block located by the eye tracker.

We used a Binary Search Tree (BST) as the structure to contain the retrieved objects because it proved to

provide an efficient solution for the algorithm we used.

A BST is a binary tree data structure, which has the following properties:

- Each node has a value.

- A total order is defined on these values.

- The left sub tree of a node contains only values less than the node's value.

- The right sub tree of a node contains only values greater than or equal to the node's value.

The major advantage of binary search trees is that the related sorting and search algorithms can be

very efficient, since these algorithms have a logarithmic cost.
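For reference, a minimal sketch of what the node structure might look like is shown below; the field names and types are assumptions inferred from the code excerpts in this chapter, not the exact thesis definition.

// Hypothetical sketch of the BST node used by the Screen Reader Unit.
struct node_t {
    char   name[256];          // accessible name (get_accName)
    char   window_name[256];   // owning window, used to resolve ambiguity
    char   posizione[16];      // visual attribute: "UP", "DOWN", "LEFT", "RIGHT"
    long   role;               // accessible role (get_accRole)
    long   state;              // accessible state (get_accState)
    node_t *left;              // BST branches, ordered by name as in insert_BST
    node_t *right;
    node_t *head;              // head of the list of same-name objects (ambiguity)
    node_t *list;              // last element inserted in that list
};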

The Screen Reader Unit begins by enumerating the active Desktop windows, calling the function:

EnumChildWindows (0, WinProcEnum, 0)

This function enumerates the child windows that belong to the specified parent window by passing

the handle to each child window, in turn, to WinProcEnum(), which is an application-defined callback

function. Since we wanted to enumerate all Desktop windows, we passed a NULL handle to the

function. EnumChildWindows continues until the last child window is enumerated or the callback

function returns FALSE. The WinProcEnum function inspects every window, checking whether it belongs to

the gazing block (by calling the CheckPointsW() function) and using Active Accessibility APIs to

retrieve objects:

BOOL CALLBACK WinProcEnum (HWND hwnd, LPARAM lParam)
{
    ...
    GetWindowRect (hwnd, &RC);
    if (CheckPointsW (RC))
    {
        ...
        AccessibleObjectFromWindow (hwnd, OBJID_WINDOW, IID_IAccessible, (void**) &pacc);
        ...
    }
    return TRUE;   /* continue enumerating windows */
}

In order to use the Microsoft Active Accessibility technology (MSAA) to find information, the

application must support Active Accessibility and its elements must support the IAccessible

interface. The general algorithm is that by using the window handle of an application’s main

window, we get an IAccessible pointer. This IAccessible pointer is the root of a tree that contains

nodes for all of the IAccessible elements in this window.


The AccessibleObjectFromWindow () function gets an IAccessible pointer to the window.

Now we want to know how to traverse the tree of IAccessible objects and how to get information

about each object. We used a recursive function GetChilds () with parameters IAccessible and

VARIANT. These two data types identify the node we are working with.

node_t* GetChilds (IAccessible* pacc, VARIANT* pvarT, node_t* hParent)

This function recurses only if the object is a full object, that is, if the fields of the VARIANT structure

are not empty:

if (pvarT->vt == VT_I4 && pvarT->lVal == CHILDID_SELF)

It means that we have a full object that may have children, so we call the method:

pacc->get_accChildCount(&cchildren);

to find out how many children there are, and we allocate memory to store information about each of them:

rgvarChildren = (VARIANT*) malloc(sizeof(VARIANT)*cchildren);

To retrieve the child ID of each of the children we execute:

AccessibleChildren(pacc, 0L, cchildren, rgvarChildren, &cObtained);

The function

AccessibleChildren(IAccessible* paccContainer, LONG iChildStart, LONG cChildren,

VARIANT* rgvarChildren, LONG* pcObtained)

retrieves the child ID for each child in paccContainer object and fills rgvarChildren with it.

iChildStart is the zero-based start index, cChildren is the number of children to retrieve, and

pcObtained is the actual number of children retrieved. We then check whether the function succeeded and

whether the number of children we expected to retrieve was actually retrieved.

The next step is to loop through each of these children, attempt to get a full IAccessible object

representation of them and then process them:

for(ichild=1;ichild <= cchildren;ichild++)

{

VariantInit(p_varT);

p_varT = &rgvarChildren[ichild-1];

To attempt to get a full object, we need to check what the value of p_varT is:

if (p_varT->vt == VT_I4)
    pacc->get_accChild(*p_varT, &pdisp);

get_accChild method retrieves an IDispatch interface pointer for the specified child.

We get an IDispatch object representing our object and then convert the IDispatch object to an

IAccessible object such as:


pdisp->QueryInterface(IID_IAccessible, (void**)&paccChild);

The QueryInterface method returns a pointer to a specified interface on an object.

If we obtained an object, we recurse on its children:

if (paccChild)

{

VariantInit(p_varT);

p_varT->vt = VT_I4;

p_varT->lVal = 0;

hParent = GetChilds(paccChild, p_varT, hParent);
paccChild->Release();

}

The next step is to extract information from the objects we retrieved.

The GetChilds() function does this by calling the BuildTree() function.

node_t* BuildTree(IAccessible* pacc, VARIANT* pv, node_t* hParent)

The BuildTree() function retrieves information about an object by calling Active Accessibility

methods.

1. get_accName method retrieves the name of the specified object:

pacc->get_accName(*pvar, &bszT);

2. get_accState method retrieves the current state of the specified object:

pacc->get_accState(*pvar, &varT);

3. get_accRole method retrieves information that describes the role of the specified object:

pacc->get_accRole(*pvar, &varT);

4. accLocation method retrieves the specified object's current screen location:

pacc->accLocation(&left,&top,&width,&height,*pvar);

Where:

- left and top are the addresses of variables that receive the x and y coordinates of the upper-left

boundary of the object's location, in physical screen coordinates.

- width and height are the addresses of the variables that receive the object's width and height, in

pixels.

This part explains what we meant by the term “object locator”.

We used these variables to fill a RECT structure and, to verify whether the object is contained

in the gazing block, we called the CheckPoints() function:

rc.left=(long)left;

rc.top=(long)top;

rc.right=(long)(left+width);

rc.bottom=(long)(top+height);

CheckPoints(rc);


If CheckPoints returns TRUE then we can insert the object into the BST structure:

hParent=insert_BST(hParent,t_obj);

The insert_BST() function inserts the retrieved objects into the BST, sorting them by name.

If the object name is already present, this function compares the window names:

if they are the same, it means that the object was inserted before.

Otherwise, it means that there is another object belonging to a different window, and the function inserts it in a

list. If the value of the object's name is greater or lower than the value currently being processed,

insertion is performed on the right or on the left branch of the BST, respectively:

if (root == NULL)

{

r->left = r->right = NULL;

return (r);

}

if (compar_name(r->name, root->name) == 0)

{

if (compar_name (root->window_name,r-> window_name)==0)

/* same name and same window: do not insert a duplicate */

return root;

else

{

/* same name but different window: insertion in a list */

elem=insert_list(&root->head,r);

root->list=elem;

return (root);

}

}

else if (compar_name (root->name, r->name) > 0 )

{

/* insert on the right branch */

root->right = insert_BST (root->right, r);

}

else

{

/* insert on the left branch */

root->left = insert_BST (root->left, r);

}

Insertion in a list allows us to deal with objects that have the same name but belong to different

windows; this situation is called “ambiguity”. We showed in the “Eye tracker simulator Unit” section how

we solved this problem.
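The lookup later performed by the Command Execution unit is the mirror image of this insertion. The following is a minimal sketch (not the thesis code) of such a search; it assumes the node_t structure and a strcmp-like compar_name, and follows the same branch convention as insert_BST above.

// Minimal sketch: recursive lookup of a recognized word in the object BST.
node_t* search_BST(node_t* root, char* word)
{
    if (root == NULL)
        return NULL;                            // no object with this name in the gazing block

    if (compar_name(root->name, word) == 0)
        return root;                            // match: its default action can be executed
    else if (compar_name(root->name, word) > 0)
        return search_BST(root->right, word);   // same side chosen by insert_BST
    else
        return search_BST(root->left, word);
}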

Link eye tracking – voice

To correctly allow communications between the vocal platform and our application, we created a

Local Area Network (LAN).


The platform interacts on one side, through the LAN, with the user, and on the other side with both

the VoiceXML application and our system.

The vocal platform provides HTTP messages; the HTTP Message Interpreter receives and interprets

them, then sends the result to the Receiver.

We used a Socket structure in which the HTTP Message Interpreter is the Socket client and the

Receiver is the Socket server.

HTTP Message Interpreter

It receives the HTTP message sent by the VoiceXML application through the POST method.

In our project the HTTP Message Interpreter holds the role of the Client in a Socket structure.

First, it creates a socket to connect to the Server:

ConnectSocket = socket(ptr->ai_family, ptr->ai_socktype, ptr->ai_protocol);

Then it connects to the Server by calling:

connect( ConnectSocket, ptr->ai_addr, (int)ptr->ai_addrlen);

Then it decodes the HTTP message by calling the Decode() function and sends the decoded result

to the Server:

send( ConnectSocket, data, (int)strlen(data), 0 );

Finally, it closes the connection:

closesocket(ConnectSocket);
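Putting the excerpts above together, a consolidated client-side sketch could look as follows. It is illustrative only: the address and port are placeholders and the data string is assumed to have already been produced by Decode().

#include <winsock2.h>
#include <ws2tcpip.h>
#include <cstring>
#pragma comment(lib, "ws2_32.lib")

// Minimal sketch of the HTTP Message Interpreter acting as Socket client.
int SendRecognitionResult(const char* data)
{
    WSADATA wsa;
    if (WSAStartup(MAKEWORD(2, 2), &wsa) != 0)
        return 1;

    addrinfo hints = {}, *result = NULL;
    hints.ai_family   = AF_INET;
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_protocol = IPPROTO_TCP;

    if (getaddrinfo("127.0.0.1", "27015", &hints, &result) != 0)   // placeholder address/port
    {
        WSACleanup();
        return 1;
    }

    SOCKET ConnectSocket = socket(result->ai_family, result->ai_socktype, result->ai_protocol);
    if (ConnectSocket == INVALID_SOCKET ||
        connect(ConnectSocket, result->ai_addr, (int)result->ai_addrlen) == SOCKET_ERROR)
    {
        if (ConnectSocket != INVALID_SOCKET)
            closesocket(ConnectSocket);
        freeaddrinfo(result);
        WSACleanup();
        return 1;
    }
    freeaddrinfo(result);

    send(ConnectSocket, data, (int)strlen(data), 0);   // forward the decoded word to the Receiver
    closesocket(ConnectSocket);
    WSACleanup();
    return 0;
}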

Receiver

It receives the result of the HTTP message through the Socket. It holds the role of the Server in the

structure we used.

It creates its own socket:

ListenSocket=socket(result->ai_family, result->ai_socktype,result->ai_protocol);

Then it binds the socket to the local address:

bind( ListenSocket, result->ai_addr, (int)result->ai_addrlen);

Then this unit listens on the socket to accept incoming connections:

listen(ListenSocket, SOMAXCONN);

It accepts the connection by calling:

ClientSocket = accept(ListenSocket, NULL, NULL);

Then the Receiver reads the message from the socket:


recv(ClientSocket, recvbuf, recvbuflen, 0);

where the recvbuf variable contains the voice recognition result.

Once the Receiver has obtained the recognition result, it calls the Execution Unit to perform the action.
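For completeness, the Receiver side can be sketched in a similar consolidated form; again this is illustrative, with a placeholder port and buffer size, and the call into the Execution Unit is only indicated by a comment.

#include <winsock2.h>
#include <ws2tcpip.h>
#include <cstdio>
#pragma comment(lib, "ws2_32.lib")

// Minimal sketch of the Receiver acting as Socket server
// (error handling omitted for brevity).
int RunReceiver()
{
    WSADATA wsa;
    WSAStartup(MAKEWORD(2, 2), &wsa);

    addrinfo hints = {}, *result = NULL;
    hints.ai_family   = AF_INET;
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_protocol = IPPROTO_TCP;
    hints.ai_flags    = AI_PASSIVE;
    getaddrinfo(NULL, "27015", &hints, &result);              // placeholder port

    SOCKET ListenSocket = socket(result->ai_family, result->ai_socktype, result->ai_protocol);
    bind(ListenSocket, result->ai_addr, (int)result->ai_addrlen);
    freeaddrinfo(result);
    listen(ListenSocket, SOMAXCONN);

    SOCKET ClientSocket = accept(ListenSocket, NULL, NULL);   // wait for the Interpreter

    char recvbuf[512] = {0};
    int n = recv(ClientSocket, recvbuf, sizeof(recvbuf) - 1, 0);
    if (n > 0)
    {
        recvbuf[n] = '\0';
        printf("recognition result: %s\n", recvbuf);
        // here the Receiver would call the Execution Unit with recvbuf
    }

    closesocket(ClientSocket);
    closesocket(ListenSocket);
    WSACleanup();
    return 0;
}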

VoiceXML Unit

It is composed of the Vocal Platform and the VoiceXML application. In order to correctly interpret

uttered words, the VXML application needs an appropriate grammar, which is the set of admissible

words.

We used a dynamic grammar, hosted on the local web server:

<?xml version="1.0" encoding="UTF-8" ?>

<grammar version="1.0" xml:lang="it-it"

xmlns="http://www.w3.org/2001/06/grammar"

xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

xsi:schemaLocation="http://www.w3.org/TR/speech-grammar/grammar.xsd"

root="main">

<rule id="main">

<one-of>

<item>Nuovo Documento di testo.txt - Blocco note alto</item>

<item>Nuovo Documento di testo (2).txt - Blocco note basso</item>

<item>Modifica basso</item>

<item>Modifica alto</item>

<item>LAST VERSION basso</item>

<item>Formato basso</item>

<item>Formato alto</item>

<item>Desktop alto</item>

</one-of>

</rule>

</grammar>

Every <item></item> tag contains the objects retrieved by the Screen Reader Unit; they appear

with Name and visual attributes.

The VoiceXML application and grammar

<?xml version="1.0"?>

<vxml version="2.0">

<var name="chosenObject" expr="'void'"/>

<form id="form_object">

<field name="object" >

<grammar src="http://127.0.0.1/Tesi/gram.grxml" type="application/srgs+xml"/>

<filled>

Hai detto: <value expr="object$.utterance"/>

<assign name="chosenObject" expr="object$.utterance"/>

<goto next="#form_submit"/>

</filled>

</field>


<nomatch>

<prompt>non ho capito

</prompt>

</nomatch>

</form>

<form id="form_submit">

<block>

<submit next="http://127.0.0.1/cgi-bin/client.exe" method="post"

namelist="chosenObject"/>

</block>

</form>

</vxml>

The first two lines show the VXML version.

The <field> tag refers to the VXML dialog that deals with the grammar:

if the user speaks and the uttered words match the grammar, then the application provides a vocal

output echoing what the user said. If there is no match, the user is asked to repeat the word.

The recognition result is sent through the HTTP POST method to the HTTPMsgInterpreter Unit.


Testing


Test Description

The scenario considered in our tests is shown in the diagram below:

Figure 20 : the test

From left to right we can notice the following units:

a real eye-gaze system, the application considered in this thesis and a speech recognition system.

The signal coming from the ideal eye tracking system is “dirtied” by Gaussian noise with zero

mean and standard deviation σ. This allows us to represent possible errors of the eye tracking

system in its working area. Errors are due in part to the calibration of the whole system and in part

to imperfections in the user's input, i.e. a fixation that is not perfectly steady on the desired point.

The prototype proposed in this thesis receives signals from both the real eye tracking system and the vocal system.

The application also has a configuration parameter used to define the gazing blocks, and the “alexpirex

system” offers the opportunity to distinguish and solve ambiguous cases in object recognition, as

we explained in the former chapter. The vocal platform receives and processes the words pronounced

by users, based on the grammar supplied by the prototype. It provides as output the result of the

speech synthesis and recognition, in the form of a vocal signal. In our tests, we assumed perfect

speech recognition.


Test Goals

The tests aim to evaluate the performance of the algorithm we used.

That is, on the one hand we have to evaluate the recognition of the input provided by the vocal

platform, combining the criteria for querying and locating the objects suggested by our

algorithm. On the other hand, we wanted to give an estimate of the processing time of the

prototype, both in the step of building the structure that keeps in memory the objects detected on the

screen through the eye pointing system and in the step of searching and matching the input coming

from the vocal platform. By taking different measurements of the uncertainty due to the visual

system and of the extent of the gazing blocks, we expected to obtain a high recognition

percentage for small values of uncertainty and a wide recognition area.

As the errors introduced by the pointing system increase, instead, we expected a decrease in the

recognition percentage.


Test 1: Context

In this first paragraph, the first group of tests is presented and analyzed.

The goal is to determine the object recognition percentage in the fixation area specified by the eye

tracking system, varying some parameters and using Gaussian noise to perturb the input signal.

1. Gaussian Noise

A Gaussian distribution was used to model the real eye-tracker errors.

A Gaussian distribution is fully specified by its mean µ and standard deviation σ.

In terms of eye tracking, the mean value of the interpolated point corresponds to the estimated

position, and the confidence in the estimate is expressed by the standard deviation σ. Hence, it is possible

to compare the actual performance with the estimated confidence for a qualitative evaluation of the

eye tracker.

2. Varied Values

The parameters we considered in this first phase of testing are the uncertainty of the visual

tracking system (which we called σ, referring to the Gaussian distribution) and the gazing block

width, that is, the working area of the “alexpirex system”.

We want to define how, by varying the eye tracker uncertainty, we can find an optimal working area,

represented by the gazing block width and the fixation point.

This optimal value minimizes the approximation introduced by the Gaussian noise.

3. Estimated position and Gaussian noise

We selected ten time instants corresponding to ten different fixation points, which we call P_T(i). The set of points P_T(i) represents, in screen coordinates, ten different objects that the user wants to select at a given moment i. Every point of this set is the centre of the ideal eye tracker's gazing block. In the test we also considered a set of words, Word(i), in which every element represents an object indicated by the user through voice.

Since we refer to a real pointing system, subject, as said before, to errors modelled as Gaussian noise, every point P_T(i) is turned into a set of points P_SIGMA_T(i), obtained by adding to the fixation point samples of a random Gaussian distribution with zero mean and standard deviation σ:

P_SIGMA_T(i) = P_T(i) + N(0, σ)

In fact, P_T(i) represents the mean of the Gaussian distribution, i.e. the point corresponding to the estimated position.
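As a sketch of how this perturbation can be reproduced in the simulator, the following C++ fragment draws the perturbed samples with the standard library's normal distribution. The function name perturb and the use of std::mt19937 are our own choices for illustration, not taken from the prototype.

#include <random>

struct Point { double x, y; };

// One sample of P_SIGMA_T(i): the ideal fixation point P_T(i) plus Gaussian
// noise with zero mean and standard deviation sigma on each axis.
Point perturb(const Point& ideal, double sigma, std::mt19937& rng) {
    std::normal_distribution<double> noise(0.0, sigma);
    return Point{ ideal.x + noise(rng), ideal.y + noise(rng) };
}

int main() {
    std::mt19937 rng(42);                       // fixed seed for repeatable tests
    Point p_t{512, 384};                        // an ideal fixation point P_T(i)
    Point p_sigma_t = perturb(p_t, 16.0, rng);  // sigma = 16 pixels
    (void)p_sigma_t;
    return 0;
}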

The scenario can be briefly described as follows:

Figure 21

4. Test values

In our tests, we considered five different values of uncertainty: 0, 8, 16, 32, and 64. These values are expressed in the screen unit of measurement, i.e. the pixel, and range from zero (no uncertainty) up to 32 (the size of an icon), with intermediate values (8 and 16) and a high value, 64, representing a stronger de-correlation between the estimated point P_T(i) and the perturbed point P_SIGMA_T(i).

We chose gazing blocks of width 1, 10, 20, 30, 50, and 80. Again expressed in pixels, these values cover the variation of the localization area from one pixel (very small area) up to 80 pixels (large area), with intermediate values.

Uncertainty σ (pixels): 0, 8, 16, 32, 64
Gazing block width e (pixels): 1, 10, 20, 30, 50, 80
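The structure of the sweep over these combinations can be sketched as a pair of nested loops; runTrial is only a placeholder standing for one run over the ten fixation points, not a function of the prototype.

#include <vector>

// Placeholder for one run: perturb the ten fixation points with the given
// sigma, build gazing blocks of width e, query the object structure and
// return how many of the ten objects were recognized.
double runTrial(double sigma, double blockWidth) {
    (void)sigma; (void)blockWidth;
    return 0.0;   // stub: the real trial is performed by the prototype
}

int main() {
    const std::vector<double> sigmas = {0, 8, 16, 32, 64};       // uncertainty (pixels)
    const std::vector<double> widths = {1, 10, 20, 30, 50, 80};  // block width (pixels)

    for (double s : sigmas)
        for (double e : widths)
            runTrial(s, e);   // one (s, e) cell of Table 3, averaged over five tests
    return 0;
}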


5. “Ambiguous” and “safe” recognition

The objects considered in the tests were chosen from the computer's desktop, so that our algorithm could manage the recognition and distinction of objects with the same name but belonging to different windows. In this dissertation we use the terms “ambiguous recognition” or “ambiguity” when an object with the spoken name occurs in more than one window; we speak instead of “safe recognition” when exactly one object matches both the spoken word and the content of the gazing block. We chose a set of ten elements: five objects with “safe” recognition and five objects considered “ambiguous”.
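One possible way to express this classification, counting how many objects in the gazing block carry the spoken name, is sketched below; the type and function names are ours and the window names are hypothetical.

#include <string>
#include <vector>

struct DesktopObject {
    std::string name;     // object name as spoken by the user
    std::string window;   // window the object belongs to
};

enum class Match { None, Safe, Ambiguous };

// Classify the recognition of `spoken` against the objects found in the
// gazing block: no match, exactly one match ("safe"), or more than one
// match ("ambiguous").
Match classify(const std::string& spoken,
               const std::vector<DesktopObject>& inGazingBlock) {
    int hits = 0;
    for (const auto& obj : inGazingBlock)
        if (obj.name == spoken) ++hits;
    if (hits == 0) return Match::None;
    return hits == 1 ? Match::Safe : Match::Ambiguous;
}

int main() {
    // "File" appears in two windows (window names are hypothetical).
    std::vector<DesktopObject> block = {{"File", "Explorer"}, {"File", "Editor"}};
    return classify("File", block) == Match::Ambiguous ? 0 : 1;
}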

Table 1: ambiguous and safe objects in our tests

Ambiguous recognition    Safe recognition
File                     Test1.txt
Tesi                     Test2.txt
MultimodalSystems        Test3.txt
TcpView                  Test4.txt
SysteracXPTools          Test5.txt


Test 1: Results

We now describe and comment on the test results.

A. Average recognition trend, varying gazing block area and eye-tracker uncertainty

The graphs in figures 22 and 23 describe the average trend of the recognition percentage for ten different fixation points over five different tests, varying the pointing system's uncertainty and the gazing block area.

Figure 22 shows the exact recognition percentage, without counting ambiguous object recognitions. As can be seen, for low values of uncertainty and a wide gazing block area we obtain a high recognition percentage. When the error introduced by the Gaussian distribution rises, recognition decreases even with a limited gazing block area: the critical situation is at the point (s=64, e=10). The best situation, instead, is at the point (s=16, e=80). This means that our algorithm works well with an eye tracker that has a low error percentage and produces wide gazing blocks.

Figure 22: Safe recognition, average trend

In figure 23 the average recognition trend is shown including the ambiguous values as well: we added the ambiguous cases to the “safe recognition” values, because ambiguity does not force objects to be discarded from the selection but only implies additional processing so that they can be correctly indicated by the user.


Compared with the previous values, we notice that with low uncertainty the non-recognized cases were due to object ambiguity. The critical situation, instead, occurs with high uncertainty and a limited gazing block area: the gazing block width was not able to compensate for the effects of the noise, because the samples of the random Gaussian distribution did not fall in the neighbourhood of the considered object. From these two figures we can deduce that the useful gazing block area grows, but is ultimately limited by ambiguity.

Figure 23

Table 2: best and critical recognition situations

                      Uncertainty    Gazing block width
Best situation        16             80
Critical situation    64             10


Table 3 shows the test results:

s, e           Recognized and ambiguous (average)    Recognized only (average)
s=0,  e=1      10                                    10
s=0,  e=10     10                                    10
s=0,  e=20     10                                    9
s=0,  e=30     10                                    8
s=0,  e=50     10                                    5
s=0,  e=80     10                                    5
s=8,  e=1      9.8                                   9.8
s=8,  e=10     10                                    9.4
s=8,  e=20     10                                    8.8
s=8,  e=30     10                                    8
s=8,  e=50     10                                    5.4
s=8,  e=80     10                                    5
s=16, e=1      9                                     9
s=16, e=10     10                                    10
s=16, e=20     10                                    9.2
s=16, e=30     10                                    8
s=16, e=50     10                                    5.6
s=16, e=80     10                                    5
s=32, e=1      7.2                                   7.2
s=32, e=10     8.6                                   8.2
s=32, e=20     9.6                                   8.8
s=32, e=30     9.6                                   8.4
s=32, e=50     10                                    6.2
s=32, e=80     10                                    5.2
s=64, e=1      3.2                                   3.2
s=64, e=10     4.8                                   4.6
s=64, e=20     6.6                                   6
s=64, e=30     7.4                                   6.2
s=64, e=50     8.2                                   5.6
s=64, e=80     9.4                                   5.6

Table 3: test results (s = uncertainty, e = gazing block width)

B. Average recognition trend, varying gazing block area

For a given value of induced noise, we want to see how recognition changes as the gazing block area varies. In figure 24 we considered an ideal eye tracker, without noise, i.e. Gaussian noise with standard deviation equal to zero. Here too we drew curves for the “safe” recognized objects only and for the sum of the “safe” and “ambiguous” recognition values. The curve lies between two asymptotes, which represent respectively the maximum and the minimum recognition value.

Since we assumed an ideal eye tracker, the maximum recognition value reaches the upper horizontal asymptote. The curve of the sum of “safe” and “ambiguous” values coincides with the upper asymptote, because the Gaussian noise did not influence recognition and we obtained the maximum admissible value. The “safe” recognition curve, instead, tends to the lower asymptote because of ambiguity. Recognition never falls below the lower asymptote, which supports the soundness of our algorithm.

Figure 24: average recognition trend varying gazing block area

In the second graph, instead, we introduced a high value of uncertainty. Here the recognition percentage drops considerably for wide gazing block areas, because of ambiguity. The optimal gazing block area corresponds to the maximum of the curve lying between the two asymptotes.


Figure 25

Table 4 summarizes the graph values:

Uncertainty         Recognized    Recognized    Recognized and         Recognized and
                    max value     min value     ambiguous max value    ambiguous min value
No uncertainty      10            5             10                     10
High uncertainty    9             5.4           10                     7.2

Table 4

Table 5 shows the values obtained.

No uncertainty:

Gazing block width    Recognized and ambiguous    Recognized
0                     10                          10
1                     10                          10
10                    10                          10
20                    10                          9
30                    10                          8
50                    10                          5
80                    10                          5

High uncertainty:

Gazing block width    Recognized and ambiguous    Recognized
0                     7.2                         6.8
1                     7.2                         6.8
10                    8.6                         7.8
20                    9.6                         9
30                    9.6                         8.6
50                    10                          6.6
80                    10                          5.4

Table 5


C. Average recognition trend, varying eye-tracker uncertainty

The graph shown in figure 26 was built fixing the gazing block width at its minimum value. Starting from a high-precision eye tracker, we progressively increased the error introduced by the Gaussian noise. Even with this very small gazing block, the “safe” recognition percentage decreases strongly for high values of Gaussian noise, because the interference introduced by the noise is considerable.

Figure 26

Figure 27 shows how wide gazing blocks were able to compensate for the noise effects.


Figure 27

Table 6 summarizes the aspects we considered.

Gazing block width = 0:

Uncertainty    Recognized and ambiguous    Recognized only
0              10                          10
8              10                          9.8
16             8.6                         9
32             6.8                         7.2
64             3.6                         3.2

Gazing block width = 80:

Uncertainty    Recognized and ambiguous    Recognized only
0              5                           10
8              5                           10
16             5                           10
32             5.4                         10
64             5.2                         9.4

Table 6


Test 2

In the second test, we examined and estimated the processing time of the algorithm. Since the duration of the whole process depends on the user's vocal capabilities, we estimated insertion and search time only. As said before, we used a BST (binary search tree) data structure to hold the retrieved objects, because it provides a logarithmic cost for the sorting and search operations. Insertion into a BST preserves the ordering of the tree's branches, and its cost depends on how many objects the user wants to select with the eye tracker. Since we considered an environment in which the number of objects does not grow strongly, the use of the BST structure kept processing time limited; a minimal sketch of such a structure is given below. First, we considered the processing time for building the object structure.
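The following C++ fragment is only an illustration of the idea, not the prototype's code: it uses std::map, which is typically implemented as a balanced binary search tree, so both insertion and lookup by object name cost O(log n), and it times the insertion step in the same spirit as our measurements. The object names are taken from Table 1; the positions are invented for the example.

#include <chrono>
#include <iostream>
#include <map>
#include <string>

struct Point { double x, y; };

int main() {
    // Objects detected inside the gazing block, keyed by name so that the
    // word returned by the vocal platform can be matched in O(log n).
    std::map<std::string, Point> objects;

    auto start = std::chrono::steady_clock::now();
    objects.emplace("Test1.txt", Point{120, 340});   // illustrative positions
    objects.emplace("Tesi",      Point{400, 260});
    objects.emplace("TcpView",   Point{640, 180});
    auto elapsed = std::chrono::duration<double>(
                       std::chrono::steady_clock::now() - start).count();
    std::cout << "insertion time: " << elapsed << " s\n";

    // Search: match the spoken word against the structure's content.
    auto it = objects.find("Tesi");
    if (it != objects.end())
        std::cout << "selected object at (" << it->second.x << ", "
                  << it->second.y << ")\n";
    return 0;
}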

Figure 28 shows the insertion processing time (in seconds) measured in a single test with ten different points. Insertion time ranges from a minimum of 0.015 seconds to a maximum of 0.406 seconds, which is an excellent result.

Figure 28: Insertion time


Point    Gazing block width
         e=1      e=10     e=20     e=30     e=50     e=80
1        0.062    0.031    0.015    0.031    0.031    0.031
2        0.406    0.203    0.187    0.187    0.250    0.188
3        0.188    0.188    0.172    0.203    0.203    0.187
4        0.187    0.187    0.187    0.171    0.187    0.187
5        0.203    0.187    0.188    0.204    0.187    0.172
6        0.188    0.187    0.188    0.188    0.172    0.188
7        0.187    0.187    0.188    0.188    0.188    0.172
8        0.171    0.187    0.188    0.172    0.172    0.219
9        0.172    0.188    0.187    0.188    0.188    0.203
10       0.187    0.203    0.187    0.187    0.187    0.172

Table 7: Insertion time (seconds) for the ten test points

As regards search time, the measured values are always zero, consistent with the hypothesis of a logarithmic cost.


Test Results

1. Quality and usability requirements fulfillment

1. The application can be integrated with other operating systems. Although we developed this prototype on Microsoft Windows XP, it can easily be ported to other operating systems, such as Linux.

2. The application is portable to any vocal and eye-gaze tracking system. Any type of eye-gaze tracker can benefit from this application, since the eye tracker simulator can be adapted to the device the operator intends to use.

3. Reliability. The functionalities offered by this application correspond to the above-mentioned requirements. Reliability is influenced by the eye-gaze tracker, which has to be robust and reliable, but above all by the number of objects the user wants to select, that is, by the number of objects contained in the gazing block. The test results show that better performance is obtained with a robust eye tracker working on a limited gazing block.

4. The application has a simple communication interface. The interface this application uses is the computer monitor, easy to use and to understand.

5. The application is open to future modifications and improvements. The prototype was written to be intelligible, to guarantee correctness and reusability, and to be easily modified in the future. To keep the software comprehensible, we divided it into units that are easy to modify and improve.

2. Performance requirements fulfillment

2. The application is efficient in terms of response time. As the tests showed, response time depends on how many objects the user wants to select using the eye tracker. Thanks to an efficient data structure, insertion time is very low and the measured search time is zero.

3. Selection of the objects available in the fixation area. Since this is achieved through eye-tracking systems, it depends on the tracking system's accuracy.

4. Distinction of ambiguous cases. Ambiguity has been resolved correctly by using visual attributes such as “up”, “down”, “left”, and “right” (a sketch of this resolution step is given after this list).
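As a purely illustrative sketch of this resolution step, the logic could look as follows. The type and function names are ours, and we assume the spoken attribute is interpreted relative to the fixation point, with screen y coordinates growing downwards.

#include <string>
#include <vector>

struct Point { double x, y; };

struct Candidate {
    std::string name;   // object name, equal for all ambiguous candidates
    Point position;     // screen position of the object
};

// Pick, among objects sharing the spoken name, the one matching a positional
// attribute ("up", "down", "left", "right") relative to the fixation point.
// Screen coordinates grow downwards, so "up" means a smaller y value.
const Candidate* resolve(const std::vector<Candidate>& ambiguous,
                         const std::string& attribute,
                         const Point& fixation) {
    for (const auto& c : ambiguous) {
        if (attribute == "up"    && c.position.y < fixation.y) return &c;
        if (attribute == "down"  && c.position.y > fixation.y) return &c;
        if (attribute == "left"  && c.position.x < fixation.x) return &c;
        if (attribute == "right" && c.position.x > fixation.x) return &c;
    }
    return nullptr;   // no candidate matches the attribute
}

int main() {
    // Two "File" objects in different windows (hypothetical positions).
    std::vector<Candidate> sameName = {{"File", {200, 100}}, {"File", {200, 500}}};
    const Candidate* chosen = resolve(sameName, "up", Point{300, 300});
    return chosen ? 0 : 1;
}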


Conclusions

In this study, a system capable of speech recognition and object selection has been developed, integrated with an eye tracker simulator and a vocal platform. The prototype was developed in C++ and VoiceXML.

The gazing block extracted from the tracking simulation yields a set of objects containing the object that the user wants to select. The boundary of the gazing block is determined by a configuration parameter, which was varied during testing. The objects are stored in a structure that allows quick insertion and search; once located, the set of objects is passed to the vocal platform as a VoiceXML grammar. Word detection and recognition are delegated to the vocal platform, which interprets and processes the spoken words; this is performed by the VoiceXML interpreter together with a VoiceXML application that specifies how the words must be interpreted. Matching between objects and spoken words is performed by a query on the object structure, and the application allows the entire process to be repeated. For an object to be detected in the tree, one condition must be fulfilled: it must be present in the gazing block.

Test results demonstrated that the system works well with robust eye trackers operating on a limited area. This limitation is due not to the eye tracker itself, but to cases of ambiguity, although this condition is not very common on a desktop screen.

The system we developed certainly needs improvements and modifications, because although it is fully automatic it remains largely dependent on its users. We would like to point out the following improvements.

1. Eye tracker simulator: we would like to improve this unit and make it integrable with other eye-tracking systems.

2. Structure used to store objects: the BST implies a logarithmic query time; this aspect could perhaps be improved with different structures.

3. The prototype presented here can be helpful to people with disabilities. In this respect it depends on the vocal capabilities of the user, that is, on the degree of damage to the vocal cords or muscles the person suffers from. So far the application has been tested only on able-bodied users.

4. The application could be integrated with other devices, but this was not achieved within this work. Our efforts are directed above all to the Virtual Environment and Virtual Reality field.

