SOFTWARE FRAMEWORK FOR PARSING AND INTERPRETING
GESTURES IN A MULTIMODAL VIRTUAL ENVIRONMENT
CONTEXT
François Rioux
Department of Electrical and Computer Engineering
McGill University, Montréal
June 2005
A thesis submitted to the Faculty of Graduate Studies and Research
in partial fulfilment of the requirements of the degree of
Master of Engineering
© FRANÇOIS RIOUX, 2005
Library and Archives Canada
Bibliothèque et Archives Canada
Published Heritage Branch
Direction du Patrimoine de l'édition
395 Wellington Street, Ottawa ON K1A 0N4, Canada
395, rue Wellington, Ottawa ON K1A 0N4, Canada
NOTICE: The author has granted a nonexclusive license allowing Library and Archives Canada to reproduce, publish, archive, preserve, conserve, communicate to the public by telecommunication or on the Internet, loan, distribute and sell theses worldwide, for commercial or noncommercial purposes, in microform, paper, electronic and/or any other formats.
The author retains copyright ownership and moral rights in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.
In compliance with the Canadian Privacy Act some supporting forms may have been removed from this thesis.
While these forms may be included in the document page count, their removal does not represent any loss of content from the thesis.
AVIS:
Your file / Votre référence ISBN: 978-0-494-22666-7
Our file / Notre référence ISBN: 978-0-494-22666-7
L'auteur a accordé une licence non exclusive permettant à la Bibliothèque et Archives Canada de reproduire, publier, archiver, sauvegarder, conserver, transmettre au public par télécommunication ou par l'Internet, prêter, distribuer et vendre des thèses partout dans le monde, à des fins commerciales ou autres, sur support microforme, papier, électronique et/ou autres formats.
L'auteur conserve la propriété du droit d'auteur et des droits moraux qui protège cette thèse. Ni la thèse ni des extraits substantiels de celle-ci ne doivent être imprimés ou autrement reproduits sans son autorisation.
Conformément à la loi canadienne sur la protection de la vie privée, quelques formulaires secondaires ont été enlevés de cette thèse.
Bien que ces formulaires aient été inclus dans la pagination, il n'y aura aucun contenu manquant.
Abstract
Human-computer interaction (HCI) is a research topic whose eventual outcome will provide users of
computer systems with more natural interfaces than the traditional keyboard and mouse. Ideally,
those interfaces would exploit the same communication channels used in everyday life, namely
speech, gestures or any other expressive feature of the human body. In this thesis, a continuous
dynamic gesture recognition system is presented. Positioning devices such as a mouse, a data glove
or a video camera are used as input streams to the recognition module, from which it extracts
the most interesting features. Gestures are recognized continuously, which means that no prior
temporal segmentation is necessary. Training facilities are also available in order to build gesture
models from known segmented gesture sequences. In order for users to effectively reuse code and
build new modules that share common interfaces, a software framework was built that allows for
multimodal inputs and outputs, as well as for configuring the virtual world and how data coming
from the real world influences virtual world entities. The data flow originating from the real world
uses a common data format that is standard from the configuration file to the network packets. The
virtual world is modeled such that the actions that affect the virtual entities, given input data from
the real world, are configurable and extensible. Sharing the environment through the network is also
possible, allowing users at different locations to work on the same virtual world. Preliminary
tests of the gesture recognition performance are presented for several different input modalities
and setups. An experimental application is also described, showing the flexibility and extensibility
of the software framework.
Résumé
L'interaction homme machine (IHM) est un domaine de recherche duquel les travaux résultants
permettront dans le futur de fournir aux utilisateurs de systèmes informatiques des interfaces dont
l'utilisation sera plus naturelle. Ces interfaces devront idéalement se servir des mêmes moyens de
communication qui sont utilisés dans la vie de tous les jours, soit la parole, les gestes ou autres
expressions corporelles. Dans le cadre de ce mémoire, un module de reconnaissance de gestes dynamiques continus est présenté. Il permet de reconnaître des gestes exécutés avec des instruments
de suivi de positions et ce, sans segmentation temporelle préalable. Un module d'entraînement
de gestes est également disponible dans le but de construire des modèles de gestes à partir de
séquences préalablement segmentées. Dans le but de fournir aux utilisateurs du logiciel une flexibilité d'utilisation accrue ainsi que la possibilité de rajouter d'autres modalités d'entrée et de sortie
au système, un framework logiciel a été implémenté. Ce logiciel intégré modélise le monde virtuel
de manière à ce que non seulement les données soient standardisées dans tout le flot de données,
mais également afin de faciliter l'exécution d'actions déclenchées par des données provenant du
monde réel sur des entités du monde virtuel. Le logiciel rend aussi possible la communication entre
plusieurs nœuds d'un réseau, donnant aux utilisateurs le loisir de partager leur monde virtuel. Des
tests préliminaires sur la performance du module de reconnaissance de gestes vis-à-vis différentes
modalités d'entrée sont présentés ainsi qu'une application mettant en évidence les caractéristiques
de flexibilité et extensibilité du logiciel intégré.
Acknowledgements
I would like to thank my parents for the values they instilled in me, especially respect, excellence
and hard work. I would also like to thank my supervisor Dr. Jeremy R. Cooperstock for giving
me the opportunity to take part in the SRE research group and for reviewing my thesis. I also
acknowledge his financial support. I would also like to thank Dr. Denis Laurendeau and Dr.
Alexandra Branzan Albu, who welcomed me and supervised my work at Laval University during
the fall 2004 semester as an exchange student in the course of the QERRAnet program. Thanks
to Frank and Mike for correcting my thesis; I really enjoyed working with them on the famous
"Modellers' Apprentice" table. I would also like to thank my brothers (Réjean, Normand, Alain,
Gervais) and friends (Charles, Tardif, Ben, Phil, Filteau, Louis, PO) for the fun we have outside
school. Special thanks to Marie-Ève for her advice and perpetual smile. Finally, thanks to
NSERC for its financial support.
TABLE OF CONTENTS
Abstract
Résumé
Acknowledgements
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1. Introduction
1.1. Context of Research
1.2. Research Problem
1.3. Thesis Roadmap
CHAPTER 2. Literature Review
2.1. What Is a Gesture?
2.2. Motion Capture Hardware
2.3. Gesture Recognition Algorithms
2.4. Virtual Environment Software Architectures
2.5. Design Goals
CHAPTER 3. Gesture Recognition Module
3.1. Introduction to Hidden Markov Models
3.2. Choosing the Feature Vector
3.3. Training Algorithms
3.4. Continuous Gesture Recognition
3.4.1. Hypotheses Generation Algorithm
3.4.2. Gesture Spotting Algorithm
3.5. Gesture Input Modalities
3.5.1. Mouse-based Gesture Recognition
3.5.2. Glove-based Gesture Recognition
3.5.3. Vision-based Gesture Recognition
3.6. Choosing Gestures
3.7. Conclusion
CHAPTER 4. Software Framework
4.1. Overview of the Framework's Architecture
4.2. AData, Generic Data Container
4.3. Modalities
4.3.1. Input Modalities
4.3.2. Output Modalities
4.4. Input Tokens Principle
4.4.1. Input Token Example
4.5. World, World Entities and How to Manage Them
4.5.1. Components Description
4.5.2. Examples
4.6. Interaction Manager
4.7. Taking Advantage of the Context
4.7.1. Example
4.8. Network Manager
4.9. XML Configuration File
4.9.1. Input Modality Node
4.9.2. Output Modality Node
4.9.3. World Node
4.9.4. Action Node
4.9.5. Network Node
4.9.6. Discussion
4.10. Conclusion
CHAPTER 5. Results and Discussion
5.1. Continuous Dynamic Gesture Recognition under Several Conditions
5.1.1. Choice of Feature Vector
5.1.2. Mouse-Based Gesture Recognition Rate
5.1.3. Glove-Based Gesture Recognition Rate
5.1.4. Vision-Based Hand Gesture Recognition
5.2. The Context Grabber's Influence on the Recognition Rate
5.3. The Experimental Application
5.4. Framework's Performance With a Large Number of Entities
5.5. General Discussion and Limitations
CHAPTER 6. Conclusion and Future Work
6.1. Conclusions
6.2. Future Work
REFERENCES
APPENDIX A. XML Notation
APPENDIX B. Implemented Components
B.1. Actions
B.1.1. Action "moveCursor"
B.1.2. Action "reset"
B.1.3. Action "traceCursor"
B.1.4. Action "delete"
B.1.5. Action "placeImage2D"
B.1.6. Action "pick"
B.1.7. Action "drop"
B.1.8. Action "translate"
B.1.9. Action "rotate"
B.1.10. Action "undo"
B.1.11. Action "redo"
B.1.12. Action "showSystemInformation"
B.1.13. Action "stateChange"
B.1.14. Action "placeModel3D"
B.1.15. Action "putTexture"
B.2. World Entities
B.2.1. Image 2D
B.2.2. Model 3D
B.2.3. Mouse Cursor
B.2.4. Mouse Trajectory
B.2.5. Text 3D . .
B.3. Input Modalities
B.3.1. Glove-based Gesture Recognition
B.3.2. Mouse-based Gesture Recognition .
B.3.3. Vision-based Gesture Recognition
B.4. Output Modalities
B.4.1. Display 2D .
B.4.2. Display 3D .
B.4.3. Open Scene Graph Display 2D .
B.4.4. Open Scene Graph Display 3D .
APPENDIX C. Sample XML Configuration File
APPENDIX D. User Manual
D.1. Introduction
D.2. Prerequisites
D.3. Installation
D.3.1. Microsoft Windows Installation
D.3.2. Linux Installation
D.4. How to Extend the Framework?
D.4.1. Input Modality ...... .
D.4.2. Output Modality
D.4.3. World Entity
D.4.4. Action ...
D.4.5. World Hook
D.5. Putting it All Together
D.6. How to Use the Gesture Recognizer? .
D.6.1. Training Procedure ..
D.6.2. Recognition Procedure
D.7. Troubleshooting and Advice
LIST OF FIGURES
2.1 Taxonomy of gestures in HCI (this figure originates from Pavlović's review [70])
2.2 General architecture of a virtual environment software
3.1 Typical hidden Markov model and its constituents
3.2 Sample gesture features in two dimensions
3.3 Spotting of two circle gestures, whose localization ("X") is supposed to be at the circle's center. The spotted starting point is the dot's location.
3.4 Vision-based gesture recognition setup
4.1 UML framework's architecture
4.2 Detailed view of the data pipeline
4.3 AData library class structure
4.4 Input modality class structure example
4.5 Output modality class structure example
4.6 Sequence diagram for the rendering call on a three-dimensional OpenGL display output modality
4.7 Input tokens UML class representation
4.8 World, world entities and actions class structure
4.9 3D model configuration example
4.10 Interaction manager and auxiliary classes
4.11 Data saving and undo processes
4.12 Context grabber's class interface
4.13 Network manager and surrounding classes
4.14 Network manager's sequence diagram
5.1 Gesture set used for recognition tests
5.2 Large gesture set
5.3 Typical framework's application scene
5.4 Experimental application's gesture dialogue
5.5 Framerate as a function of the number of entities (debug version)
5.6 Framerate as a function of the number of entities (release version)
5.7 Interaction manager's processing time as a function of the number of entities
LIST OF TABLES
3.1 Comparison of different feature vector selections
5.1 Recognition rate for feature selection, including the number of insertions (# ins.) and substitutions (# subs.)
5.2 Mouse-based gesture recognition rate with improved trained HMMs
5.3 Mouse gesture recognition rate for a large number of possible gestures, including insertion and substitution errors
5.4 Recognition rate of glove gestures with original and improved models
5.5 Recognition results for a large number of gestures with and without the context grabber
D.1 Summary of software components that need to be specialized
D.2 Summary of the gesture trainer keyboard interface commands
CHAPTER 1
Introduction
1.1. Context of Research
Since the earliest room-sized computers, interaction between humans and computers has been an
issue spawning an important area of research and development in computer science and psychology,
called human-computer interaction (HCI). In this field, researchers are concerned with the design,
evaluation and implementation of interactive computing systems for human use. The result of this
research has led to several standardized metaphors and paradigms, which are optimized for certain
computing tasks.
Perhaps the most commonplace interaction technique in today's world depends on keyboards
and mice as input devices, and the WIMP (windows, icons, menus, and pointers) paradigm for graphical
interaction, despite this being an unnatural method of interaction with a computer, according to
several HCI studies [27,49,94].
A natural way to interact with a computer system would, ideally, allow users to communicate as
they do in everyday life with other people. Such a human-computer interface should support speech
recognition, track body motion, and understand cues regarding the user's emotions and intentions.
Many systems were developed to recognize speech [98], facial expressions [32], body actions [9], and
hand gestures [2]. Specific devices also exist that allow input of user movements with additional
degrees of freedom beyond those provided by a mouse [20,71,72].
Similarly, it is possible to affect a greater variety of senses than those currently engaged by the
majority of personal computer systems. In contrast, immersive environments, particularly virtual
reality, offer a more engaging visual representation. Systems such as the CAVE [24] or Immersadesk [26] use large projection surfaces to surround the user visually. Spatialized audio can be
rendered in order to simulate sound coming from definite positions. It is also possible to output
data that would affect the sense of touch with haptic devices [33] or even smells that would trigger
a person's feelings [61]. Interfaces that allow the user to communicate through multiple input and
output sources simultaneously are called multimodal interfaces.
Systems of this type are the aim of the Shared Reality Environment (SRE) [19]
at McGill University, in which this thesis was accomplished. The target system is a walk-in-and-use
environment, such that a user would be able to interact with the virtual world without extensive
prior training. The interface must therefore be natural to use and should, ideally, adapt to the user's
needs.
1.2. Research Problem
The work accomplished as part of this thesis is based on a set of problems in gesture recognition, described in the next paragraph, which lead to a second research topic that focuses on the
interpretation of gestures in a multimodal virtual environment context.
Several years ago, Wexelblat [101] formulated a set of questions for the gesture recognition
community, which, in large part, remain unanswered. Most of his concerns were related to the
"natural recognition" of gestures, as they are usually performed when people communicate with
each other, and problems such as feature detection and continuous gesture recognition. A feature
is defined as a point of interest in the data stream from the capture device (mouse, glove, video
camera). However, it is not yet known which features are the most relevant in describing a gesture
accurately. These features also depend on the input device that is used; hence no universal solution
is possible. Several different gesture recognition algorithms therefore exist in the literature and their
use depends on the context of the application.
The gesture recognition community is also facing the problems of how gestures are interpreted
by the system and how they affect the virtual world. It is not sufficient to detect the occurrence of
a gesture; the mapping of gestures to their effects in software is essential to having a usable system.
In order to ease the development process needed to build software with which users interact through
gestures, such a system needs to be equipped with a generic and flexible software architecture that
is suitable for arbitrary applications. It is also essential to provide facilities that allow for the
combination of modalities in a single expression, since gestures are not likely to be used alone when
naturally interacting with a virtual environment, unless the system understands a gesture language
akin to American Sign Language (ASL).
1.3. Thesis Roadmap
The work presented in this thesis is focused on input and output data communication through a
general interface that is suitable for a large number of modalities. The presented virtual environment
software architecture is novel since it integrates the data pipeline, which spans from input to output
modalities, with the virtual world management. This is accomplished by means of actions that
apply to world entities linked with predetermined "behaviours", given corresponding input events
and application rules. Additionally, the definition of an XML file format that describes a multimodal
system as well as the virtual world constitutes a novelty of the current system.
A continuous dynamic gesture recognition module was also implemented in the course of this
thesis, using hidden Markov models (HMMs) as a statistical classifier. The recognizer's implementation is therefore similar to other comparable systems. The choice of features of interest and the
gesture spotting algorithm that will be described in Chapter 3 is however original work that aims
at improving results obtained with the existing systems.
Chapter 2 presents a literature review on gestures in communication, offering insight regarding different methods for recognizing gestures in an HCI context. Different virtual environment
software is also reviewed, from which the proposed framework drew inspiration. The chosen gesture
recognition algorithm is then presented in Chapter 3, followed by a description of the different implemented gesture input modalities. An exhaustive description of the proposed software framework
follows in Chapter 4, with explanatory examples that justify the rationale behind the design decisions. Preliminary results are presented in Chapter 5, as well as an experimental application. The
thesis concludes with analysis and avenues worthy of further exploration in Chapter 6.
CHAPTER 2
Literature Review
This chapter presents a literature review of the major topics relevant to this thesis. Gestures
in communication are presented, followed by a review of gesture recognition techniques. Various
virtual environment software systems are then presented and reviewed.
2.1. What Is a Gesture?
Considerable research has been invested over the past few decades in order to improve the
interaction between humans and computers. One of the objectives of a user interface is that it should
be natural to use. Gestures have been shown to play an important role in everyday communications
between humans [56] in order to express emotions or to augment information conveyed through
other communication channels. Some examples of common culturally specific gestures would be the
"okay" sign, the "thumbs up" sign, the large amplitude gesture people make to catch a taxi, the
salutation gesture, and many others. Also, people tend to gesticulate in order to mimic concepts that
have a spatial dimension which cannot be as easily described with speech. An example of gestures
augmenting speech can be found when a person describes her weekend: "I killed a caribou that big",
while performing a gesture indicating the size of the caribou. One well-known use of gestures in
communication is sign language.
Several researchers studied sign language, particularly Stokoe [87], who defines the structure
of sign language as being described with a hand shape, a position, an orientation and some movement. Kendon [54] goes further by studying not only sign language, but every kind of gesture that
is performed in everyday life. He classifies gestures with the following categories: gesticulation,
language-like gestures, pantomimes, emblems and sign language, from the less structured semantics
[FIGURE 2.1. Taxonomy of gestures in HCI (this figure originates from Pavlović's review [70]): hand/arm movements divide into unintentional movements and gestures; gestures into manipulative and communicative; communicative gestures into acts (mimetic, deictic) and symbols (referential, modalizing).]
to the most structured. Many other researchers in the fields of psychology and linguistics have done
extensive research on the role of gestures in communication [13,56,73,74]. For a more detailed
survey on how gestures influence communication semantics, see McNeill [60], who examines the
role of gestures in relation to speech and thought.
Gestures can be classified into many categories: dynamic gestures, static gestures, body postures
and body actions. Some systems specialize in the recognition of hand posture in order to give
commands [6] or in recognizing human actions [62,88,103]. A taxonomy of gestures for HCI has
recently been proposed by Pavlović et al. [70] in order to classify different types of gestures by their
meaning and the kind of information they would provide to a computer system. This taxonomy
can be seen in Figure 2.1.
Manipulative gestures are used to mimic the manipulation of virtual objects. Unfortunately,
this mimicry does not include any sensory feedback from the object(s) being manipulated. Several
brands of haptic devices are commercially available that capture users' movements and apply force
feedback to the hand (CyberGrasp [21]) or active feedback on the entire arm (PHANTOM [92]).
However, these devices do not offer natural interaction since a user needs to be attached to some
sort of invasive tether or mechanics that remove any sense of naturalness.
On the other hand, communicative gestures do not require any external force feedback to be
used realistically, as they are performed in free space, as in everyday life. Mimetic gestures mimic
actions that need to be performed on objects (e.g. a circular gesture may mean rotate a particular
object). Deictic gestures, also known as pointing gestures, are heavily used while communicating
with other humans.
Abstract symbolic gestures usually represent an arbitrary action or object. There is not necessarily a natural mapping between a symbolic gesture and its meaning; therefore it should be the
user's choice as to the definition of each symbol. Baudel and Beaudouin-Lafon [2] argue that the
expected advantages of gesture interactions would be the naturalness of the interaction as well as
the richness of gestures and the direct interaction, removing the need for intermediate transducers.
They however point out several drawbacks that a gesture interface would have, namely fatigue, the
fact that gestures by themselves do not necessarily mean anything to the user, and more technical
problems such as the segmentation of gestures in a continuous recognition context. These disadvantages however suggest how the gesture set should be chosen and which technical challenges will be
the most difficult to solve.
2.2. Motion Capture Hardware
A gesture recognition system is a combination of hardware and software that first captures information from the real world, then analyzes this information and draws conclusions about what is
happening in the actual world. The features of interest in the input stream coming from the hardware can be hand positions when a user is performing a gesticulation, or can have more degrees of
freedom by capturing, for example, the position and curvature of individual fingers as well as hand
orientation and so forth. The number of degrees of freedom can increase dramatically if every single physical one is considered in the recognition (e.g. CyberGlove, 22 degrees of freedom for the
fingers [20]), but this provides more flexibility when the gesture is performed. More flexibility however adds more cognitive load on the user [95], particularly if the system does not use the device
effectively.
The next paragraphs describe existing hardware that is used to capture the movement of a
user's hands. Positioning devices akin to the Ascension Flock of Birds [23] or Polhemus [72] are used
to capture hand movement as well as other moving parts of the body while performing a gesture.
These devices typically use perpendicular magnetic field emitters and sensors in order to measure
and triangulate the position of the worn device. One problem with such material is the physical
tether that links the user to the system. This accessory can limit the user's movement and render
the interaction unnatural. However, wireless solutions reduce this impact.
Untethered environments naturally gain in popularity with the increasing computing power and
capability of today's computers. The majority of so-called free-hand systems use computer vision
as a source of data input. There are also other positioning devices such as the Vicon system [71]
that use passive infrared markers and cameras in order to position body parts accurately. These
are extremely accurate, but the equipment is prohibitively expensive. Some vision-based systems
detect skin coloured blobs and process them in order to extract useful knowledge of the real world
scene. Others use colour blob and marker detection in order to compute features' positions. These
markers can be located on the user as in Iwai's work [51] or on an external device such as the
VisionWand [15].
Other devices use hand movement and a real-world metaphor in order to navigate through the
environment such as the control action table (CAT) [45], which is a steering-wheel-like device. Other
systems use touch screens or similar kinds of devices in order to position a pen device in two
dimensions on a working surface, namely the Tablet PC [47].
2.3. Gesture Recognition Algorithms
Gesture recognition software includes several components that depend on the type of hardware
that is chosen. For vision-based gesture recognition systems, the first step in the processing pipeline is
the image analysis, which is meant to extract distinguishable features from a large data set. There
are essentially two ways of solving the problem of feature detection: model-based detection and
appearance-based detection. The principle of model-based detection is to analyze images and detect
interesting elements whose virtual representation is known, based on a predetermined constrained
model. For example, in some systems, colour blobs are detected, and in others skin blobs are isolated.
These coloured blobs are known to be associated with hands, head, face or other body parts that
match an a-priori known model of a user in a given environment. Blob detection methods rely on
accurate tracking since the system needs to know where they are located at every moment. Hence,
accurate and reliable tracking algorithms such as CONDENSATION [7], Kalman filtering [67],
CAM-shift [106] or mean shift tracking [57] must be used.
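As a concrete illustration of the tracking step, the sketch below implements a constant-velocity Kalman filter in NumPy for smoothing noisy blob centroid detections. The state layout and all noise parameters are illustrative assumptions, not values taken from any of the cited systems.

```python
import numpy as np

# Constant-velocity Kalman filter tracking a blob centroid.
# State: [x, y, vx, vy]; measurement: noisy (x, y) detection.
dt = 1.0
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)   # state transition
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)   # only position is observed
Q = 0.01 * np.eye(4)                        # process noise (assumed)
R = 4.0 * np.eye(2)                         # measurement noise (assumed)

def kalman_step(x, P, z):
    # Predict the state forward one frame
    x = F @ x
    P = F @ P @ F.T + Q
    # Correct with the new measurement z
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P
    return x, P

rng = np.random.default_rng(1)
x = np.zeros(4)
P = 10.0 * np.eye(4)
true_pos = np.array([0.0, 0.0])
vel = np.array([2.0, 1.0])
for _ in range(50):
    true_pos = true_pos + vel * dt
    z = true_pos + rng.normal(scale=2.0, size=2)  # noisy blob detection
    x, P = kalman_step(x, P, z)

print(x[:2], true_pos)  # the filtered estimate tracks the true centroid
```

After a few dozen frames the velocity estimate converges as well, which is what makes the predict step useful for bridging frames in which the blob detector fails.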
7
2.3 GESTURE RECOGNITION ALGORITHMS
As for the appearance-based methods, an operation is computed on the entire image in order
to find general characteristics that would match predetermined models. This processing is generally
a variant of optical flow [25] that is used to compute the movement in the image sequence. Vector
coherence mapping (VCM) [75] is known to extract motion fields in videos, imposing several constraints on the resulting field. Motion history image (MHI) [8] is a method for finding temporal
changes in a video sequence, thus keeping track of every changing pixel in a history map.
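The MHI idea can be sketched in a few lines: pixels that changed in the most recent frame receive a maximal timestamp value, while older motion fades out step by step. The frame size, change threshold and decay length below are arbitrary illustrative choices, not parameters from the cited method.

```python
import numpy as np

def update_mhi(mhi, prev_frame, frame, tau=30, diff_thresh=25):
    """Update a motion history image with one new grayscale frame.

    Pixels that changed are set to the maximum timestamp value tau;
    unchanged pixels decay by one step, leaving a fading trace of motion.
    """
    motion = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16)) > diff_thresh
    return np.where(motion, tau, np.maximum(mhi - 1, 0))

# Toy sequence: a bright square moving one pixel right per frame.
h, w = 32, 32
frames = []
for t in range(5):
    f = np.zeros((h, w), dtype=np.uint8)
    f[10:20, 5 + t:15 + t] = 255
    frames.append(f)

mhi = np.zeros((h, w), dtype=np.int16)
for prev, cur in zip(frames, frames[1:]):
    mhi = update_mhi(mhi, prev, cur)

# The most recently changed columns hold the value tau; motion from
# earlier frames has decayed, encoding the direction of movement.
print(mhi.max())  # 30
```

Reading the gradient of such a map gives the direction of motion, which is what makes the MHI usable as a compact temporal feature.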
The next step in the processing pipeline is feature extraction. It is an important operation
because it allows classification algorithms to be tractable computationally by reducing the amount
of data to consider. There are several possible operations to perform on the data in order to keep
only the most relevant features. A well-known method is the principal components analysis (PCA),
which is used to keep the feature sets with the largest variance. PCA is particularly suitable for
appearance-based methods, or when the amount of input data is very large. In the case of a small
number of processed features (e.g. model-based vision systems), it is possible to compute relevant
features using raw input data [82].
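PCA-based feature reduction can be sketched as follows: centre the data, eigen-decompose its covariance matrix, and project onto the top-k eigenvectors. The synthetic 10-dimensional data set below is invented for illustration.

```python
import numpy as np

def pca_reduce(X, k):
    """Project n samples of d-dimensional features onto the top-k
    principal components (the directions of largest variance)."""
    Xc = X - X.mean(axis=0)                 # centre the data
    cov = np.cov(Xc, rowvar=False)          # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    return Xc @ top                         # n x k reduced features

# Hypothetical example: 200 noisy 10-D feature vectors that really
# vary along only two latent directions.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

Z = pca_reduce(X, k=2)
print(Z.shape)  # (200, 2)
```

Here two components retain essentially all of the variance, so a classifier sees 2-D instead of 10-D inputs, which is exactly the tractability gain described above.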
The classification phase, also called the recognition process, is one of the critical parts for
a reliable gesture recognition system. It is possible to recognize many types of gestures: static,
dynamic or both at the same time. For static gestures, model matching is usually employed in
order to compare incoming data with a previously trained template. For instance, artificial neural
networks (ANN) can classify incoming data given some previously trained network models.
For dynamic gesture recognition, several statistical methods can be used. Dynamic time warping
(DTW) is employed to align the incoming data stream with a template [28]. Time-delay neural
networks (TDNN), the dynamic version of ANN, can classify incoming data given a large amount
of training data [104]. Variants of TDNN have also been developed [58]. CONDENSATION-based
gesture recognition can be used in order to match a dynamic CONDENSATION model with the
incoming data [7]. One of the most popular statistical classifiers used for gesture recognition is
hidden Markov models (HMMs). These have been successfully applied to speech recognition and
can be adapted for gesture recognition [76].
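The one-HMM-per-gesture classification scheme can be sketched with the scaled forward algorithm: each gesture model scores an observation sequence of quantized movement directions, and the recognizer picks the highest-scoring model. The two toy cyclic models and all their parameters below are invented for illustration and do not come from any cited system.

```python
import numpy as np

def log_likelihood(obs, pi, A, B):
    """Scaled forward algorithm: log P(obs | HMM) for an HMM with initial
    distribution pi (N,), transitions A (N, N) and discrete emissions B (N, M)."""
    alpha = pi * B[:, obs[0]]
    log_p = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        log_p += np.log(alpha.sum())  # accumulate the scale factors
        alpha /= alpha.sum()          # rescale to avoid numerical underflow
    return log_p

# Two 4-state cyclic models over 4 direction symbols (0=E, 1=N, 2=W, 3=S):
# one sweeps through the directions counterclockwise, the other clockwise.
n = 4
pi = np.full(n, 1.0 / n)
B = np.full((n, n), 0.05)
np.fill_diagonal(B, 0.85)  # state i mostly emits direction symbol i
ccw = 0.4 * np.eye(n) + 0.6 * np.roll(np.eye(n), 1, axis=1)   # state i -> i+1
cw = 0.4 * np.eye(n) + 0.6 * np.roll(np.eye(n), -1, axis=1)   # state i -> i-1
models = {"circle_ccw": (pi, ccw, B), "circle_cw": (pi, cw, B)}

obs = [0, 0, 1, 1, 2, 2, 3, 3]  # quantized directions sweeping E -> N -> W -> S
best = max(models, key=lambda g: log_likelihood(obs, *models[g]))
print(best)  # circle_ccw
```

In a real recognizer the model parameters would come from Baum-Welch training on segmented examples, and the argmax would additionally be compared against a rejection threshold so that non-gesture movement is not forced into a gesture class.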
Speech and gestures share common characteristics that justify the use of HMMs for the current
application. Both involve features that change over time and patterns that are repeated. In speech
recognition, an HMM is associated with every phoneme, from which researchers were inspired in
order to define "gesture phonemes" or "cheremes" for American Sign Language (ASL) [30]. However,
linguists have not yet agreed on a common set of gesture quanta of which every gesture occurrence
would be composed [97]. Therefore, in this thesis every gesture is associated with an HMM. Some
researchers use the classic representation of a hidden Markov model, whereas others try to add more
specificity and thus more reliability in the classification of the models in given contexts by creating
variants, such as coupled HMMs (CHMMs) [12] or parametric HMMs (PHMMs) [102].
The majority of the cited systems achieve recognition rates above 90 percent, the recognition rate being defined as the number of recognized gestures over the number of performed gestures. Most of them are, however, tied to a particular application and therefore not suitable for general-purpose use. Because of those limitations, virtual environment software architectures were investigated in the course of this thesis. As presented in the next section, the general architecture of virtual environment software suggests that building user-specific modules from the basic building blocks would be simplified. The ideal system would aim at adapting a maximum number of gesture recognizers to an interface, allowing users to implement their particular applications. An important design goal of the current system, flexibility, therefore requires removing any dependency on the hardware employed by a given gesture recognizer, leading to an abstract input modality interface that needs to be specialized. The same scheme should be applied to output modalities, as well as to all of the architecture's abstract components.
2.4. Virtual Environment Software Architectures
An important aspect of virtual environments is how the software is put together in order to build upon existing software components to accommodate future applications. Object-oriented software frameworks provide a set of abstract classes from which a user inherits in order to include application-specific code. A framework is therefore a large piece of code that is extensible for particular applications, while providing building blocks for theoretically any supported type of application. A general schematic of the composition of virtual environment systems can be seen in Figure 2.2. Many systems exist that try to provide flexible ways of building virtual environment applications, such as VRPN [90], VR-Juggler [4], Tandem [40], and DIVE [18].
VR-Juggler [4] is intended to be a development environment for virtual reality applications. It provides a set of class interfaces that a user extends in order to use specific devices. It is also configurable through a graphical user interface (GUI) written in Java, which uses CORBA to communicate with the VR-Juggler kernel written in C++ [46].

FIGURE 2.2. General architecture of virtual environment software: input devices, output devices, network communications, world data management and the application context
VRPN (Virtual Reality Peripheral Network) [90] is a set of classes used to provide a transparent network layer between devices and user applications. It is not intended to serve as visualization software, but rather as an input manager that connects remote devices transparently for a user of the toolkit.
DIVERSE (Device Independent Virtual Environments - Reconfigurable, Scalable, Extensible) [53] is a virtual environment toolkit rather than a framework. It provides a distributed shared memory of the state of the world for every remote instance of the environment, as well as an abstraction for the input and output layers.
Other frameworks include Avango (Avocado [93]), MASSIVE [41], Tandem [40], and blue-c [64], which have different design goals with respect to flexibility and scalability of physical input and output devices, and shared memory over the network. In the majority of these numerous existing frameworks, the emphasis is put on the abstraction of devices (input and output) as well as on network communication. In the current work, more effort was put into facilitating the way users configure a multimodal framework and model the virtual world in a flexible manner, allowing input modalities such as gestures to act on what is rendered in the real world.
2.5. Design Goals
The goal of this project was ultimately to create vision-based gesture interaction software to be used in immersive virtual environments. Video input was chosen because it allows a user to act on a virtual environment without wearing any device that would be tethered to the computer.1 It enhances the sense of immersion and provides the user with a better virtual experience in a CAVE-like projection system. At the outset of this research on gesture recognition algorithms, it was noticed that many systems exist to recognize human gesticulation in different situations with a large number of hardware devices. However, generic systems that involve gestures are not common in the literature. The generality requirement can be defined as a flexible and easy way to configure how input gestures will influence the virtual world with which a user is interacting.
To push this further, the framework was implemented with the capability of supporting additional inputs, such as speech, multiple gesture recognition devices or the keyboard. However, the problem of combining multiple input modes was left for ongoing work. Additionally, a virtual world model was implemented in order to provide users with a flexible way to build an environment with several types of entities that are to be rendered in several different output modalities (e.g. 3D display, 2D display, spatialized sound system). Network communications were implemented to share a virtual world among two or more computers, providing facilities to maintain coherence and allowing distributed input and output modalities to be instantiated, such that a remote computer could send events to its clients. The evaluation of the networking module was not part of this thesis and is left as future work.
The principal design goals and requirements that led to the decisions regarding the software and general architecture design are:
• vision-based gesture recognition system
• visualization of the virtual world using three dimensional views
• flexible way to configure how gestures will influence the virtual world
• multimodal capabilities (input and output)
• networked virtual environment maintaining world coherence
1 In the present case, users have to wear coloured gloves.
CHAPTER 3
Gesture Recognition Module
In this chapter, a description of the continuous dynamic gesture recognition module is presented, including an introduction to the chosen statistical classifier, the hidden Markov model. Details of the associated training and recognition algorithms are also provided, followed by a word on the method used to select features, and ending with a description of the three implemented gesture input modalities. In the context of this thesis, a feature is an interesting data point calculated from the raw input data stream. The remaining sections of this chapter describe original work on the choice of feature vectors, the gesture spotting algorithm and the implementation of three gesture recognition input modalities (mouse, data glove and video camera).
3.1. Introduction to Hidden Markov Models
Hidden Markov models (HMMs) were chosen for representing gesture models in this thesis because they allow for both spatial and temporal variations in the input data. HMMs are also well known in the literature, and implementations are freely available [36,39,79,91]. Previous work on gesture recognition shows that reasonable recognition rates can be obtained using a classical HMM implementation, with discrete or continuous gestures [1,14,78,80,86,89]. Another advantage of HMMs is that, given the chosen features, the state sequence can have a physical meaning that the spotting algorithm can take advantage of, unlike neural networks, which do not have any meaningful internal structure [29].
Hidden Markov models have been used for a long time in the scientific community, but have recently enjoyed a gain in popularity due in part to the field of automatic speech recognition [76]. An HMM is a collection of random variables with an appropriate set of conditional independence properties [5]. An HMM can also be described as a state diagram whose states are unknown ("hidden"), each of which has an associated emission probability distribution, with probabilities linked to the transitions between those states. For extensive reviews and applications of hidden Markov models, several tutorials are available [5,38,76,77].
More formally, a hidden Markov model λ, composed of N hidden states, can be defined by λ = (A, B, π), whose parameters are as follows:1
• A = {a_ij} is the transition probability matrix, where a_ij = P[q_{t+1} = S_j | q_t = S_i] is the transition probability from state i to state j, S = {S_1, S_2, ..., S_i, ..., S_N} is the set of individual states and q_t is the state at time t.
• B = {b_j} is the set of emission probability distributions, each composed of n continuous Gaussian mixtures that give the observation probability b_j(x) at a given state q_t = S_j for an observation value O_t = x at a given time t. Depending on how sparse the training data is, a model will be composed of one or more continuous output observation Gaussian mixtures. It is also possible to have discrete observation data, but only continuous observations are considered in this discussion.
• π = {π_1, π_2, ..., π_N}, where π_i = P[q_1 = S_i] is the initial probability of each state.
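To make these definitions concrete, the sketch below represents λ = (A, B, π) and scores an observation sequence with the standard forward algorithm. This is a minimal illustration, not the LTI-Lib implementation used in this thesis: emissions are reduced to a single univariate Gaussian per state (a one-component mixture), and the class name and fields are assumptions made for the sketch.

```python
import math

class HMM:
    """Minimal continuous HMM lambda = (A, B, pi), with one univariate
    Gaussian per state standing in for the Gaussian mixture b_j(x)."""
    def __init__(self, A, means, variances, pi):
        self.A = A                  # N x N matrix, A[i][j] = P(q_{t+1}=S_j | q_t=S_i)
        self.means = means          # per-state Gaussian mean
        self.variances = variances  # per-state Gaussian variance
        self.pi = pi                # initial state probabilities pi_i

    def b(self, j, x):
        """Emission probability b_j(x) of observing x in state j."""
        v = self.variances[j]
        return math.exp(-(x - self.means[j]) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

    def forward(self, observations):
        """Task 1: probability of an observation sequence given the model."""
        N = len(self.pi)
        alpha = [self.pi[i] * self.b(i, observations[0]) for i in range(N)]
        for x in observations[1:]:
            alpha = [self.b(j, x) * sum(alpha[i] * self.A[i][j] for i in range(N))
                     for j in range(N)]
        return sum(alpha)
```

A two-state left-right model, for instance, could be instantiated with A = [[0.7, 0.3], [0.0, 1.0]] and scored against a short observation sequence.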
There are three standard ways in which HMMs are used in practice:
(1) to determine the probability of occurrence of an observation sequence given a hidden Markov model.
(2) to determine the most probable state sequence explaining an observation sequence for a given hidden Markov model.
(3) to adjust the model parameters in order to maximize the likelihood of given observation sequences.
The first task is useful for assigning a score to each considered model, given an observation sequence. It will be used in the recognition stage of the gesture recognition system. The second task reveals the underlying structure of hidden states that occur for a given sequence and might be useful to characterize a gestural expression. However, the choice of features in the current thesis does not require knowing the state sequence. The third task allows hidden Markov models to be built using
1 From this point on, continuous hidden Markov models with Gaussian mixture emission probabilities are considered. For a detailed description of discrete hidden Markov models, see Rabiner [76].
training data. The goal of this offline process is to optimize the HMM parameters in order to find the best fit with the given training data.

FIGURE 3.1. Typical hidden Markov model and its constituents
Tasks 1 and 3 will be described in further detail in the following sections. In the current implementation, each gesture is associated with an HMM whose number of states depends on the training phase, up to a maximum of seven states.2 In the present case, the transition matrix is very sparse, with non-zero entries only for the transitions from a state to itself and to its immediate successor. This specialization of the hidden Markov model is called a linear or left-right HMM and can be represented as in Figure 3.1, where the arrows entering a circle (state) are transition probabilities.
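A left-right transition matrix of this kind can be sketched as follows. This is an illustrative construction, not taken from the thesis implementation; the self-transition probability is an arbitrary assumed value.

```python
def left_right_matrix(n_states, p_stay=0.6):
    """Build a linear (left-right) HMM transition matrix: each state may
    either stay where it is or advance to its immediate successor."""
    A = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states - 1):
        A[i][i] = p_stay             # self-transition
        A[i][i + 1] = 1.0 - p_stay   # advance to the next state
    A[n_states - 1][n_states - 1] = 1.0  # final state absorbs
    return A
```

Every row sums to one, and all entries off the diagonal and first superdiagonal are zero, which is exactly the sparsity pattern described above.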
3.2. Choosing the Feature Vector
Before hidden Markov models can be used, an operation must extract the most meaningful information from the input data stream. This operation produces the feature vector, which constitutes the data passed to the hidden Markov model as observation vectors. In the current context, the raw input data comes from the mouse, data glove or video cameras. The gesture recognition system is based on position capturing systems so that these devices can all use the same recognition algorithm, for later comparison of results. The format of these input streams is a vector of floating point numbers, either two- or three-dimensional.
Different kinds of feature vectors have been considered in order to determine which is best suited for gesture recognition in the current context. Some characteristics that the feature stream must provide to a corresponding trained model are the following:
2 Seven is an arbitrary value, determined empirically.
• the model must be translation independent in order to be able to perform the gesture anywhere in the environment. The target position of the gestural expression can be recovered by looking at the data points that constitute the gesture. It is also possible to find an "application point", or gesture localization, by averaging the positions of all gesture points.
• the model must be velocity independent in order to be able to recognize a gesture regardless of the execution time needed to accomplish it. The velocity of execution can be recovered by looking at how many samples the gesture is composed of, given a fixed sample rate.
• the model must be size independent in order to recognize gestures spanning arbitrarily large areas. The size of a gesture can be recovered by looking at the bounding box of a particular sequence.
Features      Independent of                Dependent on                  Comment
x, y, z       Velocity                      Translation, size, rotation   User should always perform the gesture at the same place
dx, dy, dz    Translation                   Velocity, rotation            Velocity dependence not wanted
r, θ, φ       Translation, size             Rotation, velocity            Velocity independent if r is not used
dr, dθ, dφ    Translation, size, rotation   Velocity                      Rotation independence not wanted
TABLE 3.1. Comparison of different feature vector selections
These requirements aim at reducing the actual number of gesture models. For instance, two circles performed at different velocities and sizes will be recognized as the same gesture, but will have different size and velocity parameters when passed to the interaction manager, as described in Section 4.6. These requirements tend to eliminate several possible features, as seen in Table 3.1.
One feature vector that has proven to be appropriate is the direction of the vector spanned by two consecutive positions, as seen in Figure 3.2. An input data vector at time t will be denoted p_t = (x_t, y_t, z_t). In 2D, the angle is calculated with the following equation:

θ = arctan((y_t - y_{t-1}) / (x_t - x_{t-1}))    (3.1)

In 3D, the two angles calculated are:

θ = arctan((y_t - y_{t-1}) / (x_t - x_{t-1}))    (3.2)
FIGURE 3.2. Sample gesture features in two dimensions

φ = arccos((z_t - z_{t-1}) / r), where r = ‖p_t - p_{t-1}‖    (3.3)
This feature vector is velocity independent, since the movement vector's magnitude r is not taken into account. It is also size independent for the same reason. The feature vector's only dependence is on rotation, which is in fact desirable because the rotation parameter cannot be recovered as easily as the velocity and size.
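Equations 3.1-3.3 can be sketched directly. This is a minimal illustration; `atan2` is used instead of a plain arctangent so that the quadrant of θ is preserved, an implementation detail the equations leave implicit.

```python
import math

def features_3d(p_prev, p_cur):
    """Compute the (theta, phi) direction features of Equations 3.2-3.3
    from two consecutive 3D positions p_{t-1} and p_t."""
    dx, dy, dz = (c - p for c, p in zip(p_cur, p_prev))
    r = math.sqrt(dx * dx + dy * dy + dz * dz)  # r = ||p_t - p_{t-1}||
    theta = math.atan2(dy, dx)                  # Eq. 3.2 (quadrant-aware)
    phi = math.acos(dz / r)                     # Eq. 3.3
    return theta, phi
```

Note that scaling the displacement leaves both angles unchanged, which is precisely the size and velocity independence claimed above.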
Lee [59] takes a similar approach, choosing the angle of the vector spanning the movement as a feature, except that he quantizes the angles into sixteen steps. An advantage of quantizing values is that computing the observation probability does not require Gaussian calculations, but in some cases it might be hard to distinguish two very similar gestures. Cao [15] suggests quantizing the area that a gesture spans, such that every point of a gesture sequence is located in a quantized area. Using this type of feature, it is possible to recover complex speed-dependent gestures. However, for continuous gesture recognition, extra processing is needed in order to correctly segment gesture start and end points with an arbitrary number of frames considered for recognition. Similar results concerning three-dimensional feature selection can be found in Campbell's work [14].
3.3. Training Algorithms
The goal of a training algorithm is to build a gesture model for later recognition, given known pre-segmented gestures. A well-known HMM training method is the Baum-Welch or forward-backward algorithm [5,76]. It finds the model parameters that maximize the probability of occurrence of given observation sequences. One drawback of the Baum-Welch method is that it optimizes the model parameters over all possible state sequences, rather than only considering the most likely one [52]. Another HMM training algorithm proposed by Juang and Rabiner [52], called the segmental K-means algorithm, alleviates this problem. Instead of finding the best model that matches observations over all state sequences, it finds model parameters that optimize the score only for the best state sequence found with the Viterbi algorithm [34], which also returns the associated score. The HMM parameters are then re-estimated until they converge to optimal values [52], given a certain threshold. LTI-Lib [79] was chosen as an appropriate implementation of hidden Markov model data structures and their associated algorithms.3 The HMM training method used in the current implementation is the segmental K-means algorithm.
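The Viterbi step at the core of segmental K-means can be sketched as follows. This is an illustrative log-domain implementation over a discrete emission table, not the LTI-Lib code used in this thesis; continuous Gaussian emissions would replace the table lookup.

```python
import math

def viterbi(pi, A, B, obs):
    """Most likely state sequence and its log-score for an observation
    sequence `obs`, given initial probabilities pi, transition matrix A
    and a per-state emission table B[state][symbol]."""
    N = len(pi)
    log = lambda p: math.log(p) if p > 0 else float("-inf")
    delta = [log(pi[i]) + log(B[i][obs[0]]) for i in range(N)]
    back = []
    for x in obs[1:]:
        prev = delta
        # For each destination state j, record the best predecessor state.
        step = [max(range(N), key=lambda i: prev[i] + log(A[i][j])) for j in range(N)]
        delta = [prev[step[j]] + log(A[step[j]][j]) + log(B[j][x]) for j in range(N)]
        back.append(step)
    best = max(range(N), key=lambda j: delta[j])
    path = [best]
    for step in reversed(back):  # trace the best path backwards
        path.append(step[path[-1]])
    path.reverse()
    return path, delta[best]
```

Segmental K-means would then re-estimate each state's parameters from the observations that the returned path assigns to it, and iterate until the score converges.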
The training process for the gesture recognition system relies on several repetitions of a given gesture as input to the trainer.4 This process is also called supervised learning, since the user who provides the data knows how reliable it is. Providing isolated gestures is a requirement for such a training process, which involves an external clutching mechanism that notifies the system when a gesture starts and ends. When using a mouse for gesture input, the depression and release of the mouse button offers such a mechanism explicitly. However, in the case of vision-based gesture recognition, which is not supposed to use any external device for data grabbing, a user needs a way to indicate the beginning and end of a gesture. Hence, a keyboard-based user interface has been implemented that allows one to manage the training process: creating new gesture models, indicating begin and end points of gestures, saving and loading files, and deleting models. The training process is thus different from the continuous recognition stage, as the latter does not need to be told the beginning and end points of gestural expressions.
3.4. Continuous Gesture Recognition
This section describes how gestures are automatically recognized and extracted from an input data stream. Continuous gesture recognition has recently been studied [16,101,102], and is defined in the literature as a process by which gestures are isolated from an input data stream without needing to specify start and end gesture points explicitly. Kendon found that when a gesture is performed, there is a preparation and a retraction phase that are relatively easy to detect [54]. Quek [73] establishes several rules, based on the observation of expressive gestures, that constrain the
3 LTI-Lib is an open-source object-oriented library built in C++ that uses STL containers as a basis for its data structures. This library provides algorithms and data structures that are often used in computer vision.
4 The number of repetitions is typically on the order of twenty, so that users are not overloaded.
recognized gestures to have a certain start and end position, as well as predefined movement phases, in order to help gesture segmentation. One goal here is not to impose any starting position or constraint on gestures, since they must be performed as naturally as possible. Hence, there is a need for a gesture spotting algorithm that detects the end of the most probable gestures and draws conclusions accordingly. The next sections present the algorithm that is used to generate hypotheses from the input data and known gesture models, and a description of the gesture spotting algorithm.
3.4.1. Hypotheses Generation Algorithm. The usual way to recognize an isolated
gesture is to apply the Viterbi algorithm to every trained model and to select the one with the largest
score.5 Obviously, the normalized highest score should be higher than a normalized threshold in
order to be valid.
For a continuous recognition system, it is unfortunately impossible to apply the Viterbi algorithm directly, because the starting point of a gesture is unknown. The system must therefore generate hypotheses every time a new feature vector is added to the recognizer's feature vector input stream. Gesture hypotheses are ranked using an inverse log-likelihood scoring strategy, keeping the best gesture end hypothesis at every time step. A gesture end hypothesis is created when a gesture hypothesis has reached the last state of its associated HMM. The procedure as implemented in LTI-Lib is summarized in Algorithm 3.1.
Algorithm 3.1 Gesture hypotheses generation
  expand the current best hypothesis to obtain a reliable pruning value
  active hypotheses ⇐ generate new hypotheses from every known valid hidden Markov model
  perform the Viterbi step on all active hypotheses
  new pruning threshold ⇐ prune the active hypotheses with bucket sort
  keep track of the gesture end hypotheses in a trace-back field
For every input feature vector, new hypotheses are generated and a score is calculated in order to prune hypotheses that are not likely to have occurred. The pruning threshold is calculated such that a limited number of hypotheses is maintained.6 Even if the scores are not on the same scale, they are compared against each other, which typically increases the likelihood of longer gestures.
5 In the present case, the inverse of the log-probability is used; the most likely model is therefore the one with the lowest score.
6 Typically, 100 hypotheses are kept.
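The pruning step can be sketched as follows. This is an illustrative version only: hypothesis expansion and the bucket sort used by LTI-Lib are abstracted into a simple top-K selection, and the hypothesis tuple layout is an assumption made for the sketch. Scores are inverse log-likelihoods, so lower is better.

```python
def prune_hypotheses(hypotheses, max_kept=100):
    """Keep only the best `max_kept` hypotheses. Each hypothesis is a
    (score, gesture_id, start_time) tuple scored by inverse
    log-likelihood, so lower scores are more likely."""
    ranked = sorted(hypotheses, key=lambda h: h[0])
    kept = ranked[:max_kept]
    # The new pruning threshold is the worst score still retained.
    threshold = kept[-1][0] if kept else float("inf")
    return kept, threshold
```

The returned threshold plays the role of the "new pruning threshold" in Algorithm 3.1: any hypothesis generated later that scores worse can be discarded immediately.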
3.4.2. Gesture Spotting Algorithm. Considering hypothesis scores exclusively does not allow for detecting a gesture and its endpoints. A fixed threshold on hypothesis scores would not provide reliable recognition, since these scores vary for every gesture model. A gesture spotting algorithm is therefore needed in order to isolate a gestural expression. Lee [59] developed a threshold-based gesture spotting algorithm that takes every node of all known hidden Markov models to form a threshold model. Its conditions for spotting gestures are well defined, but extra processing is necessary in order to compute the threshold model's score, which can slow down performance. An alternative algorithm that uses the topology of hidden Markov models was developed in the course of this thesis. Gestures are spotted based on Algorithm 3.2, wherein a gesture end refers to the termination of a gesture.
Algorithm 3.2 Gesture spotting
  initialize a stack of gesture end hypotheses
  if the current best hypothesis has reached the last state of its model and it is the most likely gesture then
    push the hypothesis on the stack
  if the hypothesis becomes less likely, or the best hypothesis' inner state is not the last model state then
    the gesture is spotted and the stack is emptied
  if the most likely hypothesis is not the one on the top of the stack then
    if the most likely hypothesis and the hypothesis on the top of the stack have the same size then
      pop the stack and push the new hypothesis
    else
      spot the gesture and empty the stack
The topology of an HMM, with the movement's angle as the feature vector, gives an idea of the gesture's shape, because each state represents a part of the gesture. When the end state is reached, the current best gesture hypothesis is likely to be true. The latter is a valid assumption for gestures whose complexity is such that they are unlikely to occur during random movements of a user. The end of a gesture sequence is found using Algorithm 3.2.
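A simplified reading of this spotting logic can be sketched as follows. The class, the hypothesis fields and the same-size replacement rule are assumptions made for the sketch; the actual implementation tracks considerably more hypothesis state.

```python
class GestureSpotter:
    """Simplified sketch of the stack-based spotting rule: track the best
    gesture end hypotheses over time, and report a spotted gesture when
    the best hypothesis stops being a gesture end."""
    def __init__(self):
        self.stack = []    # pending gesture end hypotheses
        self.spotted = []  # gestures reported so far

    def update(self, best):
        """`best` is the current best hypothesis: a dict with 'gesture',
        'at_last_state' and 'length' keys (assumed fields)."""
        if best["at_last_state"]:
            if self.stack and self.stack[-1]["length"] == best["length"]:
                self.stack[-1] = best  # same-size hypothesis replaces the top
            else:
                self.stack.append(best)
        elif self.stack:
            # The best hypothesis left its model's end state: spot the gesture.
            self.spotted.append(self.stack[-1]["gesture"])
            self.stack.clear()
```

Feeding the spotter one best hypothesis per time step yields a spotted gesture each time the recognizer's best candidate falls out of its model's end state.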
On the other hand, the start point of the gesture hypothesis is known, since its length is incremented every time a new feature vector is added. However, the starting point is not always accurate, especially when the first HMM state is the most likely for a long period of time, as illustrated in the left of Figure 3.3, in which the gesture was performed with a long preceding trail. Incorrect starting point detection occurs because of the similarity between the trail and the most likely feature vector of the first HMM state, leading to false gesture localization. This is a consequence of the hypotheses pruning algorithm, which favors longer gestures.

FIGURE 3.3. Spotting of two circle gestures (left: wrong spotting, right: correct spotting), whose localization ("X") is supposed to be at the circle's center. The spotted starting point is the dot's location
There are several solutions to this problem, one of which would be to impose user constraints on the manner in which the beginning of a gesture should be performed, as seen in the right of Figure 3.3. This would, however, increase the cognitive load on users, since they would have to remember to perform a special notch in the trajectory before performing the actual gesture. A compromise would be to indicate to the user at every moment where the system thinks the starting point of a gesture is located. This could easily be done by drawing an indicator of the starting point on the gesture's fading tail trajectory. Users would then be able to predict when a spotting error is about to occur and take corrective measures accordingly. The ideal solution is, however, to use gesture models that are sufficiently different from random movement that no confusion is possible for the recognizer.
3.5. Gesture Input Modalities
The following section presents various input modality software modules that were implemented
over the course of this thesis. It should be noted that every implementation shares the same interface,
as specified in the framework's description in Chapter 4.
3.5.1. Mouse-based Gesture Recognition. A mouse is a popular pointing device used in everyday life by most people who work with desktop computers. It is in fact a two-dimensional relative hand position tracker. In many applications, such as web browsers [63,85] or games [48], mouse gestures are used to perform common operations (e.g. forward, back, new window, expressing a creature's emotions). However, in some of these systems, gestures are triggered by a mouse button; they are therefore not continuously recognized. Continuous gesture recognition can be used to detect motion patterns that a user performs when moving the mouse on the desktop or in the virtual world. Unlike usual systems, a continuous gesture recognition system should not, by definition, use mouse buttons to trigger the starting and ending points of the performed gesture.
Multiple mice can be used in the current implementation, which makes bimanual mouse interaction possible. For that purpose, the RawMouse [22] Windows API must be called, whose methods are only available on Windows XP. They are used to bypass software drivers in order to receive raw movement and button status from multiple mice. Similarly, on UNIX systems, X events can be used to capture mouse movement from several mice at a time. In order not to overload the gesture recognizer, the mouse frame rate is downsampled from 125 Hz to a maximum of 30 Hz.7 More details on bimanual interaction are presented in Guiard's work [44]. For example, bimanual interaction could be used to localize gestures accurately, using one hand to perform the actual gesture and the other to localize the gesture application point.
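Such rate limiting can be sketched as a simple time-based filter. This is an illustrative version, assuming timestamped samples in seconds; the actual implementation operates on the live event stream.

```python
def downsample(samples, max_rate_hz=30.0):
    """Drop samples that arrive sooner than 1/max_rate_hz after the last
    kept one, e.g. thinning a 125 Hz mouse stream to at most 30 Hz.
    `samples` is a list of (timestamp_seconds, position) pairs."""
    min_interval = 1.0 / max_rate_hz
    kept, last_t = [], None
    for t, pos in samples:
        if last_t is None or t - last_t >= min_interval:
            kept.append((t, pos))
            last_t = t
    return kept
```

One second of 125 Hz input thus yields on the order of 25 to 30 samples, keeping the recognizer's load bounded regardless of the device rate.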
3.5.2. Glove-based Gesture Recognition. Data gloves, such as CyberGloves [20], and other position capture devices have been used in numerous systems. They allow for rich interaction because the position they capture is reasonably accurate, and most of them include bending sensors on the fingers. The glove interfaced in the course of this thesis is the P5 Glove, which is intended to be used by video game players [31]. It provides 3D position, orientation, finger bending as well as button press input data. The position and orientation are triangulated by sensors on the glove that receive infrared signals from an emitting "tower". At least three of the glove's infrared sensors must be in the field of view of the tower's signals in order to obtain reliable data. Since the data coming directly from the P5 glove is less accurate than that of more sophisticated devices, a filtering stage is needed. A Kalman filter as well as a morphological filter were implemented in order to clean the noisy measurements received from the P5 drivers.8
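For illustration, a scalar Kalman filter for one position coordinate could look like the following sketch. This is a textbook constant-position formulation, not the filter implemented for the thesis; the process and measurement noise values are arbitrary assumptions.

```python
class Kalman1D:
    """Scalar constant-position Kalman filter: smooths one noisy position
    coordinate, as could be done per axis on P5 glove data."""
    def __init__(self, q=1e-3, r=1e-1):
        self.q = q    # process noise variance (assumed)
        self.r = r    # measurement noise variance (assumed)
        self.x = 0.0  # state estimate
        self.p = 1.0  # estimate variance

    def update(self, z):
        self.p += self.q                # predict (state assumed constant)
        k = self.p / (self.p + self.r)  # Kalman gain
        self.x += k * (z - self.x)      # correct with measurement z
        self.p *= (1.0 - k)
        return self.x
```

Running one such filter per axis would smooth the jitter in the triangulated 3D positions before they reach the feature extractor.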
One of the drawbacks of this kind of glove is the area that can be covered, which is limited by the length of the cable linking the glove to the tower. Gesture recognition is, for the moment, performed only with the 3D positions, without considering finger bending. However, with the architecture that will be described in the next chapter, recognizing static hand gestures is possible.
7 125 Hz is the maximum mouse sampling rate on Windows XP.
8 The filters were implemented by François Dinel from the CVSL at Laval University.
FIGURE 3.4. Vision-based gesture recognition setup
3.5.3. Vision-based Gesture Recognition. A general description of the video-based gesture recognition system can be seen in Figure 3.4. After being digitized by frame grabbers, images are passed to an edge-based tracker that analyses and keeps track of the extracted edges. Time-differenced motion detection is used to find moving entities from one frame to the next. A background training stage is also performed in order to differentiate background from foreground edges. Colour blobs inside the edges are taken into account, and are used as a validation stage for object consistency.
For the moment, in order for the user's hands to be tracked, one has to wear coloured gloves (blue and green), so that the tracker is able to disambiguate the two hands. As the tracking of skin-coloured blobs becomes more reliable, a user will not have to wear any accessories at all. However, since the tracker is still under development, this technique was chosen as being the least invasive for a user. The colour blobs are tracked over time and their 2D positions are sent to a camera integrator that generates 3D positions given two distinct camera views.
The relative topology of Camera 1 (in front of the user) and Camera 2 (above the user), shown in Figure 3.4, allows for the extraction of 3D positions. Camera 2 is used to recover the x and z coordinates, whereas Camera 1 is used to recover the y coordinate. No stereo matching or camera calibration is done for the moment, which leads to imprecise hand positioning. Therefore, gesture models are only valid for a particular placement of the cameras, and each time the cameras are moved, a new gesture training phase is needed.
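The camera integration step described above can be sketched as follows. This is an illustrative combination rule only, since no calibration or stereo matching is performed; the function name and coordinate conventions are assumptions made for the sketch.

```python
def integrate_views(front_xy, top_xz):
    """Combine two uncalibrated 2D blob positions into a 3D position:
    the top camera (Camera 2) supplies x and z, while the front camera
    (Camera 1) supplies y. Without calibration the two cameras' x
    estimates will disagree, so the top camera's x is used as-is."""
    x, z = top_xz
    _, y = front_xy
    return (x, y, z)
```

This also makes the limitation explicit: the result is only consistent for the particular camera placement the gesture models were trained with.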
The network communication of the blob positions is ensured by a "server library" that runs a daemon on each of the computers involved in the process.9 A naming service is used to facilitate the connection between two peers involved in the data transmission process. Message data consists of plain text commands formatted to be understood by the receiver. Commands and data are sent with the principle of remote method invocation (RMI) in mind, but with less flexibility and ease of use, since the marshalling is not done automatically. The network communication would benefit from using middleware such as CORBA [68] or similar RMI systems [105]. This issue will be addressed in Section 6.2, which discusses future work.
3.6. Choosing Gestures
A natural gesture is defined as a motion pattern that people tend to use in their everyday interpersonal communications. Recognizing natural gestures performed by users in immersive environments is the ultimate goal for a gesture recognition system that aims at natural interaction between the user and the system. It is, however, quite hard to determine which naturally performed gestures bring meaningful information to the system. Apart from sign language, there is no convention regarding a gesture set and semantics that would allow people to communicate with each other. There are, however, symbolic gestures recognized within cultures (e.g. the "okay" sign, a waving hand) that are meant to support expressive communication or to grab someone's attention when it is impossible to do so with speech. The main use of gestures in everyday life is the deictic form. Deictic or pointing gestures are used to refer to spatial characteristics of objects. They are heavily utilized to draw other people's attention to spatial details or to give orders like "put that there".
Systems exist that make exclusive use of gestures to control a virtual world. Baudel and
Beaudouin-Lafon developed CHARADE [2]. This system, however, has very few available commands,
which are chosen to be as "natural" as possible. For more complex systems in which commands
are more numerous, the user's cognitive load becomes an issue as the number of gestures to
remember increases. This is particularly true when no mapping exists between the command and a
real-world gesture. To alleviate this problem, user interface widgets can be used. We have developed
the pieglass [11], a widget controlled by bimanual pointing gestures. Pieglasses map commands
that do not have a common gesture representation to pointing gestures that are used to select
tools and actions. An advantage of having a layer between gestures and commands in the virtual
world is that the cognitive load imposed on the user is lessened. However, for gestures that can
be mapped to real-world actions, it is clearly advantageous for a user to execute the natural action
directly, without going through the widget, as long as the gesture recognition system is reliable.

9The "server library" was developed by Jeremy R. Cooperstock.
Choosing gestures for reliable interaction is of crucial importance because a gesture recognizer
is never perfect. Recognition errors always occur, either through confusion with other gestures or
because the current sequence is too dissimilar from the model, in which case no sequence is
recognized at all. In general, gestures must be sufficiently distinguishable from one another, so
that a user does not have to invest a large effort in disambiguating the recognition system. Chosen
gestures should also be distinguishable from non-gestures, so that ordinary movement is not
recognized as a known gesture.
3.7. Conclusion
To conclude the gesture recognition module: hidden Markov models are used to recognize
gestures performed by users when controlling a virtual environment. Most of the work focuses on
iconic and deictic gestures. Iconic gestures are used as triggers for commands that do not
necessarily have a natural mapping to a gesture. An example of a natural mapping would be
drawing a circle around a virtual object in order to select it. An example of a non-natural mapping
would be a gesture performed to texture an object, as users would be unlikely to agree on a
common, appropriate representation for a texturing gesture. Deictic gestures are used to point at
objects and move them around in the virtual space.
It is clear that gestures alone are not the best way to interact with a computer, as indicated
by the fact that in everyday life people use multiple modalities to communicate with each other.
Language (speech and writing) is the most structured form of such communication; it has to be
learned and practiced for a person to become proficient at it. Language is good for describing and
naming things, but is quite limited for spatial description. This is why multimodal systems exist:
gestures are used for operations involving spatial parameters, whereas speech is preferred for
giving commands. Both can be used simultaneously, but the two modalities then need to be
integrated in a consistent way, based on a predefined grammar [96].
Although a more accurate recognizer makes for a better user experience, interaction with a
virtual environment does not rely on the recognizer alone. The software that includes the recognizer
plays a large role in the realism of a user's virtual experience. One requirement is a mapping from
each performed gesture to the resulting action in the virtual environment. In order to map tokens
in an input stream onto virtual objects effectively, a flexible and generic software architecture is
needed to ease the configuration of how the system responds to user actions. The architecture
designed to meet these goals is described in the following chapter.
CHAPTER 4
Software Framework
The generality constraint of the proposed software architecture comes from the following
requirements that users might have. Many input modalities may be exploited by different users,
and several output modalities may be employed in the rendering of a virtual environment. The
desired mapping of gestures or other input modalities to consequences on virtual objects can vary
across users. Finally, additional input or output devices can be deployed at runtime. A software
framework is intended to solve the problems of generality, flexibility and extensibility. The basis
of the architecture is a set of classes that define interfaces generic enough to fit the user's needs,
and provide mechanisms that help a programmer specialize the framework for different
applications. Every user-specific component is placed in a dynamically linked library that is loaded
at runtime, as specified in the software initialization process.
The presented software was written in C++ to take advantage of mechanisms such as
polymorphism and inheritance that render the architecture more flexible. Most of the libraries that
provide functionality to the system were only available in C++, which constitutes another motivation
for implementing the framework in that programming language. In order to be portable across most
operating systems, the software uses the ACE (Adaptive Communication Environment) OS wrapper
library [81] to perform operations, such as network sockets or threads, that are not standard on all
operating system platforms. The Unified Modeling Language (UML) was used to model the software
and visualize the relations between the different software components. Several UML class diagrams
are included in this chapter, showing the class relations as well as the most important attributes
and methods. For a quick reference on the UML graphical notation, see the Object Management
Group (OMG) specifications [43]. XML (eXtensible Markup Language), a text format that employs
the tag/attribute (or markup) metaphor to represent tree-like data structures, is used as a
configuration mechanism to facilitate application development. For the reader who is not familiar
with the associated notation, see Annex A.
In this chapter, the implemented virtual environment software framework is presented with an
exhaustive description of all its constituents and configuration mechanisms.
FIGURE 4.1. UML framework's architecture
4.1. Overview of the Framework's Architecture
The framework's software architecture can be seen in Figure 4.1 in logical UML specification.
The class of type Instance is the interface class that needs to be instantiated by a user's application
program. It actually defines an instance of a virtual world as well as the data pipeline that extends
from input to output modalities, including the different actions that can be applied to the entities
that compose the world.
In Figure 4.1, the different software component managers are shown: the InputManager,
OutputManager, InteractionManager, WorldManager and NetworkManager. Every manager can be
configured using actual code or with an XML file, as described in Section 4.9. The configuration
data corresponding to each manager is passed to an initialization method called init. The role of
each manager is briefly described below:
• InputManager: this manager instantiates and starts input modalities, according to the
user's specifications.
• InteractionManager: this manager links input and output modalities. In the initialization
function, a grammar is configured and corresponding actions are instantiated. At runtime,
this manager spawns a thread that processes data received from input modalities and
influences the virtual world according to the grammar loaded in the initialization method.
• NetworkManager: this manager handles networking operations, namely the transmission
of input events, world coherence among multiple instances and the management of
resources shared among participants. It can instantiate a server or connect to a remote
virtual environment while handling incoming connections and received data.
• OutputManager: this manager instantiates and configures every output modality that
the user needs. Examples of these modalities are 2D or 3D displays, a 3D sound system
or a haptic device.
• WorldManager: this manager instantiates and initializes every world entity of which the
virtual world is composed. It is also used to set the behaviours and add corresponding
properties to world entities, a procedure described in detail in Section 4.5.
As can be seen from the previous descriptions, most of the managers are only used during the
initialization stage or to ensure the coherence and management of the data structures. Nevertheless,
they play a crucial role in the proper operation of the system.
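As a concrete illustration, the initialization flow described above can be sketched in C++. The manager and class names follow the text, but the configuration type, the method bodies and the per-manager configuration keys are simplified stand-ins, not the thesis implementation.

```cpp
#include <map>
#include <string>

// Illustrative configuration type: the real framework passes XML-derived
// data to each manager's init method.
struct Config : std::map<std::string, std::string> {};

// Common manager interface: each manager receives only its own section
// of the configuration in init.
class Manager {
public:
    virtual ~Manager() {}
    virtual bool init(const Config&) { m_ready = true; return true; }
    bool ready() const { return m_ready; }
private:
    bool m_ready = false;
};

class InputManager : public Manager {};
class OutputManager : public Manager {};
class InteractionManager : public Manager {};
class WorldManager : public Manager {};
class NetworkManager : public Manager {};

// Instance owns the five managers; init forwards each manager its own
// configuration subsection (keyed by a hypothetical manager name here).
class Instance {
public:
    bool init(const std::map<std::string, Config>& cfg) {
        return m_input.init(get(cfg, "input"))
            && m_interaction.init(get(cfg, "interaction"))
            && m_network.init(get(cfg, "network"))
            && m_output.init(get(cfg, "output"))
            && m_world.init(get(cfg, "world"));
    }
    const Manager& world() const { return m_world; }
private:
    static Config get(const std::map<std::string, Config>& c,
                      const std::string& k) {
        auto it = c.find(k);
        return it == c.end() ? Config{} : it->second;
    }
    InputManager m_input;
    OutputManager m_output;
    InteractionManager m_interaction;
    WorldManager m_world;
    NetworkManager m_network;
};
```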
Figure 4.2 shows the dynamics of the data pipeline, where clouds are worker threads. Every time
an event occurs on the input side, an input token is emitted and added to the interaction manager's
message queue. The latter call is asynchronous, as it is invoked from an input modality's thread.
When an input token is put in the queue, the interaction manager parses it in the processing thread.
The actions that match criteria, as described in Section 4.6, are then applied to world entities, which
are aggregated in the World class that is itself a WorldEntity object type. In order to obtain feedback
from the virtual world to the real world, output modalities have to be instantiated and entities
rendered. The latter is accomplished by calling the render method on an output modality that will
subsequently call the render method on every entity contained in the world. Each entity knows
how to render itself in a given output modality. The entity's properties are used to set variable
FIGURE 4.2. Detailed view of the data pipeline
parameters. The rendering process is generally initiated with a call coming from the Instance class,
which owns a reference to the output modalities.
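The asynchronous hand-off between an input modality's thread and the interaction manager's processing thread can be sketched with a thread-safe message queue. This is a generic illustration of the pattern described above, with tokens reduced to strings; it is not the framework's actual code.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Thread-safe queue: push (the producer side, called from an input
// modality's thread) is asynchronous; pop blocks the consumer until a
// token is available.
class TokenQueue {
public:
    void push(const std::string& token) {
        { std::lock_guard<std::mutex> lk(m_mx); m_q.push(token); }
        m_cv.notify_one();
    }
    std::string pop() {
        std::unique_lock<std::mutex> lk(m_mx);
        m_cv.wait(lk, [this] { return !m_q.empty(); });
        std::string t = m_q.front();
        m_q.pop();
        return t;
    }
private:
    std::mutex m_mx;
    std::condition_variable m_cv;
    std::queue<std::string> m_q;
};

// Drain n tokens on a worker thread, standing in for the interaction
// manager's parse/apply loop; the main thread plays the input modality.
std::vector<std::string> processTokens(TokenQueue& q, int n) {
    std::vector<std::string> applied;
    std::thread worker([&] {
        for (int i = 0; i < n; ++i) applied.push_back(q.pop());
    });
    q.push("moveCursor");  // emitted from the "input modality"
    q.push("circle");
    worker.join();
    return applied;
}
```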
4.2. AData, Generic Data Container
The AData library provides the fundamental data storage mechanism used throughout this
description, and is a prerequisite for having a framework whose internal data format is universal.1
An AData is a generic data container that allows a user to have a common data type for storing
variables whose type is fixed only at runtime. That is, all specialized data structures inherit from
the AData class. As seen in the UML diagram of Figure 4.3, an AData object holds a pointer to a
class of type ADNode, the concrete data container, which is reference-counted. The AData class and
its specializations contain all the methods needed to access the data pointer; it therefore implements
the bridge design pattern, which separates the interface from its implementation [37]. The AData
class also manages the container's memory allocation, acting as a smart pointer for the concrete
data types.
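The bridge/smart-pointer arrangement can be illustrated with a toy reconstruction. The AData, ADNode and ADNString names come from Figure 4.3, but the member signatures here are simplified guesses, not the real library's interface.

```cpp
#include <string>
#include <utility>

// Concrete, reference-counted data node (the implementation side of the
// bridge).
struct ADNode {
    int refcount = 1;
    virtual ~ADNode() {}
    virtual ADNode* Clone() const = 0;
};

// One concrete specialization: a string container.
struct ADNString : ADNode {
    std::string value;
    explicit ADNString(const std::string& v) : value(v) {}
    ADNode* Clone() const override { return new ADNString(value); }
};

// Interface side of the bridge: AData acts as a smart pointer that
// manages the node's reference count.
class AData {
public:
    explicit AData(ADNode* n = nullptr) : m_ptr(n) {}
    AData(const AData& o) : m_ptr(o.m_ptr) { if (m_ptr) ++m_ptr->refcount; }
    AData& operator=(AData o) { std::swap(m_ptr, o.m_ptr); return *this; }
    ~AData() { Release(); }
    bool IsValid() const { return m_ptr != nullptr; }
    ADNode* ptr() const { return m_ptr; }
    // Decrement the count, deleting the node when it reaches zero.
    int Release() {
        if (!m_ptr) return 0;
        int r = --m_ptr->refcount;
        if (r == 0) delete m_ptr;
        m_ptr = nullptr;
        return r;
    }
private:
    ADNode* m_ptr;
};
```

Copying an AData shares the underlying node and bumps its count; the node is freed only when the last AData referring to it is destroyed.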
There are several formats in which data can be stored. At the moment, two types are supported:
STL (ASTL, from the Standard Template Library) and XML (AXML). The former data classes inherit
from both ADNode and an STL data container class such as map, set, vector or list. There is also a
class template for simple types, namely double, float, integer and string. The XML format, on the
other hand, is used to store data in a document object model (DOM) whose document type is
predefined. It is possible to read XML files into the AXML data format and vice versa. In order for
XML data to be used in the application, converters exist for transforming data from AXML to ASTL.
Similarly, ASTL data can be converted to AXML, which is used for data serialization. The structure
data type does not exist in ASTL, but it is possible to build a structure-like data type with the
template Map<AData, AData>. The first template argument should be of type String, being the
identifier, and the second can be of any type, being the member.

FIGURE 4.3. AData library class structure

1The software library AData, which stands for APIA Data, was developed at the Computer Vision and Systems Laboratory (CVSL), Laval University, as part of the APIA (Actor, Property, Interaction Architecture) project [3].
XML was chosen as the format in which an AData is stored in files because it is a human-readable
text format and it makes it easy to include AData in more complex configuration files, as described
in Section 4.9. XML parsers such as Xerces [35] also allow for data validation, provided a document
type definition (DTD) or a more sophisticated descriptor such as XML Schema, which is an essential
feature for complex data representation.
In addition to being helpful for file input/output, AXML is used as the data format transmitted
over the network. It certainly introduces a significant amount of overhead due to the redundancy
of information in the XML format, but, where necessary, it would be possible to write a converter
to serialize and deserialize packets such that the quantity of data pushed over the network is
reduced.
The abstract factory design pattern is used to instantiate custom converters from dynamically
loaded libraries. Hence, facilities are provided to extend the library for users' needs. The rationale
for using the AData library in the current framework comes from the fact that the information that
passes through the data pipeline is not known at compile time, but is configured by users at
runtime. More details on where AData is used in the current framework are presented in the
following sections.
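A minimal sketch of such a converter factory follows. In the real framework, the makers would be registered by dynamically loaded libraries; here registration is done directly, and the Converter interface and all names are illustrative, not the thesis API.

```cpp
#include <cctype>
#include <functional>
#include <map>
#include <memory>
#include <string>

// Illustrative converter interface (the real converters transform
// between AXML and ASTL representations).
struct Converter {
    virtual ~Converter() {}
    virtual std::string convert(const std::string& in) const = 0;
};

// Factory in the spirit of the abstract factory described above:
// converters are created by name from registered maker functions.
class ConverterFactory {
public:
    using Maker = std::function<std::unique_ptr<Converter>()>;
    void registerMaker(const std::string& name, Maker m) {
        m_makers[name] = std::move(m);
    }
    std::unique_ptr<Converter> create(const std::string& name) const {
        auto it = m_makers.find(name);
        return it == m_makers.end() ? nullptr : it->second();
    }
private:
    std::map<std::string, Maker> m_makers;
};

// A trivial converter used only to exercise the factory.
struct UpperCaseConverter : Converter {
    std::string convert(const std::string& in) const override {
        std::string out = in;
        for (char& c : out)
            c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
        return out;
    }
};
```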
4.3. Modalities
Considerable research effort has been invested in multimodal systems in the past decades [10,
55, 96]. Multimodal interaction allows for rich communication between the user and the system.2
Modalities, whether inputs or outputs, are usually associated with devices (e.g. data glove, video
camera, mouse, video card, sound card), whose interfaces are not trivially adaptable to every
encountered application. This section presents class interfaces that aim at providing an abstraction
such that several input and output modalities share a common data format and pipeline.
4.3.1. Input Modalities. Systems such as VRPN [90] make the differences between several
devices transparent to a user through a common interface, but leave the problems of data pipeline
and virtual world management unresolved. Unlike VRPN, the current software framework integrates
input and output modalities as well as the virtual world in a standard data pipeline that is
configurable at runtime. A VRPN stream could, however, be used as an input modality of the
current framework, given a data converter that transforms data from VRPN data types to the
AData format. In fact, any kind of event-based input modality is technically supported, though
building input tokens introduces an overhead.
Figure 4.4 shows the class structure of a vision-based gesture recognition input modality,
implemented to meet the associated design goal and to demonstrate the flexibility of the proposed
architecture. The DynamicGestures class inherits from the InputModality class and implements the
abstract method emitToken. In order for the input modalities to be as generic as possible, a
recognizer is aggregated to the modality instead of being inherited. The DynamicGestureRecognizer
class also inherits from the ACE_Thread class in order to spawn a worker thread that reads data
coming from devices or from the network. However, the class is not specialized enough to implement
the svc method, which is the ACE_Thread's entry point method. This is an example of the adapter
design pattern [37], which allows classes to work together even if they have different interfaces. In
the present case, it decouples the ACE_Thread interface from InputModality.

FIGURE 4.4. Input modality class structure example

2Multimodal systems however need a modality integrator; this is ongoing work in the SRE laboratory and is planned to be integrated with the current framework.
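The adapter arrangement can be sketched as follows. ThreadBase stands in for ACE_Thread (the real ACE class spawns an OS thread in activate; here svc is simply called directly for brevity), and the other names are illustrative simplifications of the classes in Figure 4.4.

```cpp
#include <string>
#include <vector>

// Stand-in for ACE_Thread: subclasses implement the svc entry point.
struct ThreadBase {
    virtual ~ThreadBase() {}
    virtual int svc() = 0;            // worker-thread entry point
    int activate() { return svc(); }  // real ACE spawns a thread here
};

// Input modality interface, deliberately independent of any threading
// API; it only knows how to emit tokens.
struct InputModality {
    std::vector<std::string> emitted;
    void emitToken(const std::string& id) { emitted.push_back(id); }
};

// Adapter: the recognizer inherits the thread interface and is merely
// aggregated by the modality, so InputModality never depends on
// ThreadBase.
class GestureRecognizer : public ThreadBase {
public:
    explicit GestureRecognizer(InputModality& m) : m_modality(m) {}
    int svc() override {
        // A real implementation would read device or network data here;
        // we just pretend a gesture was recognized.
        m_modality.emitToken("circle");
        return 0;
    }
private:
    InputModality& m_modality;
};
```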
Next in the class hierarchy is the HMMGestureRecognizer, which inherits from ltiHMMOnlineClassifier,
the hidden Markov model classifier implementation of the LTI-Lib library [79]. The svc method is
not implemented in this class either, because it is still hardware independent and not specialized.
It only provides a hidden Markov model interface that can be used by any kind of device that
needs such a recognizer. The addOnlineData method, with its data vector parameter, should be
called in order to initiate the recognition process. More details on the gesture spotting and
recognition algorithm can be found in Section 3.4.2.
When a gesture is recognized, the corresponding input token is emitted by a call to the emitToken
method. Finally, the concrete implementation of the svc method belongs to the
VisionBasedHMMGestureRecognizer class. The thread method starts a server and waits for incoming
data, calling static call-back functions corresponding to the incoming messages. Feature vectors are
calculated in the performRecognition method and then fed back to the associated recognizer in order
to continue the recognition process.
A data logging mechanism was implemented so that users can diagnose problems in the system,
or simply play back data that was recorded using the same logging mechanism. The InputModality
class finally allows for interfacing numerous kinds of devices that can be accessed locally or through
the network, depending on the available hardware configuration.
FIGURE 4.5. Output modality class structure example
4.3.2. Output Modalities. On the other side of the data pipeline are the output modalities,
which are meant to represent virtual data in the real world. In desktop-based computer systems,
the monitor is often the only output channel through which a user receives information from the
virtual world. In virtual reality, three-dimensional display systems are typically used along with
immersive environments in order to render a virtual world as realistically as possible, such that a
user will feel immersed. Several systems limit users' sense of immersion to visual effects [53, 64].
However, vision is not the only human sense through which virtual data can usefully be rendered:
sound is a straightforward way to provide feedback to the user, and haptic devices are becoming
very reliable at realistically rendering tactile effects. It is therefore important for a multimodal
framework to be adaptable to different needs and to provide a sufficiently generic way of passing
data, such that it could theoretically handle any kind of output modality a user desires.
The base class from which concrete output modalities inherit is OutputModality (Figure 4.5).
The interface method to be implemented is renderWorld, which should be specific to every modality
type. The default operation is to call the render method on every entity for which rendering is
needed, since it is possible to disable the rendering of an entity in order to hide it from the real
world. Output modalities own an optional rendering behaviour member attribute, such that only
entities owning this behaviour will be rendered. More details on the behaviour mechanism are
presented in Section 4.5.
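The default rendering logic just described can be sketched as follows. The entity and modality types are reduced to the minimum needed to show the render-flag check and the optional behaviour filter; they do not mirror the real class interfaces.

```cpp
#include <set>
#include <string>
#include <vector>

// Simplified entity: a render flag plus a set of behaviour names.
struct WorldEntity {
    std::string name;
    bool renderEnabled = true;
    std::set<std::string> behaviours;
};

// Default renderWorld behaviour: render every enabled entity, filtered
// by the modality's optional rendering behaviour (an empty string means
// "render everything").
class OutputModality {
public:
    explicit OutputModality(const std::string& behaviour = "")
        : m_renderingBehaviour(behaviour) {}

    std::vector<std::string> renderWorld(const std::vector<WorldEntity>& world) {
        std::vector<std::string> rendered;
        for (const auto& e : world) {
            if (!e.renderEnabled) continue;  // entity hidden from this world
            if (!m_renderingBehaviour.empty()
                && !e.behaviours.count(m_renderingBehaviour)) continue;
            rendered.push_back(e.name);      // a real modality would draw here
        }
        return rendered;
    }
private:
    std::string m_renderingBehaviour;
};
```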
The output modality example shown in Figure 4.6 is an implementation of an OpenGL three-
dimensional display. The implementation of the renderWorld method sets the perspective view
matrix for correctly displaying 3D objects, calls the output modality's renderWorld method and sets
the view matrix back to its original value. Similarly, a 2D OpenGL display will first set the
orthographic view, render the entities that need to be rendered in 2D and restore the original view
matrix.
FIGURE 4.6. Sequence diagram for the rendering call on a three-dimensional OpenGL display output modality
In order to use the already implemented OpenGL output modalities, a user first has to create
an OpenGL rendering context and call the renderWorld method from the thread in which the
rendering context was instantiated. For example, in a Microsoft Windows MFC implementation,
calls to the rendering functions have to be performed in the view's OnDraw method, which originates
from the main loop. Pointers to output modalities are previously retrieved from the output
manager. In a GLUT application, calls to the rendering methods occur from the draw function,
which also originates from the main program's loop.
No other types of output modalities were implemented in the course of this thesis, but all the
building blocks are in place in order to do so. Rendering calls are not specific to particular devices
or software interfaces; thus, several kinds of systems could be adapted to the one provided by the
current framework.
4.4. Input Tokens Principle
It was mentioned earlier that input tokens are emitted from input modalities and subsequently
parsed by the interaction manager. In this section, details are provided on the composition of input
tokens, and their role in the generality of the framework is explained.
FIGURE 4.7. Input tokens UML class representation
The InputToken's UML class specification can be seen in Figure 4.7. A list of the most important
parameters that define an input token's content follows:
• m_tokenID is the string that identifies a token. It is usually meaningful to a user in order
to facilitate debugging.
• m_probability is the probability of occurrence of a given token. It typically ranges from
0 to 1. Log-probability can also be used, in which case the value will be negative.
• m_timeStamp is the time at which a token was emitted. There is no synchronization
between different computers distributed over a network, but it would be possible to
implement a time synchronization service as in VRPN [90] or CORBA's ORB time
service [68].
• m_source identifies the input modality from which a token originates. This attribute is
typically used when tokens are integrated for multimodal interaction. It is a way of
knowing whether a token was emitted by speech, gesture or another input modality.
When a token comes from a remote computer, "remote_" is prepended to the source
string in order to indicate that the token's origin is not the local host.
• m_data is used as a generic parameter container of type AData. Any kind of AData,
which implements most commonly used types, can be stored as a parameter in an input
token. This allows for passing the context of occurrence of an event as well as any other
information that could be of interest to the token parser and the action's invocation
method.
• m_needPublish identifies whether or not a token needs to be published on the network
if there are client connections. By default, a token is not published, because it might not
be relevant for remote virtual environments. The publication variable can, however, be
set with the publish method.
• m_instance is a token instance number that is set in order to keep track of the sequence
in which tokens are created by the different input modalities and modality integrators.
The input token interface allows for serialization and deserialization to and from AData. This
feature is particularly useful when reading tokens from streams such as files or the network, since
AData is convertible to and from XML. Insertion and extraction operators are provided in order to
write a token to a stream or read a token from a stream.
The rationale behind the use of such generic tokens is that the types and members of the data
structures, which have to be specified at runtime, are not known beforehand. Since AData allows
for the composition of most commonly used data structures, the overhead that such a container
brings to the system is an acceptable price for a general knowledge representation.
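A field-for-field sketch of the token described above might look like the following; the fields follow Figure 4.7, but the generic AData payload is approximated by a plain string map, and the method bodies are illustrative rather than the thesis code.

```cpp
#include <map>
#include <string>

// Simplified InputToken mirroring the fields listed above.
struct InputToken {
    std::string tokenID;        // identifies the token, e.g. "circle"
    double probability = 1.0;   // 0..1, or negative if log-probability
    unsigned long timeStamp = 0;
    std::string source;         // originating modality, e.g. "mouseTracker"
    std::map<std::string, std::string> data;  // stands in for m_data (AData)
    bool needPublish = false;   // not published on the network by default
    int instance = 0;           // sequence number across modalities

    void publish() { needPublish = true; }

    // Tokens received from another host get "remote_" prepended to their
    // source so their origin can be distinguished from local tokens.
    void markRemote() { source = "remote_" + source; }
};
```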
4.4.1. Input Token Example. An example of input token use is the following: suppose
a user performs a circular shape with the mouse. The MouseHMMGestureRecognizer class will emit
a "moveCursor" token every time the physical mouse is moved. The data parameter associated
with this token is the current position (a vector of length 2) and the identification string of the
mouse that moved.3 When the "circle" gesture is complete and has been recognized, with the
method described in Chapter 3, as being an actual "circle" gesture, the corresponding token will
be emitted with the center and amplitude of the performed gesture stored as parameters in the
AData container. In the case of a "circle" gesture, the amplitude parameter is the diameter of the
performed circle. The two parameters are represented using double precision numbers, the center
being a two-dimensional vector and the amplitude a single number. These are stored in a Map that
has Strings as keys, respectively "applicationPoint" and "amplitude". After an idle time, typically
half a second,4 a "still" token will be emitted with the current mouse position as a parameter. All
the previously mentioned tokens will have their source attribute set to "mouseTracker" in order
to identify their origin.
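Building the "circle" token from this example might look like the following sketch. The parameter names, token ID and source string come from the text; the AData Map is approximated by a map from strings to vectors of doubles, and the Token type itself is an illustrative stand-in.

```cpp
#include <map>
#include <string>
#include <vector>

// Payload approximation: parameter name -> vector of doubles.
using Params = std::map<std::string, std::vector<double>>;

struct Token {
    std::string id;
    std::string source;
    Params params;
};

// Assemble the "circle" token as described in the example: the gesture's
// center ("applicationPoint", a 2D vector) and its diameter ("amplitude",
// a single number), emitted by the mouse tracker.
Token makeCircleToken(double cx, double cy, double diameter) {
    Token t;
    t.id = "circle";
    t.source = "mouseTracker";
    t.params["applicationPoint"] = {cx, cy};
    t.params["amplitude"] = {diameter};
    return t;
}
```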
Another example, of a token emitted from a speech recognition algorithm, would be the following:
suppose the utterance "Delete the blue chair" is recognized. The speech recognizer is assumed to
be able to separate a sentence into its words and to parse a grammar that has a dictionary
containing word categories. Therefore, the emitted token will be identified as "delete" and the
parameters will have the values "blue" and "chair", corresponding to "colour" and "objectType"
as parsed by the grammar. The rest of the processing is computed in the interaction manager,
which will invoke actions associated with the "delete" token on the corresponding entities.

3The format of the mouse identification string is "mouseX" where X is the mouse identification number.
4Half a second is an empirical, arbitrary value.
4.5. World, World Entities and How to Manage Them
A conceptual model of the virtual world is needed to help the framework's developers understand
how to approach a given problem, such that they can adapt their specific application to the current
system. The proposed world model is inspired by APIA (Actor-Property-Interaction Architecture),
a conceptual model developed at Laval University by Bernier et al. [3].
4.5.1. Components Description. In APIA, Actors are abstract virtual objects (e.g. a
submarine), and do not contain any concrete attributes. Properties implement the actor's attributes
and can be of any type, as specified by the AData (e.g. mass, volume). Interactions are the links
between the actors (e.g. the Archimedean force), where calculations occur in order to modify the
properties further. Other characteristics that define an actor are its characters. A Character is a
group of properties that define the dynamic characteristics an actor might be able to exhibit
(e.g. floatable). Characters also define other relationships between interactions and actors that
allow for more general management of the invocation of interactions.
However, as the APIA architecture is still under development and remains too complex for
the currently intended applications, we devised a simpler alternative for our purposes. Based on
the APIA concepts, a simplified conceptual model for the virtual environment was built, though it
is more restrictive, being oriented toward multimodal-based world management (Figure 4.8). Actors
become the World and WorldEntities, Properties remain the same concept, Interactions become
Actions and Behaviours share some characteristics of APIA's Characters. A detailed description of
the pieces that compose the proposed virtual world model is presented in the enumeration below:
• WorldEntity: a world entity is the world itself and any constituent of the world, as seen
in Figure 4.8. It is something that must be rendered by an output modality, and its
properties modified by actions.
FIGURE 4.8. World, world entities and actions class structure
• Property: a property is a named AData that is aggregated in a world entity in order to provide it with some attributes that will be exploited by the actions, as well as by output modalities in the rendering method. Properties are stored in a Map whose key is the property's name, contained in a character string.
• Behaviour: a behaviour is defined as the way a world entity should react to input tokens.
It is a characteristic that defines the behaviour of an object in the virtual world. One
or more properties are associated with every behaviour, such that an entity with a given
behaviour will necessarily own those properties.
• Actions: an action may be viewed as the implementation of one or many behaviours. It is where the actual calculations and property changes are made. Actions are applied to entities by taking the input token's AData attributes and adjusting the entity's properties according to what the doApply method implements. An action is associated with an input token's identification string, which will trigger its invocation under certain conditions when emitted.
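The structure described in the enumeration above can be illustrated by a minimal sketch. The following C++ fragment is not the framework's source: member names and signatures are assumptions inspired by Figure 4.8, and properties are reduced to plain doubles instead of polymorphic AData values.

```cpp
// Hypothetical sketch of the simplified world model: an entity carries a
// name, a string-keyed Map of properties, and a set of behaviours.
// Member names follow the text; signatures are illustrative assumptions.
#include <map>
#include <set>
#include <string>
#include <utility>

using Property = double;  // stand-in for the polymorphic AData type

class WorldEntity {
public:
    explicit WorldEntity(std::string name) : m_name(std::move(name)) {}

    // Properties are stored in a map whose key is the property's name.
    void addProperty(const std::string& key, Property value) { m_properties[key] = value; }
    Property getProperty(const std::string& key) const { return m_properties.at(key); }

    // A behaviour declares how the entity may react to input tokens.
    void addBehaviour(const std::string& behaviour) { m_behaviours.insert(behaviour); }
    bool hasBehaviour(const std::string& behaviour) const {
        return m_behaviours.count(behaviour) != 0;
    }

private:
    std::string m_name;
    std::map<std::string, Property> m_properties;
    std::set<std::string> m_behaviours;
};
```

An action would then read the entity's properties through this interface and modify them in its doApply method.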
4.5.2. Examples. An example of a configured three-dimensional model world entity can be
seen in Figure 4.9. It owns the behaviours "pickable", "translatable", "deletable" and "rotatable". It should be clear that these behaviours are respectively associated with the actions "pick", "translate", "delete" and "rotate", among others, since the action-to-behaviour mapping is not necessarily one-to-one. Properties are also part of the requirements that allow the previously mentioned actions to be applied to the 3D model. In the current example, three different behaviours rely on the "position" property, which in fact is unique. This property is therefore shared between the different actions.

[Figure content: a "3D model" entity owning the behaviours "pickable", "translatable", "deletable" and "rotatable", with the properties "picked" (Boolean), "position" (Vector3), "rotation" (Vector3) and "type".]

FIGURE 4.9. 3D model configuration example
A concrete example of an action is "rotate" that is activated by the "moveCursor" token when
in "rotating" state. In the doApply method, the mouse position is retrieved, having prior knowledge
that the given parameter is stored under the name "position" in the input token's parameters. The
"rotation" property of the affected world entity is also retrieved, and the correspondence between the position attribute and the rotation property is made, with the appropriate calculations and mappings.
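Such a doApply implementation can be sketched as follows. The helper name doApplyRotate and the linear pixels-to-degrees mapping are invented for illustration; the actual framework operates on AData attributes rather than plain doubles.

```cpp
// Hedged sketch of a "rotate"-style doApply: read the "position" parameter
// from the input token and map it onto the entity's "rotation" property.
// The gain of 0.5 is an invented placeholder; the thesis only states that
// "appropriate calculations and mappings" are made.
#include <map>
#include <string>

using Params = std::map<std::string, double>;

// Applies the (hypothetical) rotation mapping in place.
void doApplyRotate(const Params& tokenParams, Params& entityProps) {
    const double gain = 0.5;                       // assumed pixels-to-degrees factor
    double position = tokenParams.at("position");  // parameter name fixed by convention
    entityProps["rotation"] = position * gain;     // property name fixed by convention
}
```

Note that the parameter name "position" and the property name "rotation" must be agreed upon by the input modality, the action and the entity configuration, as discussed below.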
As noted in the previous paragraph, a considerable amount of information has to be known
before using the proposed architecture. Input modalities, actions, world entities and behaviours
have to be consistent in the way each of their properties' and parameters' names and types match.
Documentation on the currently implemented components is presented in Annex B.
4.6. Interaction Manager
The interaction manager is the heart of the framework because it is where decisions are made
as to whether or not an action will be applied to an entity given an input token. The interaction
manager's principle was briefly discussed in earlier paragraphs, but a more detailed description of
the process by which actions are triggered by input tokens is presented in this section.
FIGURE 4.10. Interaction manager and auxiliary classes
As can be seen in Figure 4.10, the InteractionManager class is linked to the InputToken, WorldEntity and Action classes. The InteractionManager class inherits from ACE_Thread, an object-oriented implementation of a thread from the ACE OS wrapper library [81]. The ACE_Thread class provides, among others, methods for the management of a FIFO data structure in which messages are queued using the putq method and dequeued using the blocking getq method, which unblocks when a new message is put in the queue. The getq calls are invoked from the svc method, which is the thread's entry point. The svc method does not exit its loop until a "NULL" message is put in the queue, which happens when a call to fini is made.
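The putq/getq mechanism can be illustrated with a simplified stand-in built on the C++ standard library rather than ACE; a null message plays the role of the "NULL" shutdown sentinel that fini puts in the queue. This is a sketch of the semantics, not the ACE implementation.

```cpp
// Simplified stand-in for the putq/getq message queue: a thread-safe FIFO
// whose getq blocks until a message is available. A null message signals
// shutdown, mirroring the "NULL" message that terminates the svc loop.
#include <condition_variable>
#include <memory>
#include <mutex>
#include <queue>
#include <string>

class TokenQueue {
public:
    using Message = std::shared_ptr<std::string>;  // nullptr means "shut down"

    void putq(Message m) {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_queue.push(std::move(m));
        }
        m_cond.notify_one();
    }

    // Blocks until a message is available, like the blocking getq call.
    Message getq() {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cond.wait(lock, [this] { return !m_queue.empty(); });
        Message m = std::move(m_queue.front());
        m_queue.pop();
        return m;
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cond;
    std::queue<Message> m_queue;
};
```

A svc-style loop would repeatedly call getq and exit as soon as the null sentinel is dequeued.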
Algorithm 4.1 Interaction manager's token parsing algorithm

  Get a token from the queue (getq)   {an input modality asynchronously puts a token in the queue}
  parseToken:
    for every action do
      if token ID = action activation ID and action activation state = system state then
        active actions list ⇐ action
    for every active action do
      for every world entity do
        if world entity owns all behaviours and behaviour data match then
          potential entities list ⇐ world entity
      valid entities list ⇐ validate the potential entities against the action
      for every valid entity do
        apply the action to the current entity
    Publish the token
As can be seen in the pseudocode of Algorithm 4.1, the getq method returns when an input token is put in the message queue, which consequently executes parseToken. This latter method will
retrieve the token's identification string and compare it with every action's activation string. When an action's activation string matches the token's identification string, the action is put in a list of active actions that could be applied if its invocation state also matches the current system state. The interaction manager's "state" attribute is the mechanism chosen to impose a context on the actions' execution. The state is stored internally as a character string that is set from the configuration file and adjusted every time an action is applied successfully. If the state condition is set to "any", the action is applied regardless of the system's state. An example of a state condition is "translating", which allows actions such as "translation" and "drop" to be applied.
The next step in the validation process is to find the world entities that are possibly affected by
the active actions triggered by the incoming token. The first test is to verify whether the entity owns every behaviour the action needs in order to be applied. An entity should possess all behaviours that an action necessitates; otherwise, errors will occur at run time if properties required by the action are not owned by the world entity, or if their types do not match. Secondly, the data associated with
each of these behaviours and the input token's data must be equal in order to validate the action's
execution. An entity's behaviour data is stored internally in a Map whose key is the behaviour's
character string. A typical example of behaviour data use is when drawing or moving cursors on
the screen. Tokens are sent every time a mouse event occurs with one of their parameters being the
cursor's identification string. The cursor drawing action has a behaviour data that should match the
token's cursor identification data string. This verification ensures that if there are multiple mouse
instances, a virtual cursor will move only when the identification strings match.
A valid entity is additionally one whose properties match the action's prerequisites for invocation. Once all the possibly valid entities have been verified against the input token, they must be confirmed against the action. This process takes place in the validateEntities call of the Action class. The rationale for the validation process is that several actions might have the same criteria as to which entities among several are suitable for modification. It is therefore an obvious method of code reuse and easier error tracking, since the next step of the action execution process is to apply the action separately to every valid entity. The action's invocation is the last step of the data pipeline that ranges from the input modalities to the virtual world. This is where the input token's data members are considered, and used to modify the entities' properties according to what the action actually
implements. Finally, an input token can be published on the network if peers are connected. This last step will be explained in Section 4.8.

[Figure content: the data saving process copies an entity's properties into an action tracker that is pushed on the undo list; the undo process copies the tracker's saved properties back into the entity and moves the tracker to the redo list.]

FIGURE 4.11. Data saving and undo processes
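The token-parsing pipeline of Algorithm 4.1 and the behaviour checks described above can be condensed into the following sketch. All types are simplified (behaviours reduced to string sets, behaviour data and property validation omitted) and the function and field names are illustrative only, not the framework's API.

```cpp
// Minimal sketch of the token-parsing pipeline: select actions whose
// activation token and state condition match, then keep only the entities
// that own every behaviour the action requires.
#include <set>
#include <string>
#include <vector>

struct SimpleAction {
    std::string activationToken;
    std::string stateCondition;        // "any" matches every state
    std::set<std::string> behaviours;  // behaviours a target entity must own
};

struct SimpleEntity {
    std::string name;
    std::set<std::string> behaviours;
};

// Returns names of entities to which some active action may be applied.
std::vector<std::string> parseToken(const std::string& tokenID,
                                    const std::string& systemState,
                                    const std::vector<SimpleAction>& actions,
                                    const std::vector<SimpleEntity>& entities) {
    std::vector<std::string> valid;
    for (const auto& action : actions) {
        if (action.activationToken != tokenID) continue;
        if (action.stateCondition != "any" && action.stateCondition != systemState) continue;
        for (const auto& entity : entities) {
            bool ownsAll = true;  // the entity must own every required behaviour
            for (const auto& b : action.behaviours)
                if (entity.behaviours.count(b) == 0) { ownsAll = false; break; }
            if (ownsAll) valid.push_back(entity.name);
        }
    }
    return valid;
}
```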
The interaction manager class provides other utilities that help with entity management. Entity locking and unlocking methods are provided so that actions requiring exclusive access to an entity are able to lock and unlock it. Locking is done per network instance to ensure data coherence, which
means that requests for locking pass through the network communication process, described in detail
in Section 4.8.
The last feature the interaction manager provides to the system is an undo/redo facility. Since
the current system can be used for human-computer interaction, it is important to provide a way
for users to undo actions that are unwanted or incorrectly performed after recognition. Undoing
actions is the consequence of a basic Hel principle from Nielsen [66J who urges designers to "help
users recognize, diagnose, and recover from errors". Providing undo facilities helps users to recover
from errors since they can go back in the history of applied actions.
As seen in Figure 4.10, the interaction manager contains two lists of pointers to the ActionTracker class. Instances of this class are used to store entities' properties temporarily before an undoable action is applied to the entity. Figure 4.11 shows the process by which properties are saved and stored
in the undo and redo lists aggregated in the interaction manager. A pointer to an ActionTracker
object is pushed on the undo list when a user performs a reversible action. Similarly, the last
ActionTracker object in the undo list is popped, updated and pushed back on the redo list when an
action is undone. The redo list is emptied when a new action tracker is pushed on the undo list.
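The undo/redo bookkeeping just described can be sketched as follows. The snapshot is reduced to a plain string here, whereas the actual ActionTracker stores a map of an entity's saved properties; the class and method names are assumptions.

```cpp
// Sketch of the undo/redo lists: a snapshot is pushed on the undo list
// before an undoable action runs; undoing moves it to the redo list, and
// any new action clears the redo list.
#include <cstddef>
#include <string>
#include <vector>

class UndoRedo {
public:
    // Called before an undoable action is applied to an entity.
    void recordAction(const std::string& snapshot) {
        m_undo.push_back(snapshot);
        m_redo.clear();  // a new action invalidates the redo history
    }
    bool undo() {
        if (m_undo.empty()) return false;
        m_redo.push_back(m_undo.back());
        m_undo.pop_back();
        return true;
    }
    bool redo() {
        if (m_redo.empty()) return false;
        m_undo.push_back(m_redo.back());
        m_redo.pop_back();
        return true;
    }
    std::size_t undoDepth() const { return m_undo.size(); }
    std::size_t redoDepth() const { return m_redo.size(); }

private:
    std::vector<std::string> m_undo;
    std::vector<std::string> m_redo;
};
```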
4.7. Taking Advantage of the Context
Research has been published on the influence of the context on recognition rates and system
performance [16,50,62,65,80]. In fact, the context of an application provides important clues
as to what a user could be doing in the virtual environment at every moment. The context can
be defined as the information set that influences observations. This could include the user's orientation towards objects, the objects' position and state in the virtual world, the last performed action, or any clue that would help the system predict which actions are the most likely to occur
next. Observations, on the other hand, make up the data set that originates from input modalities
in order to come to a decision at a given moment. For example, if it is known that a user selected
a virtual object in the world, it is likely that the next commands will be applied to that object.
Those commands are known since an object has a finite set of applicable actions determined by the
associated behaviours.
Many techniques exist in order to retrieve the context out of a virtual environment. It is possible
to consider the user's state progression and simultaneously examine the objects' state in order to
draw a relation between the two that would define whichever action is likely to happen. For example,
it could be observed that a user's hand targets a defined virtual object just by looking at the hand
position's trajectory in space, and interpolating the target position the user is trying to reach. This
method is interesting because it uses the movement dynamics in order to predict actions that are
likely to happen. However, this technique is only applicable to gesture-based interactions since it
would not be possible to obtain any kind of context from, for instance, raw speech dynamics. Using a grammar can be an interesting way of providing context to a system. There are many types of grammars, among which some are stochastic and others are simply implemented in the style of a deterministic finite state machine (FSM).
As seen in Figure 4.12, the input modalities and the interaction manager take part in the context-grabbing process. It is the input modality's role to call the context grabber's method getEmissibleTokensIDs, which is meant to provide a list of tokens that can be emitted, given the current context. The information collected from the context grabber is typically passed to the recognizers in order to influence the initial probability of known models.

[Figure content: the ContextGrabber class, holding references to the InteractionManager (with its action list) and to the input modalities, exposes a single method, getEmissibleTokensIDs.]

FIGURE 4.12. Context grabber's class interface
The HMMGestureRecognizer is the currently implemented recognizer that takes advantage of
the context. Hidden Markov models of known gestures are stored in a list and identified by their
corresponding emitted tokens. When context information is not used, all models have the same initial probability of occurrence in the hypothesis generation stage. Therefore, if two gesture models are similar enough to confuse the recognizer on a given gesture sequence, recognition errors will arise more often, even though only one of the two gestures would have made sense in the current context. However, when context is provided, constraints are applied on the initial hypotheses in order to restrict the number of models for which an associated gesture can occur. This restriction has two positive effects: first, fewer recognition errors are likely to happen; second, the recognition process computes faster because, instead of generating hypotheses for every model, only those that can occur given the context are considered, thus reducing the number of hypothesis likelihoods to compute.
The question now is how to know which tokens are likely to be recognized, or, in other words, how to build the context. In the current framework, the context grabber's implementation uses a finite state machine (FSM) in order to get the conditions in which events can occur. The state attribute is stored in the interaction manager object as a character string. The default state of the machine is "idle", from which actions can take it to another value while being applied to the world entities. The state condition that an action must meet in order to be part of the context is stored in the Action class as a character string. The state in which the interaction manager is to be set after an action's invocation is also stored in the Action class instances. The latter two attributes are user-defined.
For every call to getEmissibleTokensIDs on the context grabber, the action list is parsed. The activation token of each action whose state condition matches the interaction manager's current state is pushed onto the list of token identification strings that can be emitted at the moment. It should be noted that if an action can be triggered regardless of the interaction manager's state, its activation state should be set to "any". The activation token of an action having "any" as its state condition is always added to the list of tokens that can be emitted, and is hence context independent.
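A minimal version of this walk over the action list might look as follows. The representation of actions as (activation token, state condition) pairs is a simplification of the Action class and is an assumption for illustration.

```cpp
// Sketch of the context grabber: any action whose state condition equals
// the current state, or is "any", contributes its activation token to the
// list of tokens that can be emitted in the current context.
#include <string>
#include <utility>
#include <vector>

// (activation token, state condition) pairs stand in for Action objects.
using ActionEntry = std::pair<std::string, std::string>;

std::vector<std::string> getEmissibleTokensIDs(const std::string& currentState,
                                               const std::vector<ActionEntry>& actionList) {
    std::vector<std::string> emissible;
    for (const auto& [token, condition] : actionList)
        if (condition == "any" || condition == currentState)
            emissible.push_back(token);  // this token may be recognized now
    return emissible;
}
```

The resulting list is what a recognizer would use to constrain its initial hypotheses, as described above.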
4.7.1. Example. A concrete example of the context-grabbing process using the gesture input modality happens when a virtual object has just been picked for translation. The interaction manager's state is immediately set to "translating", allowing the "translate" and "drop" actions to be
applied. The "translate" action's role is to make the virtual object follow the virtual cursor, which is
being displayed by "moveCursor" and "traceCursor" actions. When the "drop" action is triggered,
the interaction manager's state is set back to "idle". While in "translating" state, the virtual cursors
indicating the trackers' position keep moving since the corresponding actions are activated regardless
of the interaction manager's state, their activation condition being set to "any".
It would be possible to build more sophisticated context grabbers using information originating from the virtual world as well as from the user's status. Adding a grammar would also improve the context-grabbing feature of the system, since it would allow for more general relations to exist between utterances [62]. For example, suppose a speech recognition system in which every virtual object had a descriptor word naming it. The speech recognition system would then attribute a larger start probability to nouns corresponding to objects present in the virtual world. Likewise, the initial likelihood would increase for verbs whose corresponding actions can be applied to the virtual world's objects. Additional context information could also be of interest given other modalities such as gaze or eye tracking, so that the system would know where the user is looking at every moment, putting more constraints on the most likely actions to be triggered.
4.8. Network Manager
One of the framework's design goals is to provide facilities to share a virtual world between
several people geographically distributed over the planet. Networked virtual environments (NVEs) are known to offer services that interconnect remote environments and allow users to take part in collaborative or competitive experiences.
FIGURE 4.13. Network manager and surrounding classes
Several systems providing networking services that allow for virtual world sharing and coherence have been developed in the past [17,18,41,93,100]. Most of them aim at providing ways to ensure data coherence between different virtual world instances. Researchers have developed sophisticated synchronization and data caching systems in order to use network resources as efficiently as possible and reduce network latency. In the course of this thesis, entity synchronization and world coherence were implemented, taking as an inspiration the work of MASSIVE-3 [42]. Methods are provided that send events as well as entities' properties over the network in order to notify remote instances of status changes. Entity synchronization is implemented such that two users are not allowed to change an entity's property simultaneously. The proposed class architecture can be seen in Figure 4.13.
NetworkManager is the interface class that the Instance object has access to. It inherits from the
ConnectionManager class, which holds a list of connection handlers. ConnectionHandler is the class
whose object instantiations will receive data from or send data to peers. To initiate a connection, the
network manager creates a ClientConnector that connects to the specified server with the connectTo
method. The connection handler then adds the newly created connection to the list and starts a
receiving thread that waits for incoming data on the socket until the connection is lost, after which it unregisters itself. When a peer sends data, the thread is woken up and the connection manager's handleReceivedData callback method is invoked.
In the present case, the concrete implementation of the connection manager is the network manager, whose data handler method is executed. This latter method rebuilds packets that arrive incomplete due to packet splitting over the network, and processes the valid incoming data. Since the network managers exchange data in raw XML format, the detection of packet ends is effortless. It is also trivial to interpret the XML packets because they own an attribute in their root node that specifies the packet type. Currently supported packet categories are "token", "props", "lock", "unlock" and "IDRequest", which are described below:
• token: contains an input token that was serialized and sent over the network
• props: contains some entity's properties that were serialized and sent over the network
• lock: an entity locking is requested
• unlock: an entity unlocking is requested
• IDRequest: a new client is connected and requests its identification number
It is possible to register other types of packets with a NetworkRequestHandler that knows which packet type it is meant to receive. Two options are available to the caller: wait on an event to be signalled when a proper packet arrives, or register a callback function to be called when the corresponding packet type is received. Such packet type registration is used when waiting for call replies. For example, when a new instance asks for its instance ID, it registers a packet of type "IDRequest_ack". The reply packet contains an AData member handled by the receiver. In the present case, the contained data is the actual requested identification number.
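The callback path of this registration mechanism can be sketched as follows. The class and method names are invented for illustration, and only the callback option is shown, not the blocking-on-event option.

```cpp
// Sketch of packet-type registration: a callback is registered per packet
// type (e.g. "IDRequest_ack") and the dispatcher invokes it when a packet
// of that type arrives; unhandled types are reported to the caller.
#include <functional>
#include <map>
#include <string>

class PacketDispatcher {
public:
    using Handler = std::function<void(const std::string& payload)>;

    void registerHandler(const std::string& packetType, Handler h) {
        m_handlers[packetType] = std::move(h);
    }

    // Returns true if a registered handler consumed the packet.
    bool dispatch(const std::string& packetType, const std::string& payload) {
        auto it = m_handlers.find(packetType);
        if (it == m_handlers.end()) return false;
        it->second(payload);
        return true;
    }

private:
    std::map<std::string, Handler> m_handlers;
};
```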
In order to create a server, a call to the makePublic method of the NetworkManager class has to be invoked, which starts a new thread and listens on a specified network socket port. When a client connects to the port, the server accepts the connection and creates a connection handler that will eventually be used to dispatch received data. Clients and servers are respectively implementations of ACE's pattern classes ACE_Connector and ACE_Acceptor [81]. These classes provide utility methods that manage basic network socket functions as well as handlers for incoming events.
[Sequence diagram content: the interaction manager tries to lock an entity, the request propagating through the servers up to the master server, which grabs the lock; the action is then applied and the token or properties are published, the data being forwarded to the other peers, which parse the token or update their properties; the entity is finally unlocked through an asynchronous call.]
FIGURE 4.14. Network manager's sequence diagram
Given the network structural design described above, an explanation of how the coherence between several virtual environments is managed can be seen in Figure 4.14. The network manager
is actually used to ensure consistency between several replicas of a virtual world shared over a
network. Two types of instances exist, which are clients and servers. Clients can connect to servers
and then become servers themselves, to which other clients will be allowed to connect. The "master
server" is the one that is first instantiated, run, and which does not connect to any other server
thereafter. It is first responsible for assigning each client an identification number that is used to
know from which peer data packets originate. The second utility of the master server is to manage
the entity locking strategy.
Before modifying an entity's properties shared amongst multiple instances, it is necessary for an action to "lock" that entity, such that only the instance that owns the lock will be able to modify the given properties. The locks are managed by the master server, which keeps an internal representation of the world entities' locking status. When a "lock entity" call is invoked on a client, a lock request is sent to the associated server, which requests its own server, and so on until reaching the master server. A message is sent back, notifying the caller whether the entity was locked. A drawback of this locking strategy is the lack of fairness among instances, since entities are locked on a first come, first served basis. Another problem with this locking scheme is the time it could take to get a response from the master server; if the number of peers that a packet needs to traverse to finally reach the master server is too large, the latency might be unacceptable. The interaction manager would then be blocked while waiting for the master server's response, which is unwanted behaviour. Solutions exist, however, as proposed by Singhal and Zyda [84], but were not implemented in the course of this thesis, since the current work is focused not on network performance but rather on the software framework's generality.
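The master server's first come, first served lock table can be sketched as follows. The method names and the use of integer instance identifiers (the IDs assigned at connection time) are assumptions for illustration.

```cpp
// Sketch of the master server's lock table: an entity may be locked by
// exactly one instance at a time, locks are granted first come first
// served, and only the holder may release a lock.
#include <map>
#include <string>

class LockTable {
public:
    // Returns true if the lock is granted (or already held by this instance).
    bool tryLock(const std::string& entity, int instanceID) {
        auto it = m_locks.find(entity);
        if (it == m_locks.end()) { m_locks[entity] = instanceID; return true; }
        return it->second == instanceID;
    }

    // Only the lock holder may unlock; returns true on success.
    bool unlock(const std::string& entity, int instanceID) {
        auto it = m_locks.find(entity);
        if (it == m_locks.end() || it->second != instanceID) return false;
        m_locks.erase(it);
        return true;
    }

private:
    std::map<std::string, int> m_locks;  // entity name -> holding instance ID
};
```

In the actual system this table lives only on the master server, and clients reach it through the chain of intermediate servers described above.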
After an action's execution, the input token or the entity's properties can be published to the other instances, depending on the event that occurred. Generally, when the action does not involve an entity property change, the associated input token is published on the network if it is of interest to other peers. Likewise, when an entity's properties are changed during an action invocation, they are serialized and sent to the connected clients. Once a peer's interaction manager receives an input token coming from a remote location, it parses it as if it were coming directly from an input modality. However, a mechanism is implemented in order to warn the remote interaction manager that the token does not originate from the local instance, by putting an indication in the input token's source attribute. On the other hand, when properties are published, the corresponding entity is directly updated with the new values. It should be noted that property updates and token parsing in the remote instances are asynchronous. There is therefore a possibility of losing synchrony for a short period of time, which is acceptable in most cases. The entity unlocking process follows the same pattern as the locking strategy.
The presented network communication algorithms would benefit from optimization in terms of
quantity of data transmitted over the network. Raw XML format is not the most compact form of
data, which leads to overuse of bandwidth. We performed a rough estimate of the bandwidth needed
for transmitting XML data, and found that around five times more bytes must be transported than
when using raw data, based on the specified format. Preliminary tests have been performed by
running multiple instances of the framework on a local area network, showing that the developed
system is able to maintain the virtual world coherence for multiple distributed instances. More
exhaustive testing of the network communications is left as future work, and should ideally involve
communication over the Internet in order to verify the algorithms described above.
4.9. XML Configuration File
This section presents the XML configuration file that a user builds in order to fit the needs of a
given application. In order to be exploited, the various software components must be initialized by
users either with hard coded values or through an XML configuration file parsed by an appropriate
interpreter. The former method does not offer the same flexibility as the latter, because the application must be recompiled every time a value is changed. The use of a configuration file, however, allows for flexibility and ease of use, such that non-experts would be able to build one, eventually with a GUI.
The XML configuration parser must be invoked in order to read the specified file and create
a DOM representation of the configuration parameters. Each of the file's sections is then analysed, which leads to the instantiation of software modules and objects according to the specified values. The XML
configuration file's author must know beforehand what content to specify in the file, requiring the
available components to be well-documented as to which configuration properties they expect from
a user. The different parts that compose the XML file are as follows:
• Input modalities: contains all the input modalities' specifications
• Output modalities: contains all the output modalities' specifications
• World: contains the world entities' specifications as well as behaviours owned by the world object
• Grammar: contains all the actions' specifications
• Network: specifies the network parameters, for clients or servers
The next sections describe the format of each XML file component by presenting fragments of a concrete example that is currently implemented. The experimental application shows a three-dimensional world in which 3D models are placed and their properties modified using a mouse-based gesture recognizer. For a complete XML configuration file, see Annex C.
4.9.1. Input Modality Node. An "InputModality" node is meant to create an input modality that is added to the input manager. The next XML code sequence shows an example of the creation of a mouse-based dynamic gesture recognizer:
<InputModality name="mouseGestures" type="DynamicGestures">
  <AXML>
    <map name="instance">
      <string name="type" value="MouseBasedHMMGestureRecognizer"/>
      <map name="data">
        <string name="dataFile" value="gestures.ges"/>
        <int name="smoothingBuffer" value="3"/>
        <int name="buffer" value="100"/>
      </map>
    </map>
    <string name="mode" value="events"/>
    <int name="frameRate" value="30"/>
  </AXML>
</InputModality>
The base node has two attributes: one is the input modality's name and the other its type. The "type" attribute is passed to an abstract factory class in order to instantiate an input modality of the corresponding kind. The factory then tries to load a dynamically linked library that has the same name as the requested type, plus the file extension.5
The first child node has "AXML" as its tag value. This node is meant to provide initialization data to the newly created input modality. The XML file parser will create an instance of an AData that will be filled with the data contained between the opening and closing AXML tags. In the present case, the created AData is a Map whose two template arguments are of AData type. The map's keys, which are of concrete type String, are found in each child node's "name" attribute. The "type" child node is the parameter for another factory, this time the dynamic gesture abstract factory. In this case, the factory tries to load a dynamic gesture recognizer of type MouseBasedHMMGestureRecognizer, which is to be found in a dynamically loaded library. The sub-child
5The standard naming convention adds a "d" at the end of the library name if the program is compiled in debug mode.
data map contains initialization data for the mouse-based gesture recognizer, namely the data file
that contains gesture models, and the lengths of different buffers used in the recognition process.
If needed, it is obviously possible to add other data, as long as classes that use it are modified
accordingly.
4.9.2. Output Modality Node. The "OutputModality" node is present in the configuration
file in order to instantiate the output modalities that will be added to the output manager.
Here is a typical example of such a node:
<OutputModality name="userInterface" type="Display2D">
  <AXML>
    <int name="frameRate" value="30"/>
    <string name="behaviour" value="drawOnTop"/>
  </AXML>
</OutputModality>
In the same flavour as the input modality node, the output modality node has two attributes: one
that specifies the name and another that specifies the type of output modality a user wants to
instantiate. Output modalities are also created with a factory that will use a dynamically loaded
library if it cannot find the concrete type of the object in the known objects list. The AXML data
node contains data that is to be passed to the modality's initialization function. In this case, a frame
rate as well as a behaviour are specified, which means that in order to be rendered in this 2D view,
world entities will need to own a behaviour called "drawOnTop".
4.9.3. World Node. The "World" XML node specifies the world's content as well as entity
types and behaviours. Entities are contained in "WorldEntity" nodes, whereas entity behaviours
are stored in the "Behaviour" nodes.
<World>
  <Behaviours>
    <Behaviour name="resetable"/>
    <Behaviour name="placeable"/>
  </Behaviours>
  <WorldEntity name="trajectoryRight" type="mouseTrajectory">
    <AXML>
      <map name="color">
        <int name="r" value="0"/>
        <int name="g" value="255"/>
        <int name="b" value="0"/>
      </map>
      <int name="length" value="100"/>
    </AXML>
    <Behaviours>
      <Behaviour name="mouseTraceable">
        <AXML>
          <string name="ID" value="mouse1"/>
        </AXML>
      </Behaviour>
      <Behaviour name="drawOnTop"/>
    </Behaviours>
  </WorldEntity>
</World>
In the previous XML snippet, behaviours "resetable" and "placeable" belong to the world.
These two behaviours are used in order for actions "reset" and "place" to be executed on the World
object. Actions that do not involve particular entities, or that create new ones, should always be
executed on the world. The declaration of a world entity uses the same principle as an input or
output modality for its creation. Specifying the type invokes a call to a factory that instantiates an
object if the type is known. Data that will belong to the entity is then specified in the AXML node.
Converted to AData, AXML nodes are added to the entity in the form of properties. For instance,
the entity named "trajectoryRight" will, after its creation, own the properties "color" and "length"
that are to be used in the rendering process. The "color" property is in fact the colour the mouse
trajectory will be displayed in when a cursor is moving on the screen, and "length" is the maximum
number of displayed data points.
Behaviours are specified as children of a node called "Behaviours". A behaviour is not a
concrete class, but rather a character string and its optional AData, stored in a Map in every
entity. In the above example, the entity owns the behaviours "mouseTraceable" and
"drawOnTop". One thing to notice about the "mouseTraceable" behaviour is the AData that is associated
with it. This AData member is one that the input token must match in order for an action to
be applied to the entity, as explained in Section 4.6. In the present case, it defines the identification
string to which the incoming mouse cursor identifier must be equal. Since more than one cursor's
position may be sent, this mechanism restricts the action's invocation to the intended cursor.
As for the behaviour "drawOnTop", it means that the "userInterface" output modality will enable
the rendering of the mouse trajectory since the required behaviour matches. The "userInterface" in
the present case is the last to be drawn, which results in drawing the corresponding entities on top
of the others.
4.9.4. Action Node. The execution of actions is the core of the framework since it establishes
the relation between the inputs and the state of the virtual world. The following XML file segment is
the description of an action:
<Action type="pick" activationToken="translate" when="idle" becomes="translating">
  <Behaviour name="translationPickable">
    <AXML>
      <vector name="position"/>
      <bool name="picked"/>
    </AXML>
  </Behaviour>
</Action>
An action is created by the action factory, which instantiates an object whose type is found in
the attribute "type". The attributes also provide the activation token parameter, the state condition
("when") and the new system's state ("becomes"). Behaviours and associated properties are then
specified. In the current example, the action is "pick", which is meant to pick a virtual object.
It is activated by the "translate" token ID when the system is in the "idle" state, and puts it in
the "translating" state. Properties that are associated with the behaviour "translationPickable" are
"position" and "picked".
4.9.5. Network Node. In order to configure how the current instance will behave regarding
its network connections, a "Network" node can be included in the XML file. The syntax is the
following:
<Network>
  <Connection type="server" port="76849"/>
or
  <Connection type="client" serverName="localhost" port="76849"/>
</Network>
The network manager (see Section 4.8) is configured given the "type" attribute of the
"Connection" node. In server mode, the "port" attribute specifies the port number the server socket should
listen on. In client mode, the "serverName" attribute specifies the host name of the server the client
should connect to, and the "port" attribute specifies the associated port number. If the "Network"
node is unspecified in the configuration file, the default behaviour is to start a server that listens on
port 20202.
4.9.6. Discussion. As a conclusion on the XML configuration file, it should be noted that
it is the user's responsibility, when writing the file, to ensure that it is coherent and consistent with
the available resources and classes. It would however be possible to build a graphical user interface
that would allow configuring the framework and verifying data coherence after the composition of an XML file.
4.10. Conclusion
In conclusion, a general and flexible framework for multimodal interaction was presented. The
software framework allows several different components to be loaded at runtime in order to
meet the user's specifications, whether hard-coded or defined in an XML configuration file. A virtual
world model is defined as being composed of world entities, each having their own behaviours and
properties. Event-based input modalities emit input tokens that contain an event specification as
well as the parameters associated with it. The interaction manager applies actions on matching world
entities, given an input token. Context is provided as to which input tokens are the most likely to
occur, given observations on the world and the user. Networking facilities are implemented in order
to share a virtual world between multiple geographically distributed users. An input modality of
dynamic gestures was used in order to demonstrate the framework's flexibility, as well as a basic
application, which will be described in the following chapter.
CHAPTER 5
Results and Discussion
In order to test the decisions made throughout the design process, experiments and performance tests
were conducted on the software framework and the gesture recognizer. Experiments on the gesture
recognition's reliability justify the chosen feature vector of a gesture data stream. A functional
application, implemented in order to demonstrate the framework's utility in supporting general
multimodal applications, is described. Performance tests show that the interaction manager is not
a bottleneck when the virtual world contains a large number of entities, and that adequate
performance is maintained on the output side. A discussion of the framework's extensibility and flexibility is then
presented, based on the experience acquired while developing the application.
5.1. Continuous Dynamic Gesture Recognition under Several Conditions
Continuous dynamic gesture recognition is the core input modality implemented for this thesis.
It is therefore important to test and justify the basic design decisions. The choice of a proper feature
vector is the basis of the gesture recognition system since the features determine which information
is important in the input signal coming from the capturing devices. A mouse-based input modality
was used in order to characterize the various feature vectors considered. Gesture recognition rate
was measured with multiple position capturing devices: a mouse, a P5 data glove and a vision-based
hand tracker, all sharing the same interface. It should be noted that the gestures used in the
experiments were arbitrarily devised. In a final application, users would have the freedom to define
their own gestures.
Place Pick Delete
FIGURE 5.1. Gesture set used for recognition tests
5.1.1. Choice of Feature Vector. The choice of the feature vector is of crucial importance
for a gesture recognition system because it is the only data, from the raw input stream, sent to the
recognizer. Several different feature vectors were taken into account, using a mouse-based gesture
recognizer input modality, in order to determine the most suitable one for the kind of application in
which the framework is intended to be used, that is, iconic and deictic gestures.
Certain feature vectors were immediately rejected because they were not compatible with the
projected framework's applications. For instance, the position vector cannot be used in the current
system because a gesture would always have to be performed at the same location in order to be
recognized. Constraining gestures to a fixed location, however, runs contrary to the natural
character of a gesture-based user interface. A possible solution would be to train the hidden Markov models with very sparse
data in terms of gesture position. This would, however, lead to a poor recognition rate since the
sparseness of data results in a large variance in the gesture models and further, to the spotting
of incorrect sequences. Another rejected feature vector is the difference vector between successive
data points. The small variation in the values of this feature vector during hand movement causes
confusion for dissimilar gestures or random movement.
Experiments, described below, were conducted to determine which of the remaining proposed
feature vectors should be considered in the current framework.
(1) acquire multiple repetitions of each of the three gestures (see Figure 5.1) discretely, while
logging raw mouse input data in a file1
(2) perform the training of hidden Markov models with the previous data for every considered
feature vector
1 We chose twenty repetitions as an arbitrary number that was not too onerous for new users, yet sufficient to achieve reasonable fidelity in the trained models.
Sequence 11
Features                   Place   Pick    Delete   # ins.   # subs.
Angle vector quantized*    1.0     1.0     1.0      1        0
Angle vector               1.0     1.0     1.0      0        0
Delta vector               1.0     0.3     0.8      3        2
Polar coordinates          0.8     1.0     1.0      0        0

Sequence 12
Features                   Place   Pick    Delete   # ins.   # subs.
Angle vector quantized     1.0     1.0     0.6      0        3
Angle vector*              1.0     1.0     0.9      2        1
Delta vector               0.9     0.6     0.9      1        0
Polar coordinates          0.9     1.0     0.9      3        1

Sequence 13
Features                   Place   Pick    Delete   # ins.   # subs.
Angle vector quantized     1.0     1.0     0.9      1        1
Angle vector               1.0     1.0     0.6      0        4
Delta vector               1.0     0.2     1.0      2        0
Polar coordinates*         1.0     1.0     0.8      0        2

Sequence 14
Features                   Place   Pick    Delete   # ins.   # subs.
Angle vector quantized     0.9     1.0     0.8      3        3
Angle vector               0.9     1.0     0.8      4        3
Delta vector*              1.0     1.0     1.0      3        0
Polar coordinates          0.9     1.0     0.9      0        1

Total
Features                   Place   Pick    Delete   # ins.   # subs.
Angle vector quantized     0.975   1.0     0.825    5        7
Angle vector               0.975   1.0     0.825    6        8
Delta vector               0.975   0.525   0.925    5        2
Polar coordinates          0.9     1.0     0.9      3        4

TABLE 5.1. Recognition rate for feature selection, including the number of insertions (# ins.) and substitutions (# subs.)
(3) acquire multiple repetitions of each of the three gestures continuously in a realistic
application context, while logging raw mouse input data in a file2
(4) for every feature vector, perform the recognition process on the realistic sequence and
measure the recognition rate
The selected feature sets are the following: delta positions (dx, dy), the movement vector
in polar coordinates (r, θ), the movement vector angle (θ) and the quantized movement vector
angle (θq), as suggested by Lee [59]. Although each of the four feature vectors was evaluated
2We chose ten repetitions as an arbitrary number that was not too onerous for new users, yet sufficient to obtain adequate recognition rates.
with respect to its recognition accuracy, recognition was also performed on-line during this process
in order to provide the user with visual feedback. Each sequence in Table 5.1 indicates by an
asterisk which vector was used for this purpose. Forty repetitions of each gesture are therefore
considered to determine which feature vector leads to the highest recognition rate. The recognition
rate is calculated as the ratio of recognized gestures over the number of performed gestures,
thus not counting insertions. An insertion occurs when random movement is recognized as a gesture,
whereas a substitution happens when there is confusion between two models. A deletion occurs
when a gesture is performed without being spotted.
As seen in the results of Table 5.1, every considered feature vector offers approximately the
same recognition rate, and the numbers of insertions and substitutions do not differ significantly. This
result could have been predicted, because "polar coordinates" and "delta vector" actually
provide the same data to the recognizer, albeit in different representations. Since the use of the angle
vector results in approximately the same recognition rates, it can be concluded that the vector
magnitude, or movement speed, does not provide further information to the recognizer in most
cases. Therefore, the angle vector is considered the most meaningful feature vector for the rest of
the tests. Quantizing the angle did not offer significantly better results.
It is however possible to recover information related to the gesture's velocity of execution from
raw data. Since usual gesture capturing devices such as a mouse or data glove are sampled at a fixed
rate, the number of data points that compose the gesture depends on the speed of execution.
The faster a gesture is executed, the fewer data points it is composed of, and vice versa. With the
input token data passing system, it is possible for a user who needs information on the speed of
execution to recover it.
The number of insertions and substitutions for the considered feature vector is relatively high
compared to what would be expected from an accurate gesture recognition system. Most of the
confusion in the gesture recognition procedure occurs when the "delete" gesture is performed, which
is recognized as being the "pick" gesture. This confusion is related to the fact that the "delete"
gesture is sometimes executed with less precision, leading to rounded changes of direction. The
performed gesture hence becomes recognized as being more of a circular shape than a back and forth
movement, which leads to confusion with the "pick" gesture. A possible solution to this problem
would be to train the "delete" gesture with non-ideal sequences, similar to the ones performed
during the recognition stage. Achievable ways of implementing the latter solution would be to use
a larger collection of training data from a wide range of users, or to interactively train the HMMs
during the recognition stage [15,102].
On a more qualitative note, the number of insertions and substitutions observed in
Table 5.1 gives a good idea of which feature vector will be the most suitable for novice or
experienced users. The feature vectors that do not have a high recognition rate yield fewer insertions
or substitutions, but more deletions, the error type that directly lowers the recognition rate.
Therefore, in order to be recognized, gestures have to be performed more precisely, similar to the
ones that were used to train the hidden Markov models. This behaviour would be acceptable for
novice users who do not know perfectly how the system works. It would however not be suitable
for experienced users, who perform gestures faster and less accurately. Their tolerance to recognition
errors is likely to be higher because experienced users know how to recover from classification errors.
Sequence   Place   Pick    Delete   # ins.   # subs.
11         1.0     1.0     1.0      0        0
12         1.0     1.0     1.0      1        0
13         1.0     1.0     0.9      0        1
14         0.8     0.9     1.0      0        3
Total      0.925   0.975   0.975    1        4

TABLE 5.2. Mouse-based gesture recognition rate with improved trained HMMs
5.1.2. Mouse-Based Gesture Recognition Rate. In this section, the training set of
gestures was selectively constructed in order to improve the recognizer's efficiency on the same
recognition sequences as in the previous section. Improving a model consists of adding samples
to the training database, performing the recognition process, and then executing another training
pass. This ensures that the gesture model takes into account sequences similar to
the ones that were not recognized in the first recognition round. This procedure is repeated until
satisfactory results are obtained. The enhanced models were the "pick" and "delete" gestures, which
were too often substituted when using the original training sets. The improvements generally lead
to higher recognition rates and fewer insertions and substitutions, as can be seen in Table 5.2. Results
Loop   Square   Cross   Delete   Angle   Infinity   Fish   Triangle   Circle   Wedge
FIGURE 5.2. Large gesture set
show that over 120 performed gestures, an average of 96% were recognized with a database composed
of three gestures.
For a larger number of gestures, it is expected that the recognition rate will be lower since
gestures are easily confused, especially very similar ones. Figure 5.2 shows ten different gesture
models used as a dataset in order to test the recognizer's performance for a large number of potential
hypotheses. In this experiment, the ten gestures are trained using twenty repetitions each and the
recognition tests use ten continuous repetitions of every gesture.
TABLE 5.3. Mouse gestures recognition rate for a large number of possible gestures, including insertion and substitution errors
As seen in Table 5.3, the recognition rate for a large number of possible gestures is satisfying,
with an average of 88%. The most easily confused gesture is "fish", which is often mistaken
for "loop", likely because these two gestures have similar shapes and starting directions, which
confuses the recognizer since the starting point of a gesture sequence is unknown. It should
be noted that the number of insertions reported in Table 5.3 does not take into account insertions
that occur during random mouse movement, but only the ones that occur during the actual gesture
performance. Gestures akin to "fish", "loop" and "circle" are therefore likely to
be recognized whenever the mouse movement is similar to the trained models. Choosing gestures
distinct from free-hand movement is therefore of crucial importance.
5.1.3. Glove-Based Gesture Recognition Rate. Gesture recognition experiments have
also been performed using a P5 data glove [31]. The training set is, as with the mouse, composed of
twenty repetitions for each of the three considered gestures. The recognition stage consisted of
performing about twenty samples of every gesture.
As seen in the "original models" section of Table 5.4, the recognition rate is much lower than
when the mouse is used as an input modality, with an average recognition rate of 75%. The
disappointing outcome is not a problem of sampling precision, since the P5 data glove offers a resolution
of 0.3 cm [31], which is sufficient given that the amplitude of a typical gesture is on the order of tens
of centimetres. Poor training of the gesture models is the cause of all errors encountered during those
tests. The "place" gesture was not trained correctly for this run, since many insertions occurred
when performing the "move" or "delete" gestures. A human factor explains why it is so hard to obtain
an accurate gesture model in the training: fatigue. The training phase of the experiment consisted of
performing twenty repetitions of each gesture, successively. It is however quite tiring for a human to
hold their arm in the air for long periods of time. A second run of the same experiment was therefore
conducted, this time allowing the user some time to rest after every five gestures, in the expectation
of higher recognition rates.
Original models    Place                 Move              Delete
Recognition rate   0.71                  0.79              0.74
Errors             Substitutions with    Insertions of     Lots of insertions
                   "move"                "place"           of "place"

Improved models    Place                 Move              Delete
Recognition rate   0.81                  1.0               0.81
Errors             Deletions             None              Substitutions with
                                                           "place" and "pick"

TABLE 5.4. Recognition rate of glove gestures with original and improved models
As seen in the "improved models" section of Table 5.4, using enhanced models leads to a higher
average recognition rate of 87% for the three considered gestures, as weIl as fewer errors. Two
conclusions can be drawn from this last experiment: first, the quality of training data significantly
influences the expected recognition rate. The more accurate a model is, the easier the gesture will
be recognized at runtime. Secondly, it is important for free-hand gestures to avoid using ones that
need prolonged holding of the arm in the air. These should therefore only be used when they bring
another dimension to the interaction, such as executing virtual manipulation, as opposed to be
invoking every single operation. A well-known use of gestures is to refer to the spatial dimension of
things, which can be unnatural to specify with other modalities. Commands to a system, however,
can easily be issued using speech, which will be discussed in Section 6.2.
5.1.4. Vision-Based Hand Gesture Recognition. The video-based hand position
capture system described in Section 3.5.3 was employed in order to provide input data to the HMM
gesture recognizer. However, due to the following system limitations, no meaningful data could be
obtained from the preliminary experiments: since the video tracker is still under development, its
performance is significantly below what would be expected of an appropriate tracking system. The
frame rate is approximately 16 position updates per second on a Pentium IV 2.6 GHz, which is not
sufficient for prolonged use without perceiving an annoying lag between the actual hand
movement and the sight of the virtual cursor moving on the projection screen. Ware [99] reports
that hand tracking frame rate and lag are critical for decent interaction with a virtual
environment.
Preliminary experiments were conducted in order to show that hidden Markov models can be
trained using a video-based tracking system with the current framework. Simple gestures have also
been recognized. However, as the tracker does not yet offer all the accuracy and precision
that hand gestures need, more meaningful results should be obtained as the tracker's performance
increases in the future. Nevertheless, the software framework supports vision-based gesture
recognition, which is promising for the integration of additional modules.
5.2. The Context Grabber's Influence on the Recognition Rate
As presented in Section 4.7, the context grabber is used to restrict the number of possible
gesture hypotheses at every moment. This restriction takes advantage of the application's
as well as the user's status in order to eliminate gesture candidates. An experiment
showing differences in recognition rates between the inclusion and exclusion of the context grabber
was conducted, using the large gesture set shown in Figure 5.2. The recognition rate is expected
to be similar in both situations since the same gesture models are used. However, more insertions
of incorrect gestures should be observed when the context grabber is not used, especially for gestures
similar to a user's random movement.
Table 5.5 shows that the recognition rate does not vary significantly whether the context grabber
is included or excluded, though it is marginally lower when the context grabber is turned off.
                        Context ON                 Context OFF
Sequence   Measure      Place   Move   Delete     Place   Move   Delete
11         Rate         1.0     1.0    1.0        0.9     0.8    1.0
           Insertions   1       0      0          2       4      1
12         Rate         1.0     0.9    1.0        0.8     0.8    1.0
           Insertions   1       3      0          1       6      0
13         Rate         0.9     0.9    1.0        0.9     0.8    1.0
           Insertions   2       2      0          2       9      2
14         Rate         0.9     1.0    1.0        0.9     1.0    1.0
           Insertions   2       3      0          1       6      0

TABLE 5.5. Recognition results for a large number of gestures with and without the context grabber
However, a significant number of additional gesture insertions occur without the context grabber.
This simple experiment shows two things: first, taking advantage of the context can help
improve overall recognition performance, especially when gestures share similar shapes (e.g. "loop"
and "circle"). Second, the number of possible gestures in a context should be kept to a minimum,
so that the recognizer is not confused and the processing time needed to generate hypotheses,
which is linear in the number of available gesture models, does not become too long. The rest of
the algorithm runs in constant time, bounded by the time needed to process the valid hypotheses.
This simple experiment also shows that not only should the system take advantage of the
context in order to better recognize gestures performed by the user, but users should also choose their
gestures such that they will not be confused with the random movement that occurs between
two actual gestural expressions. A more complex context grabber would further constrain the
number of gestures to be recognized, which will be discussed in Section 6.2.
5.3. The Experimental Application
An experimental application of the framework, which takes advantage of the available gesture
recognition input modalities, was implemented in order to demonstrate and test the validity of the
different concepts presented throughout this thesis. The application allows a user to place and
modify the appearance and state of virtual objects in a three-dimensional (3D) world. It is possible
to place three-dimensional models in the virtual space, or two-dimensional images on the screen
FIGURE 5.3. A typical scene from the experimental application
plane.3 Those entities can then be moved around in the virtual world, using hand gestures.
Three-dimensional models can also be textured and rotated using hand gestures. Entities can be deleted,
and a history of applied actions is kept in order to provide undo and redo facilities.
A typical view of a virtual world might look like the one shown in Figure 5.3. In this scene,
different 3D models (some chairs, a plant and a fridge) were placed, moved, rotated and textured
in order to demonstrate the purpose of the current framework. Concrete actions were programmed
in order to allow the aforementioned operations on objects. The specifications of those actions, and
what they expect from an input token and world entities, can be seen in Annex B.1.
In addition to the mouse-based gesture recognizer, the "clockTick" input modality is used in
order to periodically emit tokens at specified time intervals. In the present case, a clock tick is sent
every 100 milliseconds, which is an arbitrary value set by the user. These periodic events trigger an
action that displays the system's information in a 3D text entity. This information gives a clue to
3Three-dimensional models are 3D Studio Max files.
[State diagram: from the "Idle" state, the gestures "Place Model3D" (Chair, Fridge, Plant), "Pick for rotation", "Pick for translation" and "Texture" (Brown, Grey, Red) lead to the "Placing", "Rotating", "Translating" and "Texturing" states respectively; "Place Image", "Undo", "Redo" and "Delete" are performed from "Idle".]
FIGURE 5.4. Experimental application's gesture dialogue
the user as to which state the interaction manager is in at every moment; in Figure 5.3 the system
is in "translating" mode. Providing this information helps users recover from errors, since there
would otherwise be no way of knowing whether a gesture was recognized correctly or whether an
incorrect gesture was inserted.
A gesture dialogue is proposed in order for a user to reuse known gestures for invoking many
actions. This dialogue is managed with gestures that adjust the interaction manager's state, as seen
in Figure 5.4, where new states are in italics. For instance, in order to place a new three-dimensional
model, a user needs to perform the "placeModel3D" gesture, which notifies the interaction manager
that the system now needs to recognize the gestures that can be performed in "placing" mode.
In order to exit the "placing" mode, a user executes a gesture that brings the interaction
manager back to the "idle" mode, or whatever was specified in the configuration. The same scheme
is employed for texturing, rotating and translating 3D objects. Since the interaction manager needs
to be in a specific state in order for actions to be applied to certain entities, gestures can be reused
for invoking multiple actions. For instance, placing a "chair" and texturing a model "brown" can
both be invoked using the same gesture without any conflict or misinterpretation. In addition, the
context grabber is used to reduce the number of gestures to be recognized at every moment. It
therefore allows users to define gestures that are similar for two actions applied in two
different interaction manager contexts.
For a complete reference on the XML configuration file used to configure the experimental
application, see Annex C.
5.4. Framework's Performance With a Large Number of Entities
The experimental application was used to test the performance of the entire software framework. In this particular application, it is of crucial importance that the framework keep a decent refresh rate for the OpenGL output modality [99], even if the virtual world is cluttered with a large number of world entities. Two restrictions must be taken into account when using world entities displayed on the screen in a three-dimensional OpenGL environment: the maximum number of polygons that can be rendered by the graphics card, and the maximum number of world entities that can be processed by the interaction manager every time an input token is emitted. Experiments were conducted to observe the effect of adding world entities to the virtual world on the OpenGL display frame rate and on the interaction manager's processing time.
[Figure: plot of frame rate versus number of entities (0-1200), with curves for "idle", "moving cursor" and "moving entity".]
FIGURE 5.5. Frame rate as a function of the number of entities (debug version)
The software was run on a Pentium M 1.8 GHz with 512 MB RAM and an ATI Mobile FireGL T2 with 128 MB RAM, running Windows XP Pro SP2, compiled with Microsoft Visual Studio .NET 2003 in debug and release versions, using an ACE High_Res_Timer [81] to measure the processing time. A Win32 timer triggers the rendering at a maximum frame rate of 50 Hz for both versions, as seen in Figures 5.5 and 5.6 when the number of entities is low. The "idle" curves of these two figures show
[Figure: plot of frame rate versus number of entities (0-1200), with curves for "idle", "moving cursor" and "moving entity".]
FIGURE 5.6. Frame rate as a function of the number of entities (release version)
the frame rate when the cursor is not moving, which means that the processing time is entirely spent on drawing the scene. In debug mode, the "idle" curve shows a significantly higher frame rate than when the cursor is moving. In the latter condition, the interaction manager has to process "moveCursor" input tokens, which entails finding out whether an action has to be applied to every entity. In addition, the gesture recognizer has to process incoming data to determine whether the sequence of incoming positions corresponds to a known gesture. This leads to a decreased frame rate, as observed on the "moving cursor" curve for the debug version of the program. When an entity has been picked for translation and is being moved around in the virtual space, an additional step has to be performed: validating the entities that may be moved. In fact, when an entity is picked, the interaction manager's state becomes "translating", which allows the "translate" action to be applied to "translatable" entities. In the present case, every entity owns the "translatable" behaviour; hence, they all need to be validated by the action. This supplementary "validateEntities" step takes a significant amount of time in debug mode, and keeps only the entity whose "picked" property is set to "true". This is why a lower frame rate can be observed on the "moving entity" curve. The debug version of the software is interesting because it shows which parts
of the interaction manager would benefit the most from optimization, namely, in the present case, the action's entity-validation method.
Unlike in the debug version, the three curves displaying the release version's performance in Figure 5.6 do not show a large difference between the three aforementioned cursor states. In fact, the performance drop caused by having picked an entity is negligible compared to the time needed to render the scene. The large difference between the debug and release versions of the software can be explained by the fact that the release version uses an optimized version of the C++ library, which includes the STL. Since the STL is used extensively in the software, performance increases accordingly compared to the debug version. The release version of the software keeps a frame rate above 24 Hz, below which humans notice choppiness, for up to approximately 850 entities. The entities used in the experiment are two-dimensional images made of 128x128-pixel textures displayed in an orthographic view.
[Figure: plot of processing time (0-70 ms) versus number of entities (0-1200), with curves for the debug and release versions.]
FIGURE 5.7. Interaction manager's processing time as a function of the number of entities
Figure 5.7 shows a plot of the processing time needed to parse a "moveCursor" token when an
entity is picked, as a function of the number of entities. The curve showing the performance of the
debug version appears to be linear with a slope of approximately 6 ms per 100 entities, the algorithm
being linear in terms of entities for a given number of actions. The release version of the software shows a curve that also tends to be linear, but with a much lower slope, far below 1 ms per 100 entities. This plot therefore shows that the software's limitation is not a matter of the number of actions and entities, but rather of the number of polygons to render, which could be improved with optimized OpenGL commands and textures [83].
While performing this experiment, it was noticed that the priorities of the different software threads play an important role in the visual user feedback. When using a mouse-based gesture recognizer and an OpenGL display, three concurrent threads run in the software: one that sends input tokens to the interaction manager (the input modality's thread), one that takes the input tokens, parses them and applies actions to the world entities (the interaction manager's thread), and another that renders the world in OpenGL displays (the main thread). The main display thread is triggered every 20 milliseconds by a Win32 timer, which aims to meet the 24 frames per second requirement. However, the interaction manager's thread needs a higher priority than the display thread in order to invoke the actions corresponding to the input events first, and only then draw the results. The opposite could have the effect of drawing non-updated entities on the screen while tokens wait in the queue to be processed. Likewise, the mouse thread needs a higher priority than the display thread, but a lower one than the interaction manager's, since it is more important to process input tokens than to collect new data.
This empirical adjustment of thread priorities removes the lag in the mouse cursor's position that could be observed when all threads had the same priority and a large number of entities was displayed. Instead of lag on the screen, the effect of too many input tokens being sent at the same time is that points are missing from the mouse trajectory. This is notably because mouse events originate from the main thread, which does not have the highest priority and drops mouse positions when too many of them accumulate in the queue.
5.5. General Discussion and Limitations
The continuous gesture recognition algorithm and the implemented software framework are obviously not perfect and have several limitations, which are described in this section. The gesture spotting algorithm succeeds at finding simple gestures in the input data stream, but more complex ones are hard to recognize, especially when they are composed of known simpler gestures. This
limitation prevents long and complex gestures from being recognized. However, natural gestures are almost never complex, so this limitation might not be that crucial for such an interface [70].
Another problem is that the gesture spotting algorithm occasionally introduces errors in the detection of the gesture starting point. This is notably due to the often incorrect assumption that the best hypothesis should be the longest gesture. The gesture spotting algorithm should also be more constrained in its acceptance of the best hypothesis. Preliminary user testing shows that users find insertions more annoying than deletions from the gesture stream. These are only qualitative results that should be revised and confirmed in the future with real experimental data. More feedback should also be provided to the user, for instance by showing which gestures are associated with which actions, or by showing the anticipated beginning of a spotted gesture.
As for the software framework, no critical performance limitation was observed while implementing the small application presented earlier in this chapter, nor for a version of the program that uses OpenSceneGraph [69] as an output modality. The author of the framework, however, wrote the additional actions and world entities himself, which may bias the evaluation in favour of the framework. With future use, more conclusions will be drawn on the flexibility and extensibility that the software framework offers.
Additionally, the "clockTick" input modality allows for time-stepped actions, which is interesting for animations or any operation that needs periodic updates. One problem that could be encountered is the lack of a scheduler in the interaction manager, as input tokens are parsed with a first-in-first-out (FIFO) strategy. A scheduler would allow selected input tokens to be parsed before others, as needed.
With larger applications comes the problem of a large XML configuration file. The software framework would benefit from a simple graphical user interface (GUI) allowing every component to be configured before starting the software. Such a GUI would also benefit from utilities for configuring the entities and every manager used in the application at runtime.
CHAPTER 6
Conclusion and Future Work
6.1. Conclusions
In this thesis, the problem of recognizing gestures using various input modalities was addressed. Standard hidden Markov models were used to recognize temporal sequences of gesture features. HMMs are extensively used in speech recognizers and are adaptable to gesture recognition, given an appropriate choice of feature vector. The most advantageous feature chosen is the angle of the vector describing the hand movement. LTI-Lib was used as an implementation of hidden Markov model data structures and algorithms. The training algorithm is the segmental K-means, which optimizes the HMM's parameters for the most likely state sequence only, rather than optimizing the model over every state sequence as the Baum-Welch algorithm does. The recognition algorithm is composed of three parts: hypothesis generation, hypothesis scoring (Viterbi algorithm) and pruning, and gesture spotting. Gesture spotting is necessary because recognition is continuous, meaning that both the start and end points of a gesture are unknown. A gesture is considered spotted when the most likely hypothesis fits criteria that take advantage of the HMM's state structure.
In order to allow multiple input and output modalities to be used in the general and flexible context of a virtual world model, a software framework was developed. This framework uses a generic data container to represent knowledge in all the modules of the data pipeline. The data flow is initiated by input tokens emitted every time an event occurs in an input modality. Input tokens contain all the information one needs to correctly analyze what happened in the real world and to execute the associated operations on the virtual world. In order to apply actions to the virtual world, input tokens are parsed by an interaction manager that, given several different constraints, decides which actions need to be applied to the corresponding world entities. These constraints are defined by a set of behaviours belonging to the world entities, which must match the associated action's behaviours and data for the action to be triggered. To be interactive, every world entity has associated behaviours that furnish properties, which are modified by actions and retrieved in order to affect the rendering process. This latter operation is performed by the output modalities, which call the rendering method of every world entity; each entity knows how to render itself for a specific output modality type. The rendering process is generic, thus allowing not only visual output but any kind of output modality to share the same data pipeline.
An experimental application was implemented to demonstrate the various concepts developed throughout the thesis. Several gesture input modalities (mouse, glove, vision) were also implemented in order to experiment with gesture interaction in virtual worlds. Basic experiments were conducted to test recognition rates under several conditions; they show that continuous gesture recognition is possible at a reasonable recognition rate, with on the order of 10 HMMs recognized at the same time.
6.2. Future Work
In terms of future work, it would be interesting to conduct user testing in order to establish whether gestures are usable as the sole modality when interacting with a virtual world. A speech input modality would also be interesting to integrate into the framework, to let users say a given command instead of performing a gesture. Such a multimodal system would use multimodal commands in order to take advantage of both modalities. In addition, a modality integrator would be necessary to manage and merge information coming from multiple input sources. Such work is currently being pursued in the SRE laboratory and is planned to be integrated in the near future.
In terms of gesture interaction, a richer gesture syntax would provide supplementary parameterization possibilities. The position and curvature of the fingers could be used to give extra information and context to the system, allowing more complex grammars to be used. Some work has already been done in that direction in the SRE laboratory, providing algorithms to segment the hand and detect fingertips. The remaining step is the integration of these algorithms into the current
framework, as well as an adaptation of the gesture recognizer so that it can recognize gestures parameterized with different finger positions. A more complex feature vector, and possibly recognition algorithm, would then be needed to adapt to the new input stream. Additional gesture-based input modalities could also be implemented. For example, systems such as those developed by Polhemus could allow for accurate three-dimensional position and rotation, in contrast with the current vision-based system, which has not yet reached the desired level of accuracy and speed.
Online gesture training could also be implemented in order to keep the gesture models up to date while the user is performing them. This additional training phase would probably improve the overall recognition rate and decrease the number of insertions. There should, however, be a way of indicating to the system when a gesture was not correctly recognized, so that the training gestures are only those confirmed to have been accurately spotted. Garbage models could also be trained, taking advantage of known incorrect gestures to discriminate wrong gestural expressions more easily.
Improved context grabbing could also be implemented in order to retrieve more data about the user's state, which would put additional constraints on the possible actions that can occur at any point in time. For example, a system using eye tracking would know where the user is looking, and would be able to restrict the actions to those associated with a given "target" object.
Enhanced network support could also be implemented using a middleware such as CORBA in order to manage objects remotely, transparently to the user. ORB services such as event channels or timing services could also be used, allowing synchronization between multiple virtual environment instances.
Finally, it is worth mentioning that the goal of this thesis was to prove that gestures can be used to control a virtual world, but there are several drawbacks to using only gestures. This is why the presented software framework was designed and implemented with the idea that one day it would be used for modalities other than gesture inputs. The next step in the research is to incorporate other input modalities such as speech, and to build an output modality that could be used in a CAVE immersive environment like the one owned by the SRE laboratory. This would be a step forward toward immersive computing.
REFERENCES
[1] Marcell Assan and Kirsti Grobel, Video-based sign language recognition using hidden Markov models, International Gesture Workshop on Gesture and Sign Language in Human-Computer Interaction, Springer, 1998, pp. 97-109.
[2] Thomas Baudel and Michel Beaudouin-Lafon, CHARADE: remote control of objects using free-hand gestures, Commun. ACM 36 (1993), no. 7, 28-35, ACM Press.
[3] François Bernier, Denis Poussart, Denis Laurendeau, and Martin Simoneau, Interaction-centric modelling for interactive virtual worlds: The APIA approach, 16th International Conference on Pattern Recognition (ICPR'02) Volume 3, IEEE Computer Society, 2002.
[4] Allen Bierbaum, Christopher Just, Patrick Hartling, Kevin Meinert, Albert Baker, and Carolina Cruz-Neira, VR Juggler: A virtual platform for virtual reality application development, Virtual Reality 2001 Conference (VR'01), IEEE Computer Society, 2001, p. 89.
[5] Jeff Bilmes, What HMMs can do, Tech. report, University of Washington, 2002.
[6] Henrik Birk and Thomas Baltzer Moeslund, Recognizing gestures from the hand alphabet using principal component analysis, Master's thesis, Aalborg University, Denmark, 1996.
[7] Michael J. Black and Allan D. Jepson, Recognizing temporal trajectories using the CONDENSATION algorithm, 3rd International Conference on Face & Gesture Recognition, 1998, pp. 16-21.
[8] Aaron F. Bobick and James W. Davis, The recognition of human movement using temporal templates, IEEE Trans. Pattern Anal. Mach. Intell. 23 (2001), no. 3, 257-267, IEEE Computer Society.
[9] Aaron F. Bobick and Yuri A. Ivanov, Action recognition using probabilistic parsing, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, 1998, pp. 196-202.
[10] Richard A. Bolt, "Put-That-There": Voice and gesture at the graphics interface, SIGGRAPH '80, 7th annual conference on Computer graphics and interactive techniques, ACM Press, 1980, Seattle, Washington, United States, pp. 262-270.
[11] Yves Boussemart, François Rioux, Frank Rudzicz, Mike Wozniewski, and Jeremy R. Cooperstock, A framework for 3d visualization and manipulation in an immersive space using an untethered bimanual gestural interface, ACM Symposium on Virtual Reality Software and Technology, ACM Press, 2004.
[12] Matthew Brand, Nuria Oliver, and Alex Pentland, Coupled hidden Markov models for complex action recognition, Conference on Computer Vision and Pattern Recognition (CVPR '97), IEEE Computer Society, 1997, pp. 994-999.
[13] Peter Bull, State of the art: Nonverbal communication, The Psychologist 14 (2001), 644-647.
[14] Lee W. Campbell, David A. Becker, Ali Azarbayejani, Aaron F. Bobick, and Alex Pentland, Invariant features for 3-d gesture recognition, Automatic Face and Gesture Recognition, 1996, pp. 157-163.
[15] Xiang Cao, An exploration of gesture-based interaction, Master's thesis, Department of Computer Science, University of Toronto, 2004.
[16] Xiang Cao and Ravin Balakrishnan, Evaluation of an online adaptive gesture interface with command prediction, Graphics Interface Conference, 2005, pp. 187-194.
[17] Michael Capps, Don McGregor, Don Brutzman, and Michael Zyda, NPSNET-V: A new beginning for dynamically extensible virtual environments, IEEE Comput. Graph. Appl. 20 (2000), no. 5, 12-15.
[18] Christer Carlsson and Olof Hagsand, DIVE - a multi user virtual reality system, IEEE Virtual Reality Annual International Symposium, 1993, pp. 394-400.
[19] Jeremy R. Cooperstock, Interacting in shared reality, HCI International, Conference on Human-Computer Interaction (Las Vegas), 2005 (to appear), http://www.cim.mcgill.ca/sre/publications/hci05.pdf.
[20] Immersion Corp., CyberGlove, http://www.immersion.com/3d/products/cyber_glove.php.
[21] ___, CyberGrasp, http://www.immersion.com/3d/products/cyber_grasp.php.
[22] Microsoft Corp., Raw input, http://msdn.microsoft.com/library/default.asp?url=/library/en-us/winui/winui/windowsuserinterface/userinput/rawinput.asp.
[23] Ascension Technology Corporation, Flock of birds, http://www.ascension-tech.com/products/flockofbirds.php.
[24] Carolina Cruz-Neira, Daniel J. Sandin, Thomas A. DeFanti, Robert V. Kenyon, and John C. Hart, The CAVE: audio visual experience automatic virtual environment, Commun. ACM 35 (1992), no. 6, 64-72, ACM Press.
[25] Ross Cutler and Matthew Turk, View-based interpretation of real-time optical flow for gesture recognition, Automatic Face and Gesture Recognition, 1998, pp. 416-421.
[26] Marek Czernuszenko, Dave Pape, Daniel Sandin, Tom DeFanti, Gregory L. Dawe, and Maxine D. Brown, The ImmersaDesk and infinity wall projection-based virtual reality displays, Computer Graphics 31 (1997), no. 2, 46-49.
[27] Andries van Dam, Post-WIMP user interfaces, Commun. ACM 40 (1997), no. 2, 63-67.
[28] Trevor Darrell and Alex P. Pentland, Space-time gestures, Conference on Computer Vision and Pattern Recognition, 1993, pp. 335-340.
[29] Konstantinos G. Derpanis, A review of vision-based hand gestures, Tech. report, York University, Toronto, Canada, 2004.
[30] Konstantinos G. Derpanis, Richard P. Wildes, and John K. Tsotsos, Hand gesture recognition within a linguistics-based framework, ECCV04, Springer, 2004, 3021, pp. 282-296.
[31] dimensionline, P5 glove, http://www.p5glove.com.
[32] Irfan A. Essa and Alex P. Pentland, Facial expression recognition using a dynamic model and motion energy, Fifth International Conference on Computer Vision, IEEE Computer Society, 1995, pp. 360-367.
[33] Andrew Fischer and Judy M. Vance, PHANToM haptic device implemented in a projection screen virtual environment, Workshop on Virtual environments 2003, ACM Press, 2003, Zurich, Switzerland, pp. 225-229.
[34] G. David Forney Jr., The Viterbi algorithm, Proc. IEEE 61 (1973), 268-278.
[35] Apache Software Foundation, Xerces-C++, http://xml.apache.org/xerces-c.
[36] Jean-Marc François, Jahmm, 2005, http://www.run.montefiore.ulg.ac.be/~francois/software/jahmm/.
[37] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides, Design patterns: elements of reusable object-oriented software, Addison-Wesley Longman Publishing Co., Inc., 1995.
[38] Zoubin Ghahramani, An introduction to hidden Markov models and Bayesian networks, Hidden Markov models: applications in computer vision, World Scientific Publishing Co., Inc., 2002, pp. 9-42.
[39] GHMM, 2004, http://www.ghmm.org/.
[40] Benjamin A. Goldstein, Tandem: A component-based framework for interactive, collaborative virtual reality, Master's thesis, University of Illinois, Chicago, USA, 2000.
[41] Chris Greenhalgh and Steve Benford, MASSIVE: a distributed virtual reality system incorporating spatial trading, 15th International Conference on Distributed Computing Systems (ICDCS'95), IEEE Computer Society, 1995, pp. 27-34.
[42] Chris Greenhalgh, Jim Purbrick, and Dave Snowdon, Inside MASSIVE-3: flexible support for data consistency and world structuring, Third international conference on Collaborative virtual environments (San Francisco, California, United States), ACM Press, 2000.
[43] Object Management Group, Unified modeling language, http://www.uml.org.
[44] Yves Guiard, Asymmetric division of labor in human skilled bimanual action: the kinematic chain as a model, Journal of motor behavior 19 (1987), no. 4, 486-517.
[45] Martin Hachet, Pascal Guitton, and Patrick Reuter, The CAT for efficient 2d and 3d interaction as an alternative to mouse adaptations, ACM symposium on Virtual reality software and technology (Osaka, Japan), ACM Press, 2003.
[46] Patrick Hartling, Allen Bierbaum, and Carolina Cruz-Neira, Tweek: Merging 2d and 3d interaction in immersive environments, 6th World Multiconference on Systemics, Cybernetics, and Informatics (Orlando, Florida), 2002.
[47] Ken Hinckley, Patrick Baudisch, Gonzalo Ramos, and François Guimbretière, Design and analysis of delimiters for selection-action pen gesture phrases in scriboli, SIGCHI conference on Human factors in computing systems (Portland, Oregon, USA), ACM Press, 2005.
[48] IGN Entertainment Inc., Planet Black and White, http://www.planetblackandwhite.com.
[49] Hiroshi Ishii and Brygg Ullmer, Tangible bits: towards seamless interfaces between people, bits and atoms, SIGCHI conference on Human factors in computing systems, ACM Press, 1997, Atlanta, Georgia, United States, pp. 234-241.
[50] Yoshio Iwai, Hiroaki Shimizu, and Masahiko Yachida, Real-time context-based gesture recognition using HMM and automaton, International Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems, IEEE Computer Society, 1999, pp. 127-134.
[51] Yoshio Iwai, Ken Watanabe, Yasushi Yagi, and Masahiko Yachida, Gesture recognition using colored gloves, International Conference on Pattern Recognition (ICPR '96) Volume I, IEEE Computer Society, 1996, pp. 662-666.
[52] Biing-Hwang Juang and Lawrence R. Rabiner, The segmental k-means algorithm for estimating parameters of hidden Markov models, IEEE Transactions on Acoustics, Speech and Signal Processing 38 (1990), no. 9, 1639-1641.
[53] John Kelso, Steven G. Satterfield, Lance E. Arsenault, Peter M. Ketchan, and Ronald D. Kriz, DIVERSE: a framework for building extensible and reconfigurable device-independent virtual environments and distributed asynchronous simulations, Presence: Teleoper. Virtual Environ. 12 (2003), no. 1, 19-36, MIT Press.
[54] Adam Kendon, Current issues in the study of gesture, The biological foundations of gestures: motor and semiotic aspects (1986), 23-47.
[55] Nils Krahnstover, Sanshzar Kettebekov, Mohammed Yeasin, and Rajeev Sharma, A real-time framework for natural multimodal interaction with large screen displays, ICMI, 2002, pp. 349-354.
[56] Robert M. Krauss, Yihsiu Chen, and Purnima Chawla, Nonverbal behavior and nonverbal communication: What do conversational hand gestures tell us?, Advances in experimental social psychology (M. Zanna, ed.), Tampa: Academic Press, 1996, pp. 389-450.
[57] Takeshi Kurata, Takashi Okuma, Masakatsu Kourogi, and Katsuhiko Sakaue, The hand mouse: GMM hand-color classification and mean shift tracking, IEEE ICCV Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems (RATFG-RTS'01), IEEE Computer Society, 2001.
[58] Marcus Vinicius Lamar, Hand gesture recognition using T-CombNET - a neural network model dedicated to temporal information processing, Ph.D. thesis, Nagoya Institute of Technology, 2001.
[59] Hyeon-Kyu Lee and Jin H. Kim, An HMM-based threshold model approach for gesture recognition, IEEE Trans. Pattern Anal. Mach. Intell. 21 (1999), no. 10, 961-973, IEEE Computer Society.
[60] David McNeill, Language and gesture, Cambridge University Press, Cambridge, 2000.
[61] Marielle Mokhtari, François Bernier, François Lemieux, Hugues Martel, Jean-Marc Schwartz, Denis Laurendeau, and Alexandra Branzan-Albu, Virtual environment and sensori-motor activities: Haptic, auditory and olfactory devices, The 12th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision (WSCG2004), vol. 1-3, Feb. 2-6, 2004, UNION Agency - Science Press, 2004, pp. 109-112.
[62] Darnell Janssen Moore, Vision-based recognition of actions using context, Ph.D. thesis, Georgia Institute of Technology, Atlanta, GA, 2000.
[63] Mozilla, Mouse gestures, http://optimoz.mozdev.org/gestures/.
[64] Martin Naef, Edouard Lamboray, Oliver Staadt, and Markus Gross, The blue-c distributed scene graph, Workshop on Virtual environments 2003 (Zurich, Switzerland), ACM Press, 2003.
[65] Chan Wah Ng and Surendra Ranganath, Real-time gesture recognition system and application, Image Vision Comput. 20 (2002), no. 13-14, 993-1007.
[66] Jakob Nielsen, Heuristic evaluation, Usability inspection methods, John Wiley & Sons, Inc., 1994, pp. 25-62.
[67] Kenji Oka, Yoichi Sato, and Hideki Koike, Real-time fingertip tracking and gesture recognition, IEEE Comput. Graph. Appl. 22 (2002), no. 6, 64-71.
[68] OMG, CORBA, http://www.corba.org.
[69] OpenSceneGraph, http://www.openscenegraph.org.
[70] Vladimir I. Pavlović, Rajeev Sharma, and Thomas S. Huang, Visual interpretation of hand gestures for human-computer interaction: A review, IEEE Trans. Pattern Anal. Mach. Intell. 19 (1997), no. 7, 677-695, IEEE Computer Society.
[71] Vicon Peak, Motion capture systems, http://www.vicon.com.
[72] Polhemus, Tracking systems, http://www.polhemus.com.
[73] Francis K. H. Quek, Eyes in the interface, IVC 13 (1995), no. 6, 511-525.
[74] ___, Unencumbered gestural interaction, IEEE MultiMedia 3 (1996), no. 4, 36-47, IEEE Computer Society Press.
[75] Francis K. H. Quek, Xin-Feng Ma, and Robert Bryll, A parallel algorithm for dynamic gesture tracking, International Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems, IEEE Computer Society, 1999, pp. 64-69.
[76] Lawrence R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE, vol. 77, 1989, pp. 257-286.
[77] Lawrence R. Rabiner and Biing-Hwang Juang, Introduction to hidden Markov models, IEEE ASSP 3 (1986), no. 1, 4-16.
[78] Gerhard Rigoll, Andreas Kosmala, and Stefan Eickeler, High performance real-time gesture recognition using hidden Markov models, International Gesture Workshop on Gesture and Sign Language in Human-Computer Interaction, Springer-Verlag, 1997, pp. 69-80.
[79] RWTH-Aachen, LTI-Lib, 2005, http://ltilib.sourceforge.net/doc/homepage/index.shtml.
[80] Kingsley Sage, A. Jonathan Howell, and Hilary Buxton, Developing context sensitive HMM
gesture recognition, Gesture Workshop, 2003, pp. 277-287.
[81] Douglas C. Schmidt, ACE adaptive
http://www.cs.wustl.edu/''-'schmidt/ACE.html.
communication environ ment,
[82] Atid Shamaie and Alistair Sutherland, Accurate recognition of large number of hand gestures,
2nd Iranian Conference on Machine Vision and Image Processing, 2003.
[83] Dave Shreiner, Bob Kuehne, Thomas True, and Brad Grantham, Performance OpenGL:
Platform-independent techniques, SIGGRAPH '04 Course, 2004.
[84] Sandeep Singhal and Michael Zyda, Networked virtual environments: design and implemen
tation, ACM Pressj Addison-Wesley Publishing Co., 1999.
[85] Opera Software, Mouse gestures in Opera, http:j jwww.opera.comjfeaturesjmousej.
[86] Thad Starner and Alex Pentland, Real-time American Sign Language recognition from video
using hidden Markov models, International Symposium on Computer Vision, IEEE Com
puter Society, 1995, pp. 265-270.
[87] William C. Stokoe, Sign language structure: an outline of the visual communication systems
of the American deaf, Linstock Press, 1960.
[88] Josephine Sullivan and Stefan Carlsson, Recognizing and tracking human action, 7th European
Conference on Computer Vision, Springer-Verlag, 2002, pp. 629-644.
[89] Donald Tanguay, Hidden Markov models for gesture recognition, Master's thesis, MIT, 1995.
[90] Russell M. Taylor, Thomas C. Hudson, Adam Seeger, Hans Weber, Jeffrey Juliano, and
Aron T. Helser, VRPN: a device-independent, network-transparent VR peripheral system,
VRST, 2001, pp. 55-61.
[91] HTK Team, HTK speech recognition toolkit, 2004, http://htk.eng.cam.ac.uk/.
[92] SensAble Technologies, Haptic devices, http://www.sensable.com/.
[93] Henrik Tramberend, Avocado: A distributed virtual reality framework, IEEE Virtual Reality,
IEEE Computer Society, 1999, pp. 14-21.
[94] Matthew Turk, Perceptual user interfaces, Frontiers of human-centred computing, online
communities and virtual environments, Springer-Verlag, London, UK, 2001, pp. 39-51.
[95] ___ , Gesture recognition, Handbook of virtual environments: Design, implementation,
and applications (K. M. Stanney, ed.), Lawrence Erlbaum Associates, 2002, pp. 223-238.
[96] Minh Tue Vo, A framework and toolkit for the construction of multimodal learning interfaces,
Ph.D. thesis, Carnegie Mellon University, 1998.
[97] Christian Vogler and Dimitris Metaxas, A framework for recognizing the simultaneous aspects
of American Sign Language, Comput. Vis. Image Underst. 81 (2001), no. 3, 358-384.
[98] Willie Walker, Paul Lamere, Philip Kwok, Bhiksha Raj, Rita Singh, Evandro Gouvea, Peter
Wolf, and Joe Woelfel, Sphinx-4: A flexible open source framework for speech recognition,
Tech. report, Sun Microsystems, 2004.
[99] Colin Ware and Ravin Balakrishnan, Reaching for objects in VR displays: lag and frame
rate, ACM Trans. Comput.-Hum. Interact. 1 (1994), no. 4, 331-356.
[100] Kent Watsen and Michael Zyda, Bamboo - a portable system for dynamically extensible,
real-time, networked, virtual environments, Virtual Reality Annual International Symposium,
IEEE Computer Society, 1998.
[101] Alan Wexelblat, Research challenges in gesture: Open issues and unsolved problems, International
Gesture Workshop on Gesture and Sign Language in Human-Computer Interaction,
vol. 1371, Springer-Verlag, 1997, pp. 1-11.
[102] Andrew Wilson, Adaptive models for gesture recognition, Ph.D. thesis, MIT, 2000.
[103] Yaser Yacoob and Michael J. Black, Parameterized modeling and recognition of activities,
Comput. Vis. Image Underst. 73 (1999), no. 2, 232-247, Elsevier Science Inc.
[104] Ming Hsuan Yang, Narendra Ahuja, and Mark Tabb, Extraction of 2d motion trajectories
and its application to hand gesture recognition, IEEE Trans. Pattern Anal. Mach. Intell. 24
(2002), no. 8, 1061-1074.
[105] ZeroC, Object oriented middleware, http://www.zeroc.com/.
[106] Jörg Zieren, Nils Unger, and Suat Akyol, Hands tracking from frontal view for vision-based
gesture recognition, 24th DAGM Symposium Pattern Recognition, Lecture Notes in Computer
Science, vol. 2449, Springer, 2002, pp. 531-539.
APPENDIX A
XML Notation
XML (eXtensible Markup Language) is a text format that employs the tag/attribute (or markup)
metaphor in order to represent tree-like data structures. Unlike HTML, tags are not imposed, but
defined by users. Document type definition (DTD) or XML Schema (XSD) specifications define how
data should be structured in a file. With modern parsers such as Xerces [35], it is possible to validate
an XML file, given a definition file, in order to ensure consistency of the data representation and
report semantic errors in the validated file.
In the framework's implementation, Xerces is used as an XML parser, while the document object
model (DOM) stores the XML data internally. DOM converts XML data into a tree-like structure
in which each XML data quantum is a DOM Node. The nodes have children and parents, as well as
attributes. To clarify the notation, here is an example of a node and its attributes in XML format:
<nodeName attribute1="value 1" attribute2="value 2">
<childNode>parsed character data</childNode>
</nodeName>
In the previous example, a node with the tag name "nodeName" has two attributes whose values
are in string format. In fact, every XML node, whatever its data type, is represented as a character
string; an XML document is therefore readable by a human. The "childNode" tag has "nodeName"
as its parent, and its parsed character data can be retrieved by the user. XML is obviously not the
most compact data format, but it offers much more flexibility since the format is known and no
deserialization is needed in order for data to be extracted from a stream.
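As a rough illustration of the tree structure the DOM exposes, the following C++ sketch hand-builds the node of the example above. The Node type is a simplified stand-in written for this sketch, not the actual Xerces DOM API.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Minimal stand-in for a DOM node: a tag name, parsed character data,
// a map of attributes and a list of children. Every value is stored as
// a character string, whatever its logical type.
struct Node {
    std::string tag;
    std::string text;                         // parsed character data
    std::map<std::string, std::string> attrs;
    std::vector<Node> children;
};

// Builds the tree corresponding to the XML example above.
Node buildExample() {
    Node root;
    root.tag = "nodeName";
    root.attrs["attribute1"] = "value 1";
    root.attrs["attribute2"] = "value 2";

    Node child;
    child.tag = "childNode";
    child.text = "parsed character data";
    root.children.push_back(child);
    return root;
}
```

A validating parser such as Xerces produces an equivalent tree automatically from the text form.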
APPENDIX B
Implemented Components
B.l. Actions
B.1.1. Action "moveCursor". Assigns the input token's position parameter to the world
entity's property.
• expected properties in the world entity: "position": Vector<AData>
• expected parameter in the input token: "position": Vector<AData>
B.1.2. Action "reset". Sets the "render" attribute to its opposite value, for entities that
have the "deletable" behaviour.
• expected properties in the world entity: needs to be the "World" object
• expected parameters in the input token: none
B.1.3. Action "traceCursor". Pushes the "position" vector retrieved from the input
token onto the world entity's position list and pops the front item if the list's size is larger than the
"length" property value.
• expected properties in the world entity: "positionList": List<Vector<AData> >,
"length": 1 nteger
• expected parameters in the input token: "position": Vector<AData>
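The bounded-trail update described above can be sketched as follows; the free-standing function and the use of plain double vectors in place of Vector<AData> are illustrative assumptions, not the framework's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

typedef std::vector<double> Position;

// Sketch of the "traceCursor" update: append the newest cursor position
// to the trail, then drop the oldest one once the trail grows beyond the
// "length" property value.
void traceCursor(std::list<Position> &positionList,
                 const Position &position, std::size_t length) {
    positionList.push_back(position);
    if (positionList.size() > length)
        positionList.pop_front();
}
```

Feeding positions in faster than they are dropped keeps the trail at a fixed length, which is what produces the fading-trail effect of the mouse trajectory entity.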
B.1.4. Action "delete". Sets the "render" attribute of the entity to "false".
• expected properties in the world entity: none
• expected parameters in the input token: none
B.1.5. Action "placeImage2D". Creates a new world entity of type "image2D" using a
factory, then sets its name given an instance number. Adds to the newly created entity the behaviours
"deletable", "translationPickable" and "translatable". Properties "ID", "fileName", "position" and
"picked" are added to the new entity, where the "fileName" originates from the action data map
and "position" from the input token.
• expected properties in the world entity: needs to be the "World" object
• expected parameters in the input token: "name": St ri ng (optional), "applicationPoints":
Map<AData, AData>
B.1.6. Action "pick". Sets the validated entity's "picked" property to true.
• expected properties in the world entity: "picked": Boolean, "position": Vector<AData>
• expected parameters in the input token: "applicationPoints": Map<AData, AData>
• validation: performs an OpenGL picking to determine which entity the cursor is on, given
the application points. Tries to grab the lock on the picked entity
B.1.7. Action "drop". Sets the entity's "picked" property to "false" if it was "true" and
releases the lock on the picked entity.
• expected properties in the world entity: "picked": Boolean
• expected parameters in the input token: none
B.1.8. Action "translate". Assigns the input token's "position" attribute to the world
entity's "position" property, which may be two- or three-dimensional.
• expected properties in the world entity: "picked": Boolean, "position": Vector<AData>
• expected parameters in the input token: "position": Vector<AData>
• validation: an entity is valid if the "picked" property is set to "true"
B.1.9. Action "rotate". Assigns the input token's "position" attribute to the world en-
tity's "rotation" property with a given mapping.
• expected properties in the world entity: "picked": Boolean, "rotation": Vector<AData>
• expected parameters in the input token: "position": Vector<AData>
• validation: an entity is valid if the "picked" property is set to "true"
B.1.10. Action "undo". Calls the "undoLastAction" method on the interaction manager
object.
• expected properties in the world entity: none
• expected parameters in the input token: none
B.1.11. Action "redo". Calls the "redoLastAction" method on the interaction manager
object.
• expected properties in the world entity: none
• expected parameters in the input token: none
B.1.12. Action "showSystemInformation". Assigns the interaction manager's state to
the world entity's "text" property.
• expected properties in the world entity: "text": String
• expected parameters in the input token: none
B.1.13. Action "stateChange". Does not perform anything; it is only used for a state
change in the interaction manager.
B.1.14. Action "placeModel3D". Creates a new world entity of type "model3D" using
a factory, then sets its name given an instance number. Adds to the newly created entity the
behaviours "deletable", "translationPickable", "rotationPickable", "translatable", "rotatable" and
"texturable". Properties "ID", "fileName", "position", "rotation", "textures" and "picked" are
added to the new entity, where the "fileName" originates from the action data map and "position"
from the input token.
• expected properties in the world entity: needs to be the "World" object
• expected parameters in the input token: "name": String (optional), "applicationPoints":
Map<AData, AData>
B.1.15. Action "putTexture". Given the picking results, sets the object's texture to the
file name found in the action's data map and unlocks the entity.
• expected properties in the world entity: "textures": Vector<String> , "picked":
Boolean, "position": Vector<AData>
• expected parameters in the input token: "applicationPoints": Map<AData, AData>
• validation: performs an OpenGL picking to determine which entity the cursor is on, given
the application points. Tries to grab the lock on the picked entity
B.2. World Entities
B.2.1. Image 2D. Displays a two-dimensional image in an orthographic OpenGL view
("Display2D"). The dimensions of the texture (height and width) should be a power of two. The
rendering method first loads the texture whose file name is specified in the "fileName" property of type
String. It also places the loaded texture at the specified "position" property of type Vector<AData>.
B.2.2. Model 3D. Displays a 3D Studio Max model in a three-dimensional perspective
view ("Display3D"). The rendering method tries to load the 3D model whose file name is specified
in the "fileName" property of type String. The "position": Vector<AData> property is used to place
the 3D model in the 3D space, whereas the "rotation": Vector<AData> is used to set the rotation
angles, specified in degrees around each of the three axes. The "selected": Boolean property is used
to draw a bounding box around the 3D model, if set to "true". The "textures": Vector<String> is
used to apply textures on the sub-objects of the 3ds model. The OpenSceneGraph version acts the
same way, but displays any kind of supported model in a 3D OpenSceneGraph output modality.
B.2.3. Mouse Cursor. Displays a virtual mouse cursor in an orthographic view ("Dis-
play2D"). The rendering method places the cursor on the screen according to the "position":
Vector<AData> property. It also assigns the cursor a color set by the "color": Vector<AData>
property. The shape of the cursor is a filled circle. There is also an OpenSceneGraph version of the
mouse cursor.
B.2.4. Mouse Trajectory. Displays a mouse fading trail in an orthographic view ("Dis-
play2D"). Lines are drawn from the mouse cursor according to the "positionList":
Vector<Vector<AData> > with an alpha parameter decreasing to zero for the last segment. The
"color": Vector<AData> property specifies the color in which the fading trail is displayed. There is
also an OpenSceneGraph version of the mouse trajectory.
B.2.5. Text 3D. Displays a three-dimensional string of text in a "Display3D". The character
string is set by the property "text": String, whereas the font file name is specified by the property
"fontName": String. The location of the text is set by the "position": Vector<AData> property.
There is also an OpenSceneGraph version of the 3D text.
B.3. Input Modalities
B.3.1. Glove-based Gesture Recognition. Interfaces a P5 glove whose input data can
be recorded or played back.
Emitted tokens:
• "moveCursor", which contains the "position": Vector<AData> parameter
• "still", emitted after half a second of absence of movement with a threshold specified in
the configuration file. The token parameters contain the gesture application point
B.3.2. Mouse-based Gesture Recognition. Interfaces a mouse with the RAWInput
API, whose input data can be recorded and played back.
Emitted tokens:
• "moveCursor", which contains the "position": Vector<AData> parameter
• "still", emitted after half a second of absence of movement. The token parameters contain
the gesture application point
B.3.3. Vision-based Gesture Recognition. Interfaces the video tracker which sends
hand positions through the network. Those positions can be recorded and played back.
Emitted tokens:
• "moveCursor", which contains the "position": Vector<AData> parameter
• "still", emitted after half a second of absence of movement with a threshold specified in
the configuration file. The token parameters contain the gesture application point
B.4. Output Modalities
B.4.1. Display 2D. Sets an orthographic view in screen coordinates.
B.4.2. Display 3D. Sets a perspective view with a viewing angle of 30 degrees. Translates
the entire world, such that the (0, 0) coordinate is at the center of the screen.
B.4.3. Open Scene Graph Display 2D. Adds an orthographic projection branch to the
rendering graph in order to add world entities that should be displayed on the screen plane.
B.4.4. Open Scene Graph Display 3D. Adds a position-attitude transform node to the
rendering graph.
APPENDIX C
Sample XML Configuration File
<MMF>
<InputModalities>
<InputModality name="mouseGestures" type="DynamicGestures">
<AXML>
<map name="instance">
<string name="type" value="MouseBasedHMMGestureRecognizer"/>
<map name="data">
<string name="dataFile" value="mouseGesture_large.1"/>
<int name="smoothingBuffer" value="3"/>
<int name="buffer" value="100"/>
<bool name="useContext" value="true"/>
<string name="logFile_" value="mouseGesture_recognitionTest1.log.13"/>
<string name="logDirection" value="input"/>
<bool name="logNeedSleep" value="true"/>
<bool name="training" value="true"/>
</map>
</map>
<string name="mode" value="events"/>
<int name="frameRate" value="30"/>
<string name="logFile_" value="test.log"/>
</AXML>
</InputModality>
</InputModalities>
<OutputModalities>
<OutputModality name="canvas" type="Display2D">
<AXML>
<int name="frameRate" value="30"/>
</AXML>
</OutputModality>
<OutputModality name="userInterface" type="Display2D">
<AXML>
<int name="frameRate" value="30"/>
<string name="behaviour" value="drawOnTop"/>
</AXML>
</OutputModality>
<OutputModality name="3DEnvironment" type="Display3D">
<AXML>
<int name="frameRate" value="30"/>
</AXML>
</OutputModality>
</OutputModalities>
<World>
<Hook name="Grid">
<AXML>
<int name="lineNumber" value="50"/>
</AXML>
</Hook>
<Behaviours>
<Behaviour name="resetable"/>
<Behaviour name="placeable"/>
<Behaviour name="undoable"/>
<Behaviour name="redoable"/>
<Behaviour name="textureEnabled"/>
</Behaviours>
<WorldEntity name="text" type="text3D">
<AXML>
<string name="fontName" value="Teen.ttf"/>
<string name="text" value="State information"/>
<vector name="position">
<int name="x" value="-100"/>
<int name="y" value="100"/>
<int name="z" value="-400"/>
</vector>
</AXML>
<Behaviours>
<Behaviour name="informable"/>
</Behaviours>
</WorldEntity>
<WorldEntity name="chair" type="model3D">
<AXML>
<string name="fileName" value="..\Data\Models\chair.3ds"/>
<vector name="position">
<int name="x" value="0"/>
<int name="y" value="200"/>
<int name="z" value="0"/>
</vector>
<vector name="rotation">
<int name="x" value="0"/>
<int name="y" value="0"/>
<int name="z" value="0"/>
</vector>
</AXML>
<Behaviours>
<Behaviour name="deletable"/>
<Behaviour name="selectable">
<AXML>
<bool name="selected" value="false"/>
</AXML>
</Behaviour>
<Behaviour name="translationPickable"/>
<Behaviour name="translatable">
<AXML>
<string name="ID" value="mouse0"/>
</AXML>
</Behaviour>
<Behaviour name="rotationPickable"/>
<Behaviour name="rotatable">
<AXML>
<string name="ID" value="mouse0"/>
</AXML>
</Behaviour>
<Behaviour name="texturable"/>
</Behaviours>
</WorldEntity>
<WorldEntity name="image" type="image2D">
<AXML>
<string name="fileName" value="..\Data\test.bmp"/>
<vector name="position">
<int name="x" value="150"/>
<int name="y" value="150"/>
</vector>
</AXML>
<Behaviours>
<Behaviour name="selectable"/>
<Behaviour name="translatable">
<AXML>
<string name="ID" value="mouse0"/>
</AXML>
</Behaviour>
<Behaviour name="translationPickable"/>
<Behaviour name="scalable"/>
<Behaviour name="deletable"/>
</Behaviours>
</WorldEntity>
<WorldEntity name="cursorLeft" type="mouseCursor">
<Behaviours>
<Behaviour name="mouseMovable">
<AXML>
<string name="ID" value="mouse0"/>
</AXML>
</Behaviour>
<Behaviour name="drawOnTop"/>
</Behaviours>
<AXML>
<map name="color">
<int name="r" value="255"/>
<int name="g" value="255"/>
<int name="b" value="0"/>
</map>
</AXML>
</WorldEntity>
<WorldEntity name="trajectoryLeft" type="mouseTrajectory">
<AXML>
<map name="color">
<int name="r" value="255"/>
<int name="g" value="255"/>
<int name="b" value="0"/>
</map>
<int name="length" value="100"/>
</AXML>
<Behaviours>
<Behaviour name="mouseTraceable">
<AXML>
<string name="ID" value="mouse0"/>
</AXML>
</Behaviour>
<Behaviour name="drawOnTop"/>
</Behaviours>
</WorldEntity>
<WorldEntity name="trajectoryRight" type="mouseTrajectory">
<AXML>
<map name="color">
<int name="r" value="0"/>
<int name="g" value="255"/>
<int name="b" value="0"/>
</map>
<int name="length" value="100"/>
</AXML>
<Behaviours>
<Behaviour name="mouseTraceable">
<AXML>
<string name="ID" value="mouse1"/>
</AXML>
</Behaviour>
<Behaviour name="drawOnTop"/>
</Behaviours>
</WorldEntity>
<WorldEntity name="cursorRight" type="mouseCursor">
<AXML>
<map name="color">
<int name="r" value="0"/>
<int name="g" value="255"/>
<int name="b" value="0"/>
</map>
</AXML>
<Behaviours>
<Behaviour name="mouseMovable">
<AXML>
<string name="ID" value="mouse1"/>
</AXML>
</Behaviour>
<Behaviour name="drawOnTop"/>
</Behaviours>
</WorldEntity>
</World>
<Grammar>
<Action type="reset" activationToken="omega" when="idle">
<Behaviour name="resetable"/>
</Action>
<Action type="placeMarker2D_" activationToken="placePoint" when="idle">
<Behaviour name="placeable"/>
</Action>
<Action type="stateChange" activationToken="coeur" when="idle" becomes="placing">
<Behaviour name="placeable"/>
</Action>
<Action type="placeModel3D" activationToken="croix" when="placing">
<AXML>
<string name="ID" value="mouse0"/>
<string name="fileName" value="..\Data\Models\chair.3ds"/>
</AXML>
<Behaviour name="placeable"/>
</Action>
<Action type="placeModel3D" activationToken="triangle" when="placing">
<AXML>
<string name="ID" value="mouse0"/>
<string name="fileName" value="..\Data\Models\fridge.3ds"/>
</AXML>
<Behaviour name="placeable"/>
</Action>
<Action type="placeModel3D" activationToken="carre" when="placing">
<AXML>
<string name="ID" value="mouse0"/>
<string name="fileName" value="..\Data\Models\plant01.3ds"/>
</AXML>
<Behaviour name="placeable"/>
</Action>
<Action type="stateChange" activationToken="coeur" when="placing" becomes="idle">
<Behaviour name="placeable"/>
</Action>
<Action type="placeImage2D" activationToken="croix" when="idle">
<AXML>
<string name="ID" value="mouse0"/>
<string name="fileName" value="..\Data\test.bmp"/>
</AXML>
<Behaviour name="placeable"/>
</Action>
<Action type="delete" activationToken="delete" when="idle">
<Behaviour name="deletable">
<AXML>
<vector name="position"/>
</AXML>
</Behaviour>
</Action>
<Action type="moveCursor" activationToken="moveCursor" when="any">
<Behaviour name="mouseMovable">
<AXML>
<vector name="position">
<int name="x"/>
<int name="y"/>
</vector>
</AXML>
</Behaviour>
</Action>
<Action type="traceCursor" activationToken="moveCursor" when="any">
<Behaviour name="mouseTraceable">
<AXML>
<list name="positionList">
<vector name="position">
<int name="x"/>
<int name="y"/>
</vector>
</list>
</AXML>
</Behaviour>
</Action>
<Action type="translate" activationToken="moveCursor" when="translating">
<Behaviour name="translatable">
<AXML>
<vector name="position"/>
</AXML>
</Behaviour>
</Action>
<Action type="rotate" activationToken="moveCursor" when="rotating">
<Behaviour name="rotatable">
<AXML>
<vector name="rotation"/>
</AXML>
</Behaviour>
</Action>
<Action type="pick" activationToken="white" when="idle" becomes="translating">
<Behaviour name="translationPickable">
<AXML>
<vector name="position"/>
<bool name="picked"/>
</AXML>
</Behaviour>
</Action>
<Action type="pick" activationToken="carre" when="idle" becomes="rotating">
<Behaviour name="rotationPickable">
<AXML>
<vector name="position"/>
<bool name="picked"/>
</AXML>
</Behaviour>
</Action>
<Action type="stateChange" activationToken="texture" when="idle" becomes="texturing">
<Behaviour name="textureEnabled"/>
</Action>
<Action type="stateChange" activationToken="texture" when="texturing" becomes="idle">
<Behaviour name="textureEnabled"/>
</Action>
<Action type="putTexture" activationToken="croix" when="texturing">
<AXML>
<string name="textureName" value="..\Data\Models\03700447.bmp"/>
</AXML>
<Behaviour name="texturable">
<AXML>
<vector name="textures"/>
</AXML>
</Behaviour>
</Action>
<Action type="putTexture" activationToken="start" when="texturing">
<AXML>
<string name="textureName" value="..\Data\Models\dchrfab.bmp"/>
</AXML>
<Behaviour name="texturable">
<AXML>
<vector name="textures"/>
</AXML>
</Behaviour>
</Action>
<Action type="putTexture" activationToken="stop" when="texturing">
<AXML>
<string name="textureName" value="..\Data\Models\couch.bmp"/>
</AXML>
<Behaviour name="texturable">
<AXML>
<vector name="textures"/>
</AXML>
</Behaviour>
</Action>
<Action type="drop" activationToken="still" when="translating" becomes="idle">
<Behaviour name="translationPickable">
<AXML>
<vector name="position"/>
<bool name="picked"/>
</AXML>
</Behaviour>
</Action>
<Action type="drop" activationToken="still" when="rotating" becomes="idle">
<Behaviour name="rotationPickable">
<AXML>
<vector name="position"/>
<bool name="picked"/>
</AXML>
</Behaviour>
</Action>
<Action type="drop" activationToken="drop" when="any" becomes="idle">
<!-- This action resets the current action status of the entity: it
sets its status to idle and does nothing more... -->
</Action>
<Action type="undo" activationToken="pointe" when="idle" becomes="idle">
<Behaviour name="undoable"/>
</Action>
<Action type="redo" activationToken="pointe_inv" when="idle" becomes="idle">
<Behaviour name="redoable"/>
</Action>
<Action type="showSystemInformation" activationToken="systemInfo" when="any">
<Behaviour name="informable">
<AXML>
<string name="text"/>
</AXML>
</Behaviour>
</Action>
<Action type="stateChange" activationToken="white" when="placing">
<Behaviour name="placeable"/>
<!-- Just a dummy action that acts in fact as a garbage model... -->
</Action>
<Action type="stateChange" activationToken="white" when="texturing">
<Behaviour name="textureEnabled"/>
<!-- Just a dummy action that acts in fact as a garbage model... -->
</Action>
</Grammar>
<Network>
<Connection type="server" port="76849"/>
</Network>
</MMF>
APPENDIX D
User Manual
D.l. Introduction
The current software framework is intended to provide multimodal applications with a virtual
world model and standard interfaces for input and output modalities. Several different modalities
were implemented, namely mouse, glove and vision-based gesture recognition systems on the input
side, and OpenGL and OpenSceneGraph on the output side. This manual first describes the
prerequisites and installation instructions. Directions on how to adapt the framework for specific needs
are then presented, while stating concrete examples throughout the description.
D.2. Prerequisites
The software framework depends on several freely available software libraries, which provide
classes that implement mechanisms for improved generality. The following list states those
prerequisites:
• ACE OS Wrapper: C++ library that acts as an operating system wrapper in order for
users to write OS-independent code. ACE provides classes for threads, network sockets
as well as many other functions that would not otherwise be standard on every operating
system
• Xerces-C++: portable XML reader that provides a DOM representation of a file, and
utilities in order to retrieve the different parameters and attributes
• LTI-Lib: portable C++ library that implements mathematical operations commonly used
in computer vision and artificial intelligence
• OpenGL: portable environment for developing 2D and 3D graphics applications
• FTGL: portable library used to display fonts in a three-dimensional OpenGL window
• AData: portable C++ library that acts as a common generic data format in the entire
framework
• Server (optional): C library employed for communication between the vision-based tracker
available in the SRE and the corresponding input modality implemented in the current
framework
• OpenThreads, Producer and OpenSceneGraph (optional on Windows): set of portable
C++ libraries that implement a scene graph and display system in order to represent data
that is to be rendered by OpenGL. It also provides utilities needed to manage the mouse
and keyboard interfaces
D.3. Installation
The software framework was tested on Windows XP and the Fedora Core 3 Linux distribution.
The general components are compiled and linked in a shared library in order to be used by your
application, but the compilation process that will be described in the following sections is different
for the two platforms.
D.3.1. Microsoft Windows Installation. Prerequisites: Download, compile and install
the aforementioned libraries. Be sure to add to the "PATH" environment variable every directory
in which the binary files of each library are located.
Software framework:
(1) Get the source code, project and data files from the CVS repository
(2) Open the solution "MultimodalFramework.sln" with Microsoft Visual Studio .NET or
later1
(3) Choose "Build", "Batch build", select every project and click "Build"
(4) No compilation or linking errors should be encountered; if any occur, check that the
"include" and "library" paths include the directories associated with the different
prerequisites. Those settings can be added in the menu: "Tools", "Options",
"Projects", "VC++ Directories"
1 Visual Studio 6.0 will not compile the LTI-Lib because of its heavy use of templates, which are not correctly supported in this old version.
(5) Two applications are available that can be set as the start-up project: an MFC-based
user interface (Project "MultimodalFramework") or an OpenSceneGraph-based software
(Project "OSGFramework")
(6) Press "F5" for debugging the application or "CTRL-F5" to run the program
D.3.2. Linux Installation. Prerequisites: Download, compile and install the aforemen-
tioned libraries. If the installed shared libraries are not located in standard directories, be sure to
add their paths to the "LD_LIBRARY_PATH" environment variable, or to the "/etc/ld.so.conf" file
and run the "ldconfig" command subsequently.
Software framework:
(1) Get the source code, Makefiles and data files from the CVS repository
(2) Type "make all" in order to build every implemented shared and dynamic library as well
as an OpenSceneGraph-based application
(3) If there are compilation or linking errors, check the include and library paths that should
be composed of the prerequisite libraries' paths
(4) In order to run the application, change the directory to "OSGFramework" and run
"./OSGFramework"
D.4. How to Extend the Framework?
You have to adapt several software components, summarized in Table D.1, for your specific
applications.
D.4.1. Input Modality. In order to initiate the data pipeline, you have to instantiate input
modalities. The currently implemented input modality is a continuous dynamic gesture recognizer
that interfaces a mouse, a data glove and a video-based tracker. Other examples of input modalities
are a static gesture recognizer, a speech recognizer, a gaze tracker or any other device that would
provide information on the user's status.
In order to specialize an input modality, the following methods are available for overloading:
• init(const Data::AData &data): provides initialization data that you specify either from an
XML file or hard coded values
• fini(): acts as a termination method in order to clean up dynamically allocated objects
Component | Purpose | Overloaded methods (italic = abstract)
Input modality | Get data from logical entities, being in the real or virtual world | init, emitToken, fini, start, stop
Output modality | The output side of the stream, instantiated in order to render virtual data in the real world | init, fini, renderWorld
World entity | Logical entity representing a virtual object that owns properties and shows behaviours | init, fini, render
Action | Logical entity that acts on world entities by changing their properties according to the action's specifications | doApply, validateEntities
World hook | Contains data that is rendered by output modalities, on which input modalities do not have any influence | init, render
TABLE D.1. Summary of software components that need to be specialized
• start(): typically activates the input modality's thread
• stop(): typically terminates the input modality's thread
• emitToken(InputToken *token): normally called to set modality-specific parameters in the
input token and add it to the interaction manager's queue. The input token is created in
the modality's thread, from which the emitToken call originates
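The shape of such a specialization can be sketched as follows. The base class and the token type below are simplified stand-ins written for this sketch; the framework's real InputModality takes Data::AData arguments, runs its own thread, and pushes tokens to the interaction manager's queue.

```cpp
#include <cassert>
#include <queue>
#include <string>

// Stand-in for the framework's InputToken (the real one carries AData
// parameters rather than just a name).
struct InputToken { std::string name; };

// Stand-in base class listing the five overloadable methods above.
class InputModality {
public:
    virtual ~InputModality() {}
    virtual void init(const std::string &data) = 0;  // initialization data
    virtual void fini() = 0;                          // cleanup
    virtual void start() = 0;                         // activate the thread
    virtual void stop() = 0;                          // terminate the thread
    virtual void emitToken(InputToken *token) = 0;    // queue a token
};

// A toy specialization that "emits" tokens into a local queue instead of
// the interaction manager's queue, and only while started.
class ToyRecognizer : public InputModality {
public:
    std::queue<InputToken> emitted;
    bool running;
    ToyRecognizer() : running(false) {}
    void init(const std::string &) {}
    void fini() {}
    void start() { running = true; }
    void stop()  { running = false; }
    void emitToken(InputToken *token) {
        if (running) emitted.push(*token);
    }
};
```

In the real framework the thread started by start() is the one that creates the token and calls emitToken.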
D.4.2. Output Modality. In order to render virtual objects in the real world, you have to
instantiate output modalities. The currently implemented output modalities are 2D and 3D views,
using either pure OpenGL primitives or OpenSceneGraph for data structure and rendering man-
agement. Other possible output modalities are sound output, haptics devices or more sophisticated
projection systems.
In order to specialize an output modality, the following methods are available for overloading:
• init(const Data::AData &data): provides initialization data that you specify either from an
XML file or hard coded values
• fini(): acts as a termination method in order to clean up dynamically allocated objects
• renderWorld(): typically invoked to perform entity-independent operations that must be
performed before calling the render method on every world entity, which is the default
behaviour of this method
D.4.3. World Entity. A virtual world is composed of world entities that are to be added
during the initialization stage or at runtime by corresponding actions. An entity is defined as a
logical representation of an object. It contains named properties in the form of AData, which are
generic data containers, removing the type constraint. Typical world entities can be a 3D model, a
2D image, a mouse cursor, a 3D sound object or a bumpy virtual surface.
In order to specialize a world entity, the following methods are available for overloading:
• init(const Data::AData &data): provides initialization data that you specify either from an
XML file or hard coded values
• fini(): used as a termination method in order to clean up dynamically allocated objects
• render(OutputModality *modality): performs the function calls needed to render the world
entity in the given output modality. A dynamic_cast operation is typically performed on
the modality pointer in order to discover in which modality type the rendering calls are to
be made
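A minimal sketch of such an entity follows. The types here are simplified stand-ins for the framework's headers, and ImageEntity and View2D are hypothetical examples used to illustrate the dynamic_cast convention:

```cpp
#include <map>
#include <string>

// Simplified stand-ins for the framework's generic property container and modalities.
namespace Data { using AData = std::map<std::string, std::string>; }
struct OutputModality { virtual ~OutputModality() = default; };
struct View2D : OutputModality { std::string lastDrawn; };   // hypothetical 2D view

// Hypothetical specialization: a 2D image with named, untyped properties.
class ImageEntity {
public:
    void init(const Data::AData &data) { properties_ = data; }   // properties from XML or code
    void fini() {}                                               // release resources here
    void render(OutputModality *modality) {
        // Discover which concrete modality the rendering calls should target.
        if (auto *view = dynamic_cast<View2D*>(modality))
            view->lastDrawn = properties_["file"];               // 2D-specific drawing
        // A 3D view, a sound output, etc. would be handled by further casts.
    }
private:
    Data::AData properties_;
};
```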
D.4.4. Action. Actions are instantiated in order for data originating from input modalities
to affect the world entities such that they will influence properties, given data packed in an input
token. Typical examples of actions can be placing a 3D model, translating an entity, rotating an
entity and changing the color or texture of a model.
In order to specialize an action, the following methods are available for overloading:
• doApply(WorldEntity *entity, InputToken *token): modifies the entity's properties according
to data located in the input token. The doApply method is called if the entity's behaviours
correspond to the action's requirements, which is verified in the interaction manager. A
more detailed description of this process is beyond the scope of this user manual and can
be found in Section 4.6
• validateEntities(std::list&lt;WorldEntity*&gt; &entitiesList, InputToken *token): used to select
amongst a list of possible entities the ones to which the action has to be applied, given
an input token. A validation method call typically occurs when entities from the virtual
world have to be picked by a user pointing to a specific location
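The two methods might be specialized as in the sketch below. The InputToken and WorldEntity definitions are simplified stand-ins, and TranslateAction, along with its dx/dy fields and selectable flag, is a hypothetical example:

```cpp
#include <list>

// Simplified stand-ins for the framework's token and entity types.
struct InputToken { double dx = 0, dy = 0; };
struct WorldEntity { double x = 0, y = 0; bool selectable = true; };

// Hypothetical specialization: an action that translates an entity by a token's delta.
class TranslateAction {
public:
    int doApply(WorldEntity *entity, InputToken *token) {
        // Invoked once the interaction manager has verified that the entity's
        // behaviours correspond to the action's requirements.
        entity->x += token->dx;
        entity->y += token->dy;
        return 0;                                    // 0 = success, by the framework's convention
    }
    void validateEntities(std::list<WorldEntity*> &entitiesList, InputToken * /*token*/) {
        // Keep only the entities the action may be applied to, given the token.
        entitiesList.remove_if([](WorldEntity *e) { return !e->selectable; });
    }
};
```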
D.4.5. World Hook. A virtual world might contain static data, which is to be rendered
in the real world. A typical example of a world hook could be the virtual room in which you are
located, or a background sound over which you do not have any influence.
In order to specialize a world hook, the following methods are available for overloading:
• init(const Data::AData &data): provides initialization data to the hook that you specify
either from an XML file or hard coded values
• render(OutputModality *modality): renders the hook in the specified output modality. A
dynamic_cast operation on the modality parameter is typically performed in order to ensure
that the hook can be rendered
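As a short sketch, a background-sound hook might look as follows; the types are simplified stand-ins, and BackgroundSoundHook and SoundOutput are hypothetical examples:

```cpp
#include <map>
#include <string>

// Simplified stand-ins for the framework's types.
namespace Data { using AData = std::map<std::string, std::string>; }
struct OutputModality { virtual ~OutputModality() = default; };
struct SoundOutput : OutputModality { std::string playing; };  // hypothetical sound modality

// Hypothetical specialization: a background sound the user has no influence over.
class BackgroundSoundHook {
public:
    void init(const Data::AData &data) { file_ = data.at("file"); }  // from XML or hard-coded
    void render(OutputModality *modality) {
        // Ensure the hook can actually be rendered in this modality.
        if (auto *sound = dynamic_cast<SoundOutput*>(modality))
            sound->playing = file_;
    }
private:
    std::string file_;
};
```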
D.5. Putting it All Together
In order to use the framework effectively, the specialized components detailed above have to be
instantiated, initialized and started. In order to facilitate this process, you can write an XML file
that specifies the type and parameters of every software component. The only object that you need
to instantiate explicitly is one of type Instance. A call to the init method with an XML file path
as an argument then creates and initializes the components. A subsequent call to the start method
activates the input modalities, hence making the data flow through the pipeline until reaching the
output modalities. There is however one subtle technicality with the latter class objects. In the
majority of display software, the rendering methods are always called from the main loop. Therefore,
you must first retrieve the instantiated output modalities from the OutputManager object and then
call the renderWorld method on each of them, in the main program loop.
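The overall flow described above can be sketched as follows. The Instance, OutputManager and OutputModality definitions here are minimal stand-ins (including the member names used to reach the modality list); in the real framework these objects are created and wired from the XML configuration file:

```cpp
#include <string>
#include <vector>

// Minimal stand-ins for the framework's classes and their assumed member names.
struct OutputModality {
    bool rendered = false;
    void renderWorld() { rendered = true; }         // would draw the world in this modality
};
struct OutputManager {
    std::vector<OutputModality*> modalities;
};
struct Instance {
    OutputManager outputManager;
    int init(const std::string &xmlPath) {          // creates and initializes the components
        (void)xmlPath;                              // the real init parses the XML file
        outputManager.modalities = { new OutputModality, new OutputModality };
        return 0;                                   // 0 = success by convention
    }
    int start() { return 0; }                       // activates the input modalities
};

// One iteration of the main program loop: render every output modality.
int runFrame(Instance &instance) {
    for (OutputModality *m : instance.outputManager.modalities)
        m->renderWorld();                           // must be called from the main loop
    return 0;
}
```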
Object instantiation relies on factories, which know how to create concrete instances. The
specific factories have to overload the following methods: createInputModality, createOutputModality,
createWorldEntity and createAction, each taking a character string as an argument and returning
a pointer to the newly created object or the NULL pointer. A user-defined factory is registered
automatically in the factory manager as soon as one instance of it is created.
For increased flexibility, you can build shared libraries that will be loaded dynamically at
runtime. It is particularly interesting to employ that scheme in order to limit the dependencies
between the different software modules. The object factories try to open a shared library that has
the same name as the requested object type.² When the library loads successfully, the factory looks
for the ObjectFactory symbol that you must define using the GENERIC_FACTORY_MACRO(CHILD,
PARENT) macro, where CHILD is the child class type and PARENT the parent type, which is to be returned
by the ObjectFactory function.
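The factory scheme can be illustrated with the following sketch. The registry and class names here are simplified stand-ins, not the framework's actual factory manager; what it demonstrates is the convention above: a type-name string maps to a creator, the creator returns a parent-class pointer (or NULL on an unknown type), and a user-defined factory registers itself as soon as one instance is created:

```cpp
#include <map>
#include <string>

// Simplified stand-ins: a parent type and one hypothetical concrete child.
struct WorldEntity { virtual ~WorldEntity() = default; };
struct ImageEntity : WorldEntity {};

using EntityCreator = WorldEntity *(*)();
std::map<std::string, EntityCreator> g_factoryManager;   // stands in for the factory manager

// Hypothetical user-defined factory: registration happens in the constructor,
// so creating one instance is enough to register it.
struct MyFactory {
    MyFactory() {
        g_factoryManager["ImageEntity"] = []() -> WorldEntity * { return new ImageEntity; };
    }
    // Corresponds to overloading createWorldEntity: return the new object, or NULL.
    WorldEntity *createWorldEntity(const std::string &type) {
        auto it = g_factoryManager.find(type);
        return it == g_factoryManager.end() ? nullptr : it->second();
    }
};
```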
D.6. How to Use the Gesture Recognizer?
A continuous gesture recognizer was implemented as a proof of concept for the current framework.
Since it is the most important feature that has to be controlled interactively, this section
shows you how to train gesture models and how to obtain reasonable recognition results afterwards.
D.6.1. Training Procedure. In order to interact with the software, you use the keyboard
as an input to the gesture trainer. The available commands are summarized in Table D.2.
Command  Parameters            Purpose
f        File name (optional)  When a parameter is specified, sets the gesture database file name to the provided parameter; otherwise, prints the current file name
l        none                  Lists the currently available models and IDs
m        Model ID (optional)   When a parameter is specified, sets the current model ID to the provided parameter; otherwise, prints the current model ID
o        none                  Opens the file whose name was specified by the file name command and loads the gesture models
r        none                  Resets the current model training data
R        none                  Resets all models and training data
s        none                  Saves the models in the file whose name is specified by the set file name command
t        on/off (optional)     Toggles the training/recognition status or sets it to the value specified as a parameter
?        none                  Displays a help message describing all the available commands
return   none                  Starts or stops the training of a gesture whose model ID was specified by the corresponding command
TABLE D.2. Summary of the gesture trainer keyboard interface commands
The training stage that you must perform in order to obtain gesture models is managed
through the preceding commands. You add a new gesture with the "m" command, start capturing
data by pressing the <RETURN> key, and press <RETURN> again to stop the capture and perform a training
² A "d" is concatenated at the end of the class name if the _DEBUG flag is set.
pass. Normally, you should perform on the order of twenty gestures to acquire reliable models. The
output console shows the average score that results from performing the recognition on all
the training sequences provided, along with the number of gesture samples involved in training the current
model. To ensure a properly trained model, check that the score is not stuck in a local minimum, and
continue adding training sequences until it converges to a reasonable value. You should also make
sure that you train the models with sufficiently sparse data, such that a minimal number of errors
will occur during the recognition stage. The following describes the steps that should be followed in
order to achieve a complete training procedure:
• Type the command "R" in order to reset the current models
• Repeat until every gesture model is trained:
• Type "m <gestureID>" in order to set the current gesture ID
• Repeat until the current model has converged:
• Press <RETURN> to start capturing positions
• Perform the actual gesture with the position capturing device
• Press <RETURN> again to stop capturing positions
• Type "f <filename>" to set the file name in which the models will be saved
• Type "s" to save the models in the file
When the models have already been saved in a file and you want to load them in your gesture
database, you have to follow this procedure:
(1) If you do not want to keep the gestures that are already in your database, type the
command "R", which resets and removes every model
(2) Type "f <filename>" to set the file name that you want to load
(3) Type "o", which will open the file and load the gesture models in your database
D.6.2. Recognition Procedure. After having trained your gesture models, you are now
ready to perform recognition with positions that you continuously provide. The recognition
algorithm will spot gestures over time and emit tokens when appropriate conditions are met. You should
issue the command "t off" in order to initiate the recognition algorithm.
In order to obtain the best recognition results from your sequences:
• choose gestures that are dissimilar for multiple actions that can be applied concurrently
to one object.
• when a deletion error occurs, repeat the gesture several times. If it is still not being
spotted, move away from the target point and return to repeat the gesture. Moving away
should reset the wrong hypotheses that were confusing the recognizer.
• when a substitution or insertion error occurs and results in an incorrect action invocation,
use the undo feature.
• when the environment is cluttered with several objects, ensure that the gesture is performed
precisely on top of the object to which you want to apply the action. Since gesture spotting
is performed automatically, the starting point is sometimes falsely identified, which results
in the wrong entity being selected.
D.7. Troubleshooting and Advice
Several runtime errors can occur when an instance is started. The best way to retrieve these
is to track them in the output console, which displays several messages as to where an error could
have occurred. There is also a convention in the return value of several interface methods that you
should observe carefully. Most of the interface methods return an integer value that contains an
error code. A value of "0" means that the method completed correctly. A value of "-1" means
that an error occurred and that you should abort the current procedure and fix it. Other values are
reserved for additional error codes, for which a convention should be applied.
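The convention can be sketched as follows; the function names are hypothetical and stand in for any framework interface method that follows the 0 / -1 return-value rule:

```cpp
// Hypothetical calls following the framework's return-value convention:
// 0 means success, -1 means an error occurred and the procedure should abort.
int initModality(bool deviceAvailable) { return deviceAvailable ? 0 : -1; }
int startModality() { return 0; }

// Typical checking pattern: stop at the first -1 and propagate it upward,
// so the caller can abort the current procedure and fix the problem.
int setUp(bool deviceAvailable) {
    if (initModality(deviceAvailable) == -1) return -1;
    if (startModality() == -1) return -1;
    return 0;
}
```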
Logging facilities are also provided in order to output data coming from input modalities to a
file. Reading back the file and using the data as input provides a way to capture data only once and
to tweak the different parameters and grammar thereafter.
Finally, when you create new classes that extend the framework, be sure to respect the
standards that were set by the author. Review the code that implements input and output modalities,
actions, world entities and world hooks, and adopt the same coding style. The source code is amply
commented, which should guide you in writing applications that are based on the current framework.