SOFTWARE FRAMEWORK FOR PARSING AND INTERPRETING
GESTURES IN A MULTIMODAL VIRTUAL ENVIRONMENT
CONTEXT
François Rioux
Department of Electrical and Computer Engineering
McGill University, Montréal
June 2005
A thesis submitted to the Faculty of Graduate Studies and Research
in partial fulfilment of the requirements of the degree of
Master of Engineering
© FRANÇOIS RIOUX, 2005
Library and Archives Canada
Bibliothèque et Archives Canada
Published Heritage Branch
Direction du Patrimoine de l'édition
395 Wellington Street, Ottawa ON K1A 0N4, Canada
395, rue Wellington, Ottawa ON K1A 0N4, Canada
NOTICE: The author has granted a nonexclusive license allowing Library and Archives Canada to reproduce, publish, archive, preserve, conserve, communicate to the public by telecommunication or on the Internet, loan, distribute and sell theses worldwide, for commercial or noncommercial purposes, in microform, paper, electronic and/or any other formats.
The author retains copyright ownership and moral rights in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.
In compliance with the Canadian Privacy Act some supporting forms may have been removed from this thesis.
While these forms may be included in the document page count, their removal does not represent any loss of content from the thesis.
AVIS:
Your file / Votre référence ISBN: 978-0-494-22666-7
Our file / Notre référence ISBN: 978-0-494-22666-7
L'auteur a accordé une licence non exclusive permettant à la Bibliothèque et Archives Canada de reproduire, publier, archiver, sauvegarder, conserver, transmettre au public par télécommunication ou par l'Internet, prêter, distribuer et vendre des thèses partout dans le monde, à des fins commerciales ou autres, sur support microforme, papier, électronique et/ou autres formats.
L'auteur conserve la propriété du droit d'auteur et des droits moraux qui protège cette thèse. Ni la thèse ni des extraits substantiels de celle-ci ne doivent être imprimés ou autrement reproduits sans son autorisation.
Conformément à la loi canadienne sur la protection de la vie privée, quelques formulaires secondaires ont été enlevés de cette thèse.
Bien que ces formulaires aient été inclus dans la pagination, il n'y aura aucun contenu manquant.
Abstract
Human-computer interaction (HCI) is a research topic whose eventual outcome will provide users of
computer systems with more natural interfaces than the traditional keyboard and mouse. Ideally,
those interfaces would exploit the same communication channels used in everyday life, namely
speech, gestures or any other expressive feature of the human body. In this thesis, a continuous
dynamic gesture recognition system is presented. Positioning devices such as a mouse, a data glove
or a video camera are used as input streams to the recognition module, from which it extracts
the most interesting features. Gestures are recognized continuously, which means that no prior
temporal segmentation is necessary. Training facilities are also available in order to build gesture
models from known segmented gesture sequences. In order for users to effectively reuse code and
build new modules that share common interfaces, a software framework was built that allows for
multimodal inputs and outputs, as well as for configuring the virtual world and how data coming
from the real world influences virtual world entities. The data flow originating from the real world
uses a common data format that is standard from the configuration file to the network packets. The
virtual world is modeled such that the actions that affect the virtual entities, given input data from
the real world, are configurable and extensible. Sharing the environment through the network is also
possible, allowing users at different locations to work on the same virtual world. Preliminary
tests of the gesture recognition performance are presented for several different input modalities
and setups. An experimental application is also described, showing the flexibility and extensibility
of the software framework.
Résumé
L'interaction homme machine (IHM) est un domaine de recherche duquel les travaux résultants
permettront dans le futur de fournir aux utilisateurs de systèmes informatiques des interfaces dont
l'utilisation sera plus naturelle. Ces interfaces devront idéalement se servir des mêmes moyens de
communication qui sont utilisés dans la vie de tous les jours, soit la parole, les gestes ou autres
expressions corporelles. Dans le cadre de ce mémoire, un module de reconnaissance de gestes dynamiques continus est présenté. Il permet de reconnaître des gestes exécutés avec des instruments
de suivi de positions et ce, sans segmentation temporelle préalable. Un module d'entraînement
de gestes est également disponible dans le but de construire des modèles de gestes à partir de
séquences préalablement segmentées. Dans le but de fournir aux utilisateurs du logiciel une flexibilité d'utilisation accrue ainsi que la possibilité de rajouter d'autres modalités d'entrée et de sortie
au système, un framework logiciel a été implémenté. Ce logiciel intégré modélise le monde virtuel
de manière à ce que non seulement les données soient standardisées dans tout le flot de données,
mais également afin de faciliter l'exécution d'actions déclenchées par des données provenant du
monde réel sur des entités du monde virtuel. Le logiciel rend aussi possible la communication entre
plusieurs nœuds d'un réseau, donnant aux utilisateurs le loisir de partager leur monde virtuel. Des
tests préliminaires sur la performance du module de reconnaissance de gestes vis-à-vis différentes
modalités d'entrée sont présentés ainsi qu'une application mettant en évidence les caractéristiques
de flexibilité et extensibilité du logiciel intégré.
Acknowledgements
I would like to thank my parents for the values they instilled in me, especially respect, excellence
and hard work. I would also like to thank my supervisor Dr. Jeremy R. Cooperstock for giving
me the opportunity to take part in the SRE research group and for reviewing my thesis. I also
acknowledge his financial support. I would also like to thank Dr. Denis Laurendeau and Dr.
Alexandra Branzan Albu, who welcomed me and supervised my work at Laval University during
the fall 2004 semester as an exchange student in the course of the QERRAnet program. Thanks
to Frank and Mike for correcting my thesis; I really enjoyed working with them on the famous
"Modellers' Apprentice" table. I would also like to thank my brothers (Réjean, Normand, Alain,
Gervais) and friends (Charles, Tardif, Ben, Phil, Filteau, Louis, PO) for the fun we have outside
school. Special thanks to Marie-Ève for her advice and perpetual smile. Finally, thanks to
NSERC for its financial support.
TABLE OF CONTENTS
Abstract
Résumé
Acknowledgements
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1. Introduction
1.1. Context of Research
1.2. Research Problem
1.3. Thesis Roadmap
CHAPTER 2. Literature Review
2.1. What Is a Gesture?
2.2. Motion Capture Hardware
2.3. Gesture Recognition Algorithms
2.4. Virtual Environment Software Architectures
2.5. Design Goals
CHAPTER 3. Gesture Recognition Module
3.1. Introduction to Hidden Markov Models
3.2. Choosing the Feature Vector
3.3. Training Algorithms
3.4. Continuous Gesture Recognition
3.4.1. Hypotheses Generation Algorithm
3.4.2. Gesture Spotting Algorithm
3.5. Gesture Input Modalities
3.5.1. Mouse-based Gesture Recognition
3.5.2. Glove-based Gesture Recognition
3.5.3. Vision-based Gesture Recognition
3.6. Choosing Gestures
3.7. Conclusion
CHAPTER 4. Software Framework
4.1. Overview of the Framework's Architecture
4.2. AData, Generic Data Container
4.3. Modalities
4.3.1. Input Modalities
4.3.2. Output Modalities
4.4. Input Tokens Principle
4.4.1. Input Token Example
4.5. World, World Entities and How to Manage Them
4.5.1. Components Description
4.5.2. Examples
4.6. Interaction Manager
4.7. Taking Advantage of the Context
4.7.1. Example
4.8. Network Manager
4.9. XML Configuration File
4.9.1. Input Modality Node
4.9.2. Output Modality Node
4.9.3. World Node
4.9.4. Action Node
4.9.5. Network Node
4.9.6. Discussion
4.10. Conclusion
CHAPTER 5. Results and Discussion
5.1. Continuous Dynamic Gesture Recognition under Several Conditions
5.1.1. Choice of Feature Vector
5.1.2. Mouse-Based Gesture Recognition Rate
5.1.3. Glove-Based Gesture Recognition Rate
5.1.4. Vision-Based Hand Gesture Recognition
5.2. The Context Grabber's Influence on the Recognition Rate
5.3. The Experimental Application
5.4. Framework's Performance With a Large Number of Entities
5.5. General Discussion and Limitations
CHAPTER 6. Conclusion and Future Work
6.1. Conclusions
6.2. Future Work
REFERENCES
APPENDIX A. XML Notation
APPENDIX B. Implemented Components
B.1. Actions
B.1.1. Action "moveCursor"
B.1.2. Action "reset"
B.1.3. Action "traceCursor"
B.1.4. Action "delete"
B.1.5. Action "placeImage2D"
B.1.6. Action "pick"
B.1.7. Action "drop"
B.1.8. Action "translate"
B.1.9. Action "rotate"
B.1.10. Action "undo"
B.1.11. Action "redo"
B.1.12. Action "showSystemInformation"
B.1.13. Action "stateChange"
B.1.14. Action "placeModel3D"
B.1.15. Action "putTexture"
B.2. World Entities
B.2.1. Image 2D
B.2.2. Model 3D
B.2.3. Mouse Cursor
B.2.4. Mouse Trajectory
B.2.5. Text 3D . .
B.3. Input Modalities
B.3.1. Glove-based Gesture Recognition
B.3.2. Mouse-based Gesture Recognition .
B.3.3. Vision-based Gesture Recognition
B.4. Output Modalities
B.4.1. Display 2D .
B.4.2. Display 3D .
B.4.3. Open Scene Graph Display 2D .
B.4.4. Open Scene Graph Display 3D .
APPENDIX C. Sample XML Configuration File
APPENDIX D. User Manual
D.1. Introduction
D.2. Prerequisites
D.3. Installation
D.3.1. Microsoft Windows Installation
D.3.2. Linux Installation
D.4. How to Extend the Framework?
D.4.1. Input Modality ...... .
D.4.2. Output Modality
D.4.3. World Entity
D.4.4. Action ...
D.4.5. World Hook
D.5. Putting it All Together
D.6. How to Use the Gesture Recognizer? .
D.6.1. Training Procedure ..
D.6.2. Recognition Procedure
D.7. Troubleshooting and Advice
LIST OF FIGURES
2.1 Taxonomy of gestures in HCI (this figure originates from Pavlović's review [70])
2.2 General architecture of a virtual environment software
3.1 Typical hidden Markov model and its constituents
3.2 Sample gesture features in two dimensions
3.3 Spotting of two circle gestures, whose localization ("X") is supposed to be at the circle's center. The spotted starting point is the dot's location.
3.4 Vision-based gesture recognition setup
4.1 UML framework's architecture
4.2 Detailed view of the data pipeline
4.3 AData library class structure
4.4 Input modality class structure example
4.5 Output modality class structure example
4.6 Sequence diagram for the rendering call on a three-dimensional OpenGL display output modality
4.7 Input tokens UML class representation
4.8 World, world entities and actions class structure
4.9 3D model configuration example
4.10 Interaction manager and auxiliary classes
4.11 Data saving and undo processes
4.12 Context grabber's class interface
4.13 Network manager and surrounding classes
4.14 Network manager's sequence diagram
5.1 Gesture set used for recognition tests
5.2 Large gesture set
5.3 Typical framework's application scene
5.4 Experimental application's gesture dialogue
5.5 Framerate as a function of the number of entities (debug version)
5.6 Framerate as a function of the number of entities (release version)
5.7 Interaction manager's processing time as a function of the number of entities
LIST OF TABLES
3.1 Comparison of different feature vector selections
5.1 Recognition rate for feature selection, including the number of insertions (# ins.) and substitutions (# subs.)
5.2 Mouse-based gesture recognition rate with improved trained HMMs
5.3 Mouse gesture recognition rate for a large number of possible gestures, including insertion and substitution errors
5.4 Recognition rate of glove gestures with original and improved models
5.5 Recognition results for a large number of gestures with and without the context grabber
D.1 Summary of software components that need to be specialized
D.2 Summary of the gesture trainer keyboard interface commands
CHAPTER 1
Introduction
1.1. Context of Research
Since the earliest room-sized computers, interaction between humans and computers has been an
issue spawning an important area of research and development in computer science and psychology,
called human-computer interaction (HCI). In this field, researchers are concerned with the design,
evaluation and implementation of interactive computing systems for human use. The result of this
research has led to several standardized metaphors and paradigms, which are optimized for certain
computing tasks.
Perhaps the most commonplace interaction technique in today's world depends on keyboards
and mice as input devices, and the WIMP (windows, icons, menus, and pointers) paradigm for graphical
interaction, despite this being an unnatural method of interaction with a computer, according to
several HCI studies [27,49,94].
A natural way to interact with a computer system would, ideally, allow users to communicate as
they do in everyday life with other people. Such a human-computer interface should support speech
recognition, track body motion, and understand cues regarding the user's emotions and intentions.
Many systems were developed to recognize speech [98], facial expressions [32], body actions [9], and
hand gestures [2]. Specific devices also exist that allow input of user movements with additional
degrees of freedom beyond those provided by a mouse [20,71,72].
Similarly, it is possible to affect a greater variety of senses than those currently engaged by the
majority of personal computer systems. In contrast, immersive environments, particularly virtual
reality, offer a more engaging visual representation. Systems such as the CAVE [24] or Immersadesk [26] use large projection surfaces to surround the user visually. Spatialized audio can be
rendered in order to simulate sound coming from definite positions. It is also possible to output
data that would affect the sense of touch with haptic devices [33] or even smells that would trigger
a person's feelings [61]. Interfaces that allow the user to communicate through multiple input and
output sources simultaneously are called multimodal interfaces.
Systems of this type are the aim of the Shared Reality Environment (SRE) [19]
at McGill University, in which this thesis was accomplished. The target system is a walk-in-and-use
environment, such that a user would be able to interact with the virtual world without extensive
prior training. The interface must therefore be natural to use and should, ideally, adapt to the user's
needs.
1.2. Research Problem
The work accomplished as part of this thesis is based on a set of problems in gesture recognition, described in the next paragraph, which lead to a second research topic that focuses on the
interpretation of gestures in a multimodal virtual environment context.
Several years ago, Wexelblat [101] formulated a set of questions for the gesture recognition
community, which, in large part, remain unanswered. Most of his concerns were related to the
"natural recognition" of gestures, as they are usually performed when people communicate with
each other, and problems such as feature detection and continuous gesture recognition. A feature
is defined as a point of interest in the data stream from the capture device (mouse, glove, video
camera). However, it is not yet known which features are the most relevant in describing a gesture
accurately. These features also depend on the input device that is used; hence no universal solution
is possible. Several different gesture recognition algorithms therefore exist in the literature and their
use depends on the context of the application.
The gesture recognition community is also facing the problems of how gestures are interpreted
by the system and how they affect the virtual world. It is not sufficient to detect the occurrence of
a gesture; the mapping of gestures to their effects in software is essential to having a usable system.
In order to ease the development process needed to build software with which users interact through
gestures, such a system needs to be equipped with a generic and flexible software architecture that
is suitable for arbitrary applications. It is also essential to provide facilities that allow for the
combination of modalities in a single expression, since gestures are not likely to be used alone when
naturally interacting with a virtual environment, unless the system understands a gesture language
akin to American Sign Language (ASL).
1.3. Thesis Roadmap
The work presented in this thesis is focused on input and output data communication through a
general interface that is suitable for a large number of modalities. The presented virtual environment
software architecture is novel since it integrates the data pipeline, which spans from input to output
modalities, with the virtual world management. This is accomplished by means of actions that
apply to world entities linked with predetermined "behaviours", given corresponding input events
and application rules. Additionally, the definition of an XML file format that describes a multimodal
system as well as the virtual world constitutes a novelty of the current system.
A continuous dynamic gesture recognition module was also implemented in the course of this
thesis, using hidden Markov models (HMMs) as a statistical classifier. The recognizer's implementation is therefore similar to other comparable systems. The choice of features of interest and the
gesture spotting algorithm that will be described in Chapter 3 is however original work that aims
at improving results obtained with the existing systems.
Chapter 2 presents a literature review on gestures in communication, offering insight regarding different methods for recognizing gestures in an HCI context. Different virtual environment
software is also reviewed, from which the proposed framework drew inspiration. The chosen gesture
recognition algorithm is then presented in Chapter 3, followed by a description of the different implemented gesture input modalities. An exhaustive description of the proposed software framework
follows in Chapter 4, with explanatory examples that justify the rationale behind the design decisions. Preliminary results are presented in Chapter 5, as well as an experimental application. The
thesis concludes with analysis and avenues worthy of further exploration in Chapter 6.
CHAPTER 2
Literature Review
This chapter presents a literature review of the major topics relevant to this thesis. Gestures
in communication are presented, followed by a review of gesture recognition techniques. Various
virtual environment software systems are then presented and reviewed.
2.1. What Is a Gesture?
Considerable research has been invested over the past few decades in order to improve the
interaction between humans and computers. One of the objectives of a user interface is that it should
be natural to use. Gestures have been shown to play an important role in everyday communications
between humans [56] in order to express emotions or to augment information conveyed through
other communication channels. Some examples of common culturally specific gestures would be the
"okay" sign, the "thumbs up" sign, the large amplitude gesture people make to catch a taxi, the
salutation gesture, and many others. Also, people tend to gesticulate in order to mimic concepts that
have a spatial dimension which cannot be as easily described with speech. An example of gestures
augmenting speech can be found when a person describes her weekend: "I killed a caribou that big",
while performing a gesture indicating the size of the caribou. One well-known use of gestures in
communication is sign language.
Several researchers studied sign language, particularly Stokoe [87], who defines the structure
of sign language as being described with a hand shape, a position, an orientation and some movement. Kendon [54] goes further by studying not only sign language, but every kind of gesture that
is performed in everyday life. He classifies gestures with the following categories: gesticulation,
language-like gestures, pantomimes, emblems and sign language, from the less structured semantics
[FIGURE 2.1. Taxonomy of gestures in HCI (this figure originates from Pavlović's review [70]): hand/arm movements divide into unintentional movements and gestures; gestures into manipulative and communicative; communicative gestures into acts (mimetic, deictic) and symbols (referential, modalizing).]
to the most structured. Many other researchers in the fields of psychology and linguistics have done
extensive research on the role of gestures in communication [13,56,73,74]. For a more detailed
survey on how gestures influence communication semantics, see McNeill [60], who examines the
role of gestures in relation to speech and thought.
Gestures can be classified into many categories: dynamic gestures, static gestures, body postures
and body actions. Some systems specialize in the recognition of hand posture in order to give
commands [6] or in recognizing human actions [62,88,103]. A taxonomy of gestures for HCI has
recently been proposed by Pavlović et al. [70] in order to classify different types of gestures by their
meaning and the kind of information they would provide to a computer system. This taxonomy
can be seen in Figure 2.1.
Manipulative gestures are used to mimic the manipulation of virtual objects. Unfortunately,
this mimicry does not include any sensory feedback from the object(s) being manipulated. Several
brands of haptic devices are commercially available that capture users' movements and apply force
feedback to the hand (CyberGrasp [21]) or active feedback on the entire arm (PHANTOM [92]).
However, these devices do not offer natural interaction since a user needs to be attached to some
sort of invasive tether or mechanics that remove any sense of naturalness.
On the other hand, communicative gestures do not require any external force feedback to be
used realistically, as they are performed in free space, as in everyday life. Mimetic gestures mimic
actions that need to be performed on objects (e.g. a circular gesture may mean rotate a particular
object). Deictic gestures, also known as pointing gestures, are heavily used while communicating
with other humans.
Abstract symbolic gestures usually represent an arbitrary action or object. There is not necessarily a natural mapping between a symbolic gesture and its meaning; therefore it should be the
user's choice as to the definition of each symbol. Baudel and Beaudouin-Lafon [2] argue that the
expected advantages of gesture interactions would be the naturalness of the interaction as well as
the richness of gestures and the direct interaction, removing the need for intermediate transducers.
They however point out several drawbacks that a gesture interface would have, namely fatigue, the
fact that gestures by themselves do not necessarily mean anything to the user, and more technical
problems such as the segmentation of gestures in a continuous recognition context. These disadvantages however suggest how the gesture set should be chosen and which technical challenges will be
the most difficult to solve.
2.2. Motion Capture Hardware
A gesture recognition system is a combination of hardware and software that first captures information from the real world, then analyzes this information and draws conclusions about what is
happening in the actual world. The features of interest in the input stream coming from the hardware can be hand positions when a user is performing a gesticulation, or can have more degrees of
freedom by capturing, for example, the position and curvature of individual fingers as well as hand
orientation and so forth. The number of degrees of freedom can increase dramatically if every single physical one is considered in the recognition (e.g. CyberGlove, 22 degrees of freedom for the
fingers [20]), but this provides more flexibility when the gesture is performed. More flexibility however adds more cognitive load on the user [95], particularly if the system does not use the device
effectively.
The next paragraphs describe existing hardware that is used to capture the movement of a
user's hands. Positioning devices akin to the Ascension Flock of Birds [23] or Polhemus [72] are used
to capture hand movement as well as other moving parts of the body while performing a gesture.
These devices typically use perpendicular magnetic field emitters and sensors in order to measure
and triangulate the position of the worn device. One problem with such material is the physical
tether that links the user to the system. This accessory can limit the user's movement and render
the interaction unnatural. However, wireless solutions reduce this impact.
Untethered environments naturally gain in popularity with the increasing computing power and
capability of today's computers. The majority of so-called free-hand systems use computer vision
as a source of data input. There are also other positioning devices such as the Vicon system [71]
that use passive infrared markers and cameras in order to position body parts accurately. These
are extremely accurate, but the equipment is prohibitively expensive. Some vision-based systems
detect skin coloured blobs and process them in order to extract useful knowledge of the real world
scene. Others use colour blob and marker detection in order to compute features' positions. These
markers can be located on the user as in Iwai's work [51] or on an external device such as the
VisionWand [15].
Other devices use hand movement and a real-world metaphor in order to navigate through the
environment such as the control action table (CAT) [45], which is a steering-wheel-like device. Other
systems use touch screens or similar kinds of devices in order to position a pen device in two
dimensions on a working surface, namely the Tablet PC [47].
2.3. Gesture Recognition Algorithms
Gesture recognition software includes several components that depend on the type of hardware
that is chosen. For vision-based gesture recognition systems, the first step in the processing pipeline is
the image analysis, which is meant to extract distinguishable features from a large data set. There
are essentially two ways of solving the problem of feature detection: model-based detection and
appearance-based detection. The principle of model-based detection is to analyze images and detect
interesting elements whose virtual representation is known, based on a predetermined constrained
model. For example, in some systems, colour blobs are detected, and in others skin blobs are isolated.
These coloured blobs are known to be associated with hands, head, face or other body parts that
match an a-priori known model of a user in a given environment. Blob detection methods rely on
accurate tracking since the system needs to know where they are located at every moment. Hence,
accurate and reliable tracking algorithms such as CONDENSATION [7], Kalman filtering [67],
CAM-shift [106] or mean shift tracking [57] must be used.
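As a concrete illustration of the tracking step, the sketch below implements a constant-velocity Kalman filter in NumPy for smoothing noisy blob centroid detections. The state layout and all noise parameters are illustrative assumptions, not values taken from any of the cited systems.

```python
import numpy as np

# Constant-velocity Kalman filter tracking a blob centroid.
# State: [x, y, vx, vy]; measurement: noisy (x, y) detection.
dt = 1.0
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)   # state transition
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)   # only position is observed
Q = 0.01 * np.eye(4)                        # process noise (assumed)
R = 4.0 * np.eye(2)                         # measurement noise (assumed)

def kalman_step(x, P, z):
    # Predict the state forward one frame
    x = F @ x
    P = F @ P @ F.T + Q
    # Correct with the new measurement z
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P
    return x, P

rng = np.random.default_rng(1)
x = np.zeros(4)
P = 10.0 * np.eye(4)
true_pos = np.array([0.0, 0.0])
vel = np.array([2.0, 1.0])
for _ in range(50):
    true_pos = true_pos + vel * dt
    z = true_pos + rng.normal(scale=2.0, size=2)  # noisy blob detection
    x, P = kalman_step(x, P, z)

print(x[:2], true_pos)  # the filtered estimate tracks the true centroid
```

After a few dozen frames the velocity estimate converges as well, which is what makes the predict step useful for bridging frames in which the blob detector fails.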
7
2.3 GESTURE RECOGNITION ALGORITHMS
As for the appearance-based methods, an operation is computed on the entire image in order
to find general characteristics that would match predetermined models. This processing is generally
a variant of optical flow [25] that is used to compute the movement in the image sequence. Vector
coherence mapping (VCM) [75] is known to extract motion fields in videos, imposing several constraints on the resulting field. Motion history image (MHI) [8] is a method for finding temporal
changes in a video sequence, thus keeping track of every changing pixel in a history map.
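The MHI idea can be sketched in a few lines: pixels that changed in the most recent frame receive a maximal timestamp value, while older motion fades out step by step. The frame size, change threshold and decay length below are arbitrary illustrative choices, not parameters from the cited method.

```python
import numpy as np

def update_mhi(mhi, prev_frame, frame, tau=30, diff_thresh=25):
    """Update a motion history image with one new grayscale frame.

    Pixels that changed are set to the maximum timestamp value tau;
    unchanged pixels decay by one step, leaving a fading trace of motion.
    """
    motion = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16)) > diff_thresh
    return np.where(motion, tau, np.maximum(mhi - 1, 0))

# Toy sequence: a bright square moving one pixel right per frame.
h, w = 32, 32
frames = []
for t in range(5):
    f = np.zeros((h, w), dtype=np.uint8)
    f[10:20, 5 + t:15 + t] = 255
    frames.append(f)

mhi = np.zeros((h, w), dtype=np.int16)
for prev, cur in zip(frames, frames[1:]):
    mhi = update_mhi(mhi, prev, cur)

# The most recently changed columns hold the value tau; motion from
# earlier frames has decayed, encoding the direction of movement.
print(mhi.max())  # 30
```

Reading the gradient of such a map gives the direction of motion, which is what makes the MHI usable as a compact temporal feature.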
The next step in the processing pipeline is feature extraction. It is an important operation
because it allows classification algorithms to be tractable computationally by reducing the amount
of data to consider. There are several possible operations to perform on the data in order to keep
only the most relevant features. A well-known method is the principal components analysis (PCA),
which is used to keep the feature sets with the largest variance. PCA is particularly suitable for
appearance-based methods, or when the amount of input data is very large. In the case of a small
number of processed features (e.g. model-based vision systems), it is possible to compute relevant
features using raw input data [82].
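PCA-based feature reduction can be sketched as follows: centre the data, eigen-decompose its covariance matrix, and project onto the top-k eigenvectors. The synthetic 10-dimensional data set below is invented for illustration.

```python
import numpy as np

def pca_reduce(X, k):
    """Project n samples of d-dimensional features onto the top-k
    principal components (the directions of largest variance)."""
    Xc = X - X.mean(axis=0)                 # centre the data
    cov = np.cov(Xc, rowvar=False)          # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    return Xc @ top                         # n x k reduced features

# Hypothetical example: 200 noisy 10-D feature vectors that really
# vary along only two latent directions.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

Z = pca_reduce(X, k=2)
print(Z.shape)  # (200, 2)
```

Here two components retain essentially all of the variance, so a classifier sees 2-D instead of 10-D inputs, which is exactly the tractability gain described above.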
The classification phase, also called the recognition process, is one of the critical parts for
a reliable gesture recognition system. It is possible to recognize many types of gestures: static,
dynamic or both at the same time. For static gestures, model matching is usually employed in
order to compare incoming data with a previously trained template. For instance, artificial neural
networks (ANN) can classify incoming data given some previously trained network models.
For dynamic gesture recognition, several statistical methods can be used. Dynamic time warping
(DTW) is employed to align the incoming data stream with a template [28]. Time-delay neural
networks (TDNN), the dynamic version of ANN, can classify incoming data given a large amount
of training data [104]. Variants of TDNN have also been developed [58]. CONDENSATION-based
gesture recognition can be used in order to match a dynamic CONDENSATION model with the
incoming data [7]. One of the most popular statistical classifiers used for gesture recognition is
hidden Markov models (HMMs). These have been successfully applied to speech recognition and
can be adapted for gesture recognition [76].
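The one-HMM-per-gesture classification scheme can be sketched with the scaled forward algorithm: each gesture model scores an observation sequence of quantized movement directions, and the recognizer picks the highest-scoring model. The two toy cyclic models and all their parameters below are invented for illustration and do not come from any cited system.

```python
import numpy as np

def log_likelihood(obs, pi, A, B):
    """Scaled forward algorithm: log P(obs | HMM) for an HMM with initial
    distribution pi (N,), transitions A (N, N) and discrete emissions B (N, M)."""
    alpha = pi * B[:, obs[0]]
    log_p = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        log_p += np.log(alpha.sum())  # accumulate the scale factors
        alpha /= alpha.sum()          # rescale to avoid numerical underflow
    return log_p

# Two 4-state cyclic models over 4 direction symbols (0=E, 1=N, 2=W, 3=S):
# one sweeps through the directions counterclockwise, the other clockwise.
n = 4
pi = np.full(n, 1.0 / n)
B = np.full((n, n), 0.05)
np.fill_diagonal(B, 0.85)  # state i mostly emits direction symbol i
ccw = 0.4 * np.eye(n) + 0.6 * np.roll(np.eye(n), 1, axis=1)   # state i -> i+1
cw = 0.4 * np.eye(n) + 0.6 * np.roll(np.eye(n), -1, axis=1)   # state i -> i-1
models = {"circle_ccw": (pi, ccw, B), "circle_cw": (pi, cw, B)}

obs = [0, 0, 1, 1, 2, 2, 3, 3]  # quantized directions sweeping E -> N -> W -> S
best = max(models, key=lambda g: log_likelihood(obs, *models[g]))
print(best)  # circle_ccw
```

In a real recognizer the model parameters would come from Baum-Welch training on segmented examples, and the argmax would additionally be compared against a rejection threshold so that non-gesture movement is not forced into a gesture class.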
Speech and gestures share common characteristics that justify the use of HMMs for the current
application. Both involve features that change over time and patterns that are repeated. In speech
recognition, an HMM is associated with every phoneme, from which researchers were inspired in
order to define "gesture phonemes" or "cheremes" for American Sign Language (ASL) [30]. However,
linguists have not yet agreed on a common set of gesture quanta of which every gesture occurrence
would be composed [97]. Therefore, in this thesis every gesture is associated with an HMM. Some
researchers use the classic representation of a hidden Markov model, whereas others try to add more
specificity and thus more reliability in the classification of the models in given contexts by creating
variants, such as coupled HMMs (CHMMs) [12] or parametric HMMs (PHMMs) [102].
The majority of the cited systems achieve recognition rates above 90 percent, the recognition rate being defined as the number of recognized gestures over the number of performed gestures. Most of them are, however, tied to a particular application and therefore not suitable for general-purpose use. Because of those limitations, virtual environment software architectures were investigated in the course of this thesis. As presented in the next section, the general architecture of virtual environment software suggests that building user-specific modules from the basic building blocks would be simplified. The ideal system would aim at adapting a maximum number of gesture recognizers to an interface, allowing users to implement their particular applications. An important design goal of the current system, flexibility, therefore requires removing any dependency on the hardware employed by a given gesture recognizer, leading to an abstract input modality interface that needs to be specialized. The same scheme should be applied to output modalities, as well as to all of the architecture's abstract components.
2.4. Virtual Environment Software Architectures
An important aspect of virtual environments is how the software is put together in order to build upon existing software components to accommodate future applications. Object-oriented software frameworks provide a set of abstract classes from which a user inherits in order to include application-specific code. A framework is therefore a large piece of code that is extensible for particular applications, while providing building blocks for theoretically any supported type of application. A general schematic of the composition of virtual environment systems can be seen in Figure 2.2. Many systems exist that try to provide flexible ways of building virtual environment applications, such as VRPN [90], VR-Juggler [4], Tandem [40], and DIVE [18].
VR-Juggler [4] is intended to be a development environment for virtual reality applications. It provides a set of class interfaces that a user extends in order to use specific devices. It is also configurable through a graphical user interface (GUI) written in Java, which uses CORBA to communicate with the VR-Juggler kernel written in C++ [46].

FIGURE 2.2. General architecture of virtual environment software: input devices, output devices, network communications, world data management and the application context
VRPN (Virtual Reality Peripheral Network) [90] is a set of classes used to provide a transparent network layer between devices and user applications. It is not intended to serve as visualization software, but rather as an input manager that connects remote devices transparently for a user of the toolkit.
DIVERSE (Device Independent Virtual Environments - Reconfigurable, Scalable, Extensible) [53] is a virtual environment toolkit rather than a framework. It provides a distributed shared memory of the state of the world for every remote instance of the environment, as well as an abstraction for the input and output layers.
Other frameworks include Avango (Avocado [93]), MASSIVE [41], Tandem [40], and blue-c [64], which have different design goals with respect to flexibility and scalability of physical input and output devices, and shared memory over the network. In the majority of these numerous existing frameworks, the emphasis is put on the abstraction of devices (input and output) as well as on network communication. In the current work, more effort was put into facilitating the way users configure a multimodal framework and model the virtual world in a flexible manner, allowing input modalities such as gestures to act on what is rendered in the real world.
2.5. Design Goals
The goal of this project was ultimately to create vision-based gesture interaction software to be used in immersive virtual environments. Video input was chosen because it allows a user to act on a virtual environment without wearing any device that would be tethered to the computer.1 It enhances the sense of immersion and provides the user with a better virtual experience in a CAVE-like projection system. At the outset of this research on gesture recognition algorithms, it was noticed that many systems exist to recognize human gesticulation in different situations with a large number of hardware devices. However, generic systems that involve gestures are not common in the literature. The generality requirement can be defined as a flexible and easy way to configure how input gestures will influence the virtual world with which a user is interacting.
To push this further, the framework was implemented with the capability of supporting additional inputs, such as speech, multiple gesture recognition devices or the keyboard. However, the problem of combining multiple input modes was left for ongoing work. Additionally, a virtual world model was implemented in order to provide users with a flexible way to build an environment with several types of entities that are to be rendered in several different output modalities (e.g. 3D display, 2D display, spatialized sound system). Network communications were implemented to share a virtual world among two or more computers, providing facilities to maintain coherence and allowing distributed input and output modalities to be instantiated, such that a remote computer could send events to its clients. The evaluation of the networking module was not part of this thesis and is left as future work.
The principal design goals and requirements that led to the decisions regarding the software and general architecture design are:
• vision-based gesture recognition system
• visualization of the virtual world using three dimensional views
• flexible way to configure how gestures will influence the virtual world
• multimodal capabilities (input and output)
• networked virtual environment maintaining world coherence
1 In the present case, users have to wear coloured gloves.
CHAPTER 3
Gesture Recognition Module
In this chapter, a description of the continuous dynamic gesture recognition module is presented, including an introduction to the chosen statistical classifier, the hidden Markov model. Details of the associated training and recognition algorithms are also provided, followed by a word on the method used to select features, and ending with a description of the three implemented gesture input modalities. In the context of this thesis, a feature is an interesting data point calculated from the raw input data stream. The remaining sections of this chapter describe original work on the choice of feature vectors, the gesture spotting algorithm and the implementation of three gesture recognition input modalities (mouse, data glove and video camera).
3.1. Introduction to Hidden Markov Models
Hidden Markov models (HMMs) were chosen for representing gesture models in this thesis because they allow for both spatial and temporal variations in the input data. HMMs are also well known in the literature, and implementations are freely available [36,39,79,91]. Previous work on gesture recognition shows that reasonable recognition rates can be obtained using a classical HMM implementation, with discrete or continuous gestures [1,14,78,80,86,89]. Another advantage of HMMs is that, given the chosen features, the state sequence can have a physical meaning that the spotting algorithm can take advantage of, unlike neural networks, which do not have any meaningful internal structure [29].
Hidden Markov models have been used for a long time in the scientific community, but have recently enjoyed a gain in popularity due in part to the field of automatic speech recognition [76]. An HMM is a collection of random variables with an appropriate set of conditional independence properties [5]. An HMM can also be described as a state diagram whose states are unknown ("hidden"), each of which has an associated emission probability distribution, with probabilities linked to the transitions between those states. For extensive reviews and applications of hidden Markov models, several tutorials are available [5,38,76,77].
More formally, a hidden Markov model λ, composed of N hidden states, can be defined by λ = (A, B, π), whose parameters are as follows:1
• A = {a_ij} is the transition probability matrix, where a_ij = P[q_{t+1} = S_j | q_t = S_i] is the transition probability from state i to state j, S = {S_1, S_2, ..., S_i, ..., S_N} is the set of individual states and q_t is the state at time t.
• B = {b_j} is the set of emission probability distributions, each composed of n continuous Gaussian mixtures that give the observation probability b_j(x) at a given state q_t = S_j for an observation value O_t = x at a given time t. Depending on how sparse the training data is, a model will be composed of one or more continuous output observation Gaussian mixtures. It is also possible to have discrete observation data, but only continuous observations are considered in this discussion.
• π = {π_1, π_2, ..., π_N}, where π_i = P[q_1 = S_i] is the initial probability of each state.
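To make these definitions concrete, the sketch below represents λ = (A, B, π) and scores an observation sequence with the standard forward algorithm. This is a minimal illustration, not the LTI-Lib implementation used in this thesis: emissions are reduced to a single univariate Gaussian per state (a one-component mixture), and the class name and fields are assumptions made for the sketch.

```python
import math

class HMM:
    """Minimal continuous HMM lambda = (A, B, pi), with one univariate
    Gaussian per state standing in for the Gaussian mixture b_j(x)."""
    def __init__(self, A, means, variances, pi):
        self.A = A                  # N x N matrix, A[i][j] = P(q_{t+1}=S_j | q_t=S_i)
        self.means = means          # per-state Gaussian mean
        self.variances = variances  # per-state Gaussian variance
        self.pi = pi                # initial state probabilities pi_i

    def b(self, j, x):
        """Emission probability b_j(x) of observing x in state j."""
        v = self.variances[j]
        return math.exp(-(x - self.means[j]) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

    def forward(self, observations):
        """Task 1: probability of an observation sequence given the model."""
        N = len(self.pi)
        alpha = [self.pi[i] * self.b(i, observations[0]) for i in range(N)]
        for x in observations[1:]:
            alpha = [self.b(j, x) * sum(alpha[i] * self.A[i][j] for i in range(N))
                     for j in range(N)]
        return sum(alpha)
```

A two-state left-right model, for instance, could be instantiated with A = [[0.7, 0.3], [0.0, 1.0]] and scored against a short observation sequence.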
There are three standard ways in which HMMs are used in practice:
(1) to determine the probability of occurrence of an observation sequence given a hidden Markov model.
(2) to determine the most probable state sequence explaining an observation sequence for a given hidden Markov model.
(3) to adjust the model parameters in order to maximize the likelihood of given observation sequences.
The first task is useful for assigning a score to each considered model, given an observation sequence. It will be used in the recognition stage of the gesture recognition system. The second task reveals the underlying structure of hidden states that occur for a given sequence and might be useful to characterize a gestural expression. However, the choice of features in the current thesis does not require knowing the state sequence. The third task allows hidden Markov models to be built using
1 From this point on, continuous hidden Markov models with Gaussian mixture emission probabilities are considered. For a detailed description of discrete hidden Markov models, see Rabiner [76].
training data. The goal of this offline process is to optimize the HMM parameters in order to find the best fit with the given training data.

FIGURE 3.1. Typical hidden Markov model and its constituents
Tasks 1 and 3 will be described in further detail in the following sections. In the current implementation, each gesture is associated with an HMM whose number of states depends on the training phase, up to a maximum of seven states.2 In the present case, the transition matrix is very sparse, with non-zero entries only for the transitions from a state to itself and to its immediate successor. This specialization of the hidden Markov model is called a linear or left-right HMM and can be represented as in Figure 3.1, where the arrows entering a circle (state) are transition probabilities.
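A left-right transition matrix of this kind can be sketched as follows. This is an illustrative construction, not taken from the thesis implementation; the self-transition probability is an arbitrary assumed value.

```python
def left_right_matrix(n_states, p_stay=0.6):
    """Build a linear (left-right) HMM transition matrix: each state may
    either stay where it is or advance to its immediate successor."""
    A = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states - 1):
        A[i][i] = p_stay             # self-transition
        A[i][i + 1] = 1.0 - p_stay   # advance to the next state
    A[n_states - 1][n_states - 1] = 1.0  # final state absorbs
    return A
```

Every row sums to one, and all entries off the diagonal and first superdiagonal are zero, which is exactly the sparsity pattern described above.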
3.2. Choosing the Feature Vector
Before hidden Markov models can be used, an operation must extract the most meaningful information from the input data stream. This operation produces the feature vector, which constitutes the data passed to the hidden Markov model as observation vectors. In the current context, the raw input data comes from the mouse, data glove or video cameras. The gesture recognition system is based on position capturing systems so that these devices can all use the same recognition algorithm, for later comparison of results. The format of these input streams is a vector of floating point numbers, either two- or three-dimensional.
Different kinds of feature vectors have been considered in order to determine which is best suited for gesture recognition in the current context. Some characteristics that the feature stream must provide to a corresponding trained model are the following:
2 Seven is an arbitrary value, determined empirically.
• the model must be translation independent in order to be able to perform the gesture anywhere in the environment. The target position of the gestural expression can be recovered by looking at the data points that constitute the gesture. It is also possible to find an "application point", or gesture localization, by averaging the positions of all gesture points.
• the model must be velocity independent in order to be able to recognize a gesture regardless of the execution time needed to accomplish it. The velocity of execution can be recovered by looking at how many samples the gesture is composed of, given a fixed sample rate.
• the model must be size independent in order to recognize gestures spanning arbitrarily large areas. The size of a gesture can be recovered by looking at the bounding box of a particular sequence.
Features      Independent of                Dependent on                  Comment
x, y, z       Velocity                      Translation, size, rotation   User should always perform the gesture at the same place
dx, dy, dz    Translation                   Velocity, rotation            Velocity dependence not wanted
r, θ, φ       Translation, size             Rotation, velocity            Velocity independent if r is not used
dr, dθ, dφ    Translation, size, rotation   Velocity                      Rotation independence not wanted
TABLE 3.1. Comparison of different feature vector selections
These requirements aim at reducing the actual number of gesture models. For instance, two circles performed at different velocities and sizes will be recognized as the same gesture, but will have different size and velocity parameters when passed to the interaction manager, as described in Section 4.6. These requirements tend to eliminate several possible features, as seen in Table 3.1.
One feature vector that has proven to be appropriate is the direction of the vector spanned by two consecutive positions, as seen in Figure 3.2. An input data vector at time t will be denoted p_t = (x_t, y_t, z_t). In 2D, the angle is calculated with the following equation:

θ = arctan((y_t - y_{t-1}) / (x_t - x_{t-1}))    (3.1)

In 3D, the two angles calculated are:

θ = arctan((y_t - y_{t-1}) / (x_t - x_{t-1}))    (3.2)
FIGURE 3.2. Sample gesture features in two dimensions

φ = arccos((z_t - z_{t-1}) / r), where r = ‖p_t - p_{t-1}‖    (3.3)
This feature vector is velocity independent, since the movement vector's magnitude r is not taken into account. It is also size independent for the same reason. The feature vector's only dependence is on rotation, which is in fact desirable because the rotation parameter cannot be recovered as easily as the velocity and size.
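Equations 3.1-3.3 can be sketched directly. This is a minimal illustration; `atan2` is used instead of a plain arctangent so that the quadrant of θ is preserved, an implementation detail the equations leave implicit.

```python
import math

def features_3d(p_prev, p_cur):
    """Compute the (theta, phi) direction features of Equations 3.2-3.3
    from two consecutive 3D positions p_{t-1} and p_t."""
    dx, dy, dz = (c - p for c, p in zip(p_cur, p_prev))
    r = math.sqrt(dx * dx + dy * dy + dz * dz)  # r = ||p_t - p_{t-1}||
    theta = math.atan2(dy, dx)                  # Eq. 3.2 (quadrant-aware)
    phi = math.acos(dz / r)                     # Eq. 3.3
    return theta, phi
```

Note that scaling the displacement leaves both angles unchanged, which is precisely the size and velocity independence claimed above.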
Lee [59] takes a similar approach, choosing the angle of the vector spanning the movement as a feature, except that he quantizes the angles into sixteen steps. An advantage of quantizing values is that computing the observation probability does not require Gaussian calculations, but in some cases it might be hard to distinguish two very similar gestures. Cao [15] suggests quantizing the area that a gesture spans, such that every point of a gesture sequence is located in a quantized area. Using this type of feature, it is possible to recover complex speed-dependent gestures. However, for continuous gesture recognition, extra processing is needed in order to correctly segment gesture start and end points with an arbitrary number of frames considered for recognition. Similar results concerning three-dimensional feature selection can be found in Campbell's work [14].
3.3. Training Algorithms
The goal of a training algorithm is to build a gesture model for later recognition, given known pre-segmented gestures. A well-known HMM training method is the Baum-Welch or forward-backward algorithm [5,76]. It finds the model parameters that maximize the probability of occurrence of given observation sequences. One drawback of the Baum-Welch method is that it optimizes the model parameters over all possible state sequences, rather than only considering the most likely one [52]. Another HMM training algorithm proposed by Juang and Rabiner [52], called the segmental K-means algorithm, alleviates this problem. Instead of finding the best model that matches observations over all state sequences, it finds model parameters that optimize the score only for the best state sequence found with the Viterbi algorithm [34], which also returns the associated score. The HMM parameters are then re-estimated until they converge to optimal values [52], given a certain threshold. LTI-Lib [79] was chosen as an appropriate implementation of hidden Markov model data structures and their associated algorithms.3 The HMM training method used in the current implementation is the segmental K-means algorithm.
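The Viterbi step at the core of segmental K-means can be sketched as follows. This is an illustrative log-domain implementation over a discrete emission table, not the LTI-Lib code used in this thesis; continuous Gaussian emissions would replace the table lookup.

```python
import math

def viterbi(pi, A, B, obs):
    """Most likely state sequence and its log-score for an observation
    sequence `obs`, given initial probabilities pi, transition matrix A
    and a per-state emission table B[state][symbol]."""
    N = len(pi)
    log = lambda p: math.log(p) if p > 0 else float("-inf")
    delta = [log(pi[i]) + log(B[i][obs[0]]) for i in range(N)]
    back = []
    for x in obs[1:]:
        prev = delta
        # For each destination state j, record the best predecessor state.
        step = [max(range(N), key=lambda i: prev[i] + log(A[i][j])) for j in range(N)]
        delta = [prev[step[j]] + log(A[step[j]][j]) + log(B[j][x]) for j in range(N)]
        back.append(step)
    best = max(range(N), key=lambda j: delta[j])
    path = [best]
    for step in reversed(back):  # trace the best path backwards
        path.append(step[path[-1]])
    path.reverse()
    return path, delta[best]
```

Segmental K-means would then re-estimate each state's parameters from the observations that the returned path assigns to it, and iterate until the score converges.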
The training process for the gesture recognition system relies on several repetitions of a given gesture as input to the trainer.4 This process is also called supervised learning, since the user who provides the data knows how reliable it is. Providing isolated gestures is a requirement for such a training process, which involves an external clutching mechanism that notifies the system when a gesture starts and ends. When using a mouse for gesture input, the depression and release of the mouse button offers such a mechanism explicitly. However, in the case of vision-based gesture recognition, which is not supposed to use any external device for data grabbing, a user needs a way to indicate the beginning and end of a gesture. Hence, a keyboard-based user interface has been implemented that allows one to manage the training process: creating new gesture models, indicating begin and end points of gestures, saving and loading files, and deleting models. The training process is thus different from the continuous recognition stage, as the latter does not need to be told the beginning and end points of gestural expressions.
3.4. Continuous Gesture Recognition
This section describes how gestures are automatically recognized and extracted from an input data stream. Continuous gesture recognition has recently been studied [16,101,102], and is defined in the literature as a process by which gestures are isolated from an input data stream without needing to specify start and end gesture points explicitly. Kendon found that when a gesture is performed, there is a preparation and a retraction phase that are relatively easy to detect [54]. Quek [73] establishes several rules, based on the observation of expressive gestures, that constrain the
3 LTI-Lib is an open-source object-oriented library built in C++ that uses STL containers as a basis for its data structures. This library provides algorithms and data structures that are often used in computer vision.
4 The number of repetitions is typically on the order of twenty, so that users are not overloaded.
recognized gestures to have a certain start and end position, as well as predefined movement phases, in order to help gesture segmentation. One goal here is not to impose any starting position or constraint on gestures, since they must be performed as naturally as possible. Hence, there is a need for a gesture spotting algorithm that detects the end of the most probable gestures and draws conclusions accordingly. The next sections present the algorithm that is used to generate hypotheses from the input data and known gesture models, and a description of the gesture spotting algorithm.
3.4.1. Hypotheses Generation Algorithm. The usual way to recognize an isolated
gesture is to apply the Viterbi algorithm to every trained model and to select the one with the largest
score.5 Obviously, the normalized highest score should be higher than a normalized threshold in
order to be valid.
For a continuous recognition system, it is unfortunately impossible to apply the Viterbi algorithm directly, because the starting point of a gesture is unknown. The system must therefore generate hypotheses every time a new feature vector is added to the recognizer's feature vector input stream. Gesture hypotheses are ranked using an inverse log-likelihood scoring strategy, keeping the best gesture end hypothesis at every time step. A gesture end hypothesis is created when a gesture hypothesis has reached the last state of its associated HMM. The procedure as implemented in LTI-Lib is summarized in Algorithm 3.1.
Algorithm 3.1 Gesture hypotheses generation
  expand the current best hypothesis to obtain a reliable pruning value
  active hypotheses ⇐ generate new hypotheses from every known valid hidden Markov model
  perform the Viterbi step on all active hypotheses
  new pruning threshold ⇐ prune the active hypotheses with bucket sort
  keep track of the gesture end hypotheses in a trace-back field
For every input feature vector, new hypotheses are generated and a score is calculated in order to prune hypotheses that are not likely to have occurred. The pruning threshold is calculated such that a limited number of hypotheses is maintained.6 Even if the scores are not on the same scale, they are compared against each other, which typically increases the likelihood of longer gestures.
5 In the present case, the inverse of the log-probability is used; the most likely model is therefore the one with the lowest score.
6 Typically, 100 hypotheses are kept.
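The pruning step can be sketched as follows. This is an illustrative version only: hypothesis expansion and the bucket sort used by LTI-Lib are abstracted into a simple top-K selection, and the hypothesis tuple layout is an assumption made for the sketch. Scores are inverse log-likelihoods, so lower is better.

```python
def prune_hypotheses(hypotheses, max_kept=100):
    """Keep only the best `max_kept` hypotheses. Each hypothesis is a
    (score, gesture_id, start_time) tuple scored by inverse
    log-likelihood, so lower scores are more likely."""
    ranked = sorted(hypotheses, key=lambda h: h[0])
    kept = ranked[:max_kept]
    # The new pruning threshold is the worst score still retained.
    threshold = kept[-1][0] if kept else float("inf")
    return kept, threshold
```

The returned threshold plays the role of the "new pruning threshold" in Algorithm 3.1: any hypothesis generated later that scores worse can be discarded immediately.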
3.4.2. Gesture Spotting Algorithm. Considering hypothesis scores exclusively does not allow for detecting a gesture and its endpoints. A fixed threshold on hypothesis scores would not provide reliable recognition, since these scores vary for every gesture model. A gesture spotting algorithm is therefore needed in order to isolate a gestural expression. Lee [59] developed a threshold-based gesture spotting algorithm that takes every node of all known hidden Markov models to form a threshold model. Its conditions for spotting gestures are well defined, but extra processing is necessary in order to compute the threshold model's score, which can slow down performance. An alternative algorithm that uses the topology of hidden Markov models was developed in the course of this thesis. Gestures are spotted based on Algorithm 3.2, wherein a gesture end refers to the termination of a gesture.
Algorithm 3.2 Gesture spotting
  initialize a stack of gesture end hypotheses
  if the current best hypothesis has reached the last state of its model and it is the most likely gesture then
    push the hypothesis on the stack
  if the hypothesis becomes less likely, or the best hypothesis' inner state is not the last model state then
    the gesture is spotted and the stack is emptied
  if the most likely hypothesis is not the one on the top of the stack then
    if the most likely hypothesis and the hypothesis on the top of the stack have the same size then
      pop the stack and push the new hypothesis
    else
      spot the gesture and empty the stack
The topology of an HMM, with the movement's angle as the feature vector, gives an idea of the gesture's shape, because each state represents a part of the gesture. When the end state is reached, the current best gesture hypothesis is likely to be true. The latter is a valid assumption for gestures whose complexity is such that they are unlikely to occur during random movements of a user. The end of a gesture sequence is found using Algorithm 3.2.
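A simplified reading of this spotting logic can be sketched as follows. The class, the hypothesis fields and the same-size replacement rule are assumptions made for the sketch; the actual implementation tracks considerably more hypothesis state.

```python
class GestureSpotter:
    """Simplified sketch of the stack-based spotting rule: track the best
    gesture end hypotheses over time, and report a spotted gesture when
    the best hypothesis stops being a gesture end."""
    def __init__(self):
        self.stack = []    # pending gesture end hypotheses
        self.spotted = []  # gestures reported so far

    def update(self, best):
        """`best` is the current best hypothesis: a dict with 'gesture',
        'at_last_state' and 'length' keys (assumed fields)."""
        if best["at_last_state"]:
            if self.stack and self.stack[-1]["length"] == best["length"]:
                self.stack[-1] = best  # same-size hypothesis replaces the top
            else:
                self.stack.append(best)
        elif self.stack:
            # The best hypothesis left its model's end state: spot the gesture.
            self.spotted.append(self.stack[-1]["gesture"])
            self.stack.clear()
```

Feeding the spotter one best hypothesis per time step yields a spotted gesture each time the recognizer's best candidate falls out of its model's end state.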
On the other hand, the start point of the gesture hypothesis is known, since its length is incremented every time a new feature vector is added. However, the starting point is not always accurate, especially when the first HMM state is the most likely for a long period of time, as illustrated in the left of Figure 3.3, in which the gesture was performed with a long preceding trail. Incorrect starting point detection occurs because of the similarity between the trail and the most likely feature vector of the first HMM state, leading to false gesture localization. This is a consequence of the hypotheses pruning algorithm, which favors longer gestures.

FIGURE 3.3. Spotting of two circle gestures (left: wrong spotting, right: correct spotting), whose localization ("X") is supposed to be at the circle's center. The spotted starting point is the dot's location
There are several solutions to this problem, one of which would be to impose user constraints on the manner in which the beginning of a gesture should be performed, as seen in the right of Figure 3.3. This would, however, increase the cognitive load on users, since they would have to remember to perform a special notch in the trajectory before performing the actual gesture. A compromise would be to indicate to the user at every moment where the system thinks the starting point of a gesture is located. This could easily be done by drawing an indicator of the starting point on the gesture's fading tail trajectory. Users would then be able to predict when a spotting error is about to occur and take corrective measures accordingly. The ideal solution is, however, to use gesture models that are sufficiently different from random movement that no confusion is possible for the recognizer.
3.5. Gesture Input Modalities
The following section presents various input modality software modules that were implemented
over the course of this thesis. It should be noted that every implementation shares the same interface,
as specified in the framework's description in Chapter 4.
3.5.1. Mouse-based Gesture Recognition. A mouse is a popular pointing device used in everyday life by most people who work with desktop computers. It is in fact a two-dimensional relative hand position tracker. In many applications, such as web browsers [63,85] or games [48], mouse gestures are used to perform common operations (e.g. forward, back, new window, expressing a creature's emotions). However, in some of these systems, gestures are triggered by a mouse button; they are therefore not continuously recognized. Continuous gesture recognition can be used to detect motion patterns that a user performs when moving the mouse on the desktop or in the virtual world. Unlike usual systems, a continuous gesture recognition system should not, by definition, use mouse buttons to trigger the starting and ending points of the performed gesture.
Multiple mice can be used in the current implementation, which makes bimanual mouse interaction possible. For that purpose, the RawMouse [22] Windows API must be called, whose methods are only available on Windows XP. They are used to bypass software drivers in order to receive raw movement and button status from multiple mice. Similarly, on UNIX systems, X events can be used to capture mouse movement from several mice at a time. In order not to overload the gesture recognizer, the mouse frame rate is downsampled from 125 Hz to a maximum of 30 Hz.7 More details on bimanual interaction are presented in Guiard's work [44]. For example, bimanual interaction could be used to localize gestures accurately, using one hand to perform the actual gesture and the other to localize the gesture application point.
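Such rate limiting can be sketched as a simple time-based filter. This is an illustrative version, assuming timestamped samples in seconds; the actual implementation operates on the live event stream.

```python
def downsample(samples, max_rate_hz=30.0):
    """Drop samples that arrive sooner than 1/max_rate_hz after the last
    kept one, e.g. thinning a 125 Hz mouse stream to at most 30 Hz.
    `samples` is a list of (timestamp_seconds, position) pairs."""
    min_interval = 1.0 / max_rate_hz
    kept, last_t = [], None
    for t, pos in samples:
        if last_t is None or t - last_t >= min_interval:
            kept.append((t, pos))
            last_t = t
    return kept
```

One second of 125 Hz input thus yields on the order of 25 to 30 samples, keeping the recognizer's load bounded regardless of the device rate.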
3.5.2. Glove-based Gesture Recognition. Data gloves, such as CyberGloves [20], and other position capture devices have been used in numerous systems. They allow for rich interaction because the position they capture is reasonably accurate, and most of them include bending sensors on the fingers. The glove interfaced in the course of this thesis is the P5 Glove, which is intended to be used by video game players [31]. It provides 3D position, orientation, finger bending as well as button press input data. The position and orientation are triangulated by sensors on the glove that receive infrared signals from an emitting "tower". At least three of the glove's infrared sensors must be in the field of view of the tower's signals in order to obtain reliable data. Since the data coming directly from the P5 glove is less accurate than that of more sophisticated devices, a filtering stage is needed. A Kalman filter as well as a morphological filter were implemented in order to clean the noisy measurements received from the P5 drivers.8
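For illustration, a scalar Kalman filter for one position coordinate could look like the following sketch. This is a textbook constant-position formulation, not the filter implemented for the thesis; the process and measurement noise values are arbitrary assumptions.

```python
class Kalman1D:
    """Scalar constant-position Kalman filter: smooths one noisy position
    coordinate, as could be done per axis on P5 glove data."""
    def __init__(self, q=1e-3, r=1e-1):
        self.q = q    # process noise variance (assumed)
        self.r = r    # measurement noise variance (assumed)
        self.x = 0.0  # state estimate
        self.p = 1.0  # estimate variance

    def update(self, z):
        self.p += self.q                # predict (state assumed constant)
        k = self.p / (self.p + self.r)  # Kalman gain
        self.x += k * (z - self.x)      # correct with measurement z
        self.p *= (1.0 - k)
        return self.x
```

Running one such filter per axis would smooth the jitter in the triangulated 3D positions before they reach the feature extractor.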
One of the drawbacks of this kind of glove is the area that can be covered, which is limited by the length of the cable linking the glove to the tower. Gesture recognition is, for the moment, performed only with the 3D positions, without considering finger bending. However, with the architecture that will be described in the next chapter, recognizing static hand gestures is possible.
7 125 Hz is the maximum mouse sampling rate on Windows XP.
8 The filters were implemented by François Dinel from the CVSL at Laval University.
FIGURE 3.4. Vision-based gesture recognition setup
3.5.3. Vision-based Gesture Recognition. A general description of the video-based gesture recognition system can be seen in Figure 3.4. After being digitized by frame grabbers, images are passed to an edge-based tracker that analyses and keeps track of the extracted edges. Time-differenced motion detection is used to find moving entities from one frame to the next. A background training stage is also performed in order to differentiate background from foreground edges. Colour blobs inside the edges are taken into account, and are used as a validation stage for object consistency.
For the moment, in order for the user's hands to be tracked, one has to wear coloured gloves (blue and green), so that the tracker is able to disambiguate the two hands. As the tracking of skin-coloured blobs becomes more reliable, a user will not have to wear any accessories at all. However, since the tracker is still under development, this technique was chosen as being the least invasive for a user. The colour blobs are tracked over time and their 2D positions are sent to a camera integrator that generates 3D positions given two distinct camera views.
The relative topology of Camera 1 (in front of the user) and Camera 2 (above the user), shown in Figure 3.4, allows for the extraction of 3D positions. Camera 2 is used to recover the x and z coordinates, whereas Camera 1 is used to recover the y coordinate. No stereo matching or camera calibration is done for the moment, which leads to imprecise hand positioning. Therefore, gesture models are only valid for a particular placement of the cameras, and each time the cameras are moved, a new gesture training phase is needed.
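The camera integration step described above can be sketched as follows. This is an illustrative combination rule only, since no calibration or stereo matching is performed; the function name and coordinate conventions are assumptions made for the sketch.

```python
def integrate_views(front_xy, top_xz):
    """Combine two uncalibrated 2D blob positions into a 3D position:
    the top camera (Camera 2) supplies x and z, while the front camera
    (Camera 1) supplies y. Without calibration the two cameras' x
    estimates will disagree, so the top camera's x is used as-is."""
    x, z = top_xz
    _, y = front_xy
    return (x, y, z)
```

This also makes the limitation explicit: the result is only consistent for the particular camera placement the gesture models were trained with.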
The network communication of the blob positions is ensured by a "server library" that runs a daemon on each of the computers involved in the process.9 A naming service is used to facilitate the connection between two peers involved in the data transmission process. Message data consists of plain text commands formatted to be understood by the receiver. Commands and data are sent with the principle of remote method invocation (RMI) in mind, but with less flexibility and ease of use, since the marshalling is not done automatically. The network communication would benefit from using middleware such as CORBA [68] or similar RMI systems [105]. This issue will be addressed in Section 6.2, which discusses future work.
3.6. Choosing Gestures
A natural gesture is defined as a motion pattern that people tend to use in their everyday interpersonal communications. Recognizing natural gestures performed by users in immersive environments is the ultimate goal for a gesture recognition system that aims at natural interaction between the user and the system. It is, however, quite hard to determine which naturally performed gestures bring meaningful information to the system. Apart from sign language, there is no convention regarding a gesture set and semantics that would allow people to communicate with each other. There are, however, symbolic gestures recognized within cultures (e.g. the "okay" sign, a waving hand) that are meant to support expressive communication or to grab someone's attention when it is impossible to do so with speech. The main use of gestures in everyday life is the deictic form. Deictic or pointing gestures are used to refer to spatial characteristics of objects. They are heavily utilized to draw other people's attention to spatial details or to give orders like "put that there".
Systems exist that make exclusive use of gestures to control a virtual world. Baudel and
Beaudouin-Lafon developed CHARADE [2]. This system, however, has very few available commands,
which are chosen to be as "natural" as possible. For more complex systems in which commands
are more numerous, the user's cognitive load becomes an issue as the number of gestures to
remember increases. This is particularly true when no mapping exists between the command and a
real-world gesture. To alleviate this problem, user interface widgets can be used. We have developed
the pieglass [11], a widget controlled by bimanual pointing gestures. Pieglasses map commands
that do not have a common gesture representation to pointing gestures that are used to select
tools and actions. An advantage of having a layer between gestures and commands in the virtual
world is that the cognitive load imposed on the user is lessened. However, for gestures that can
be mapped to real-world actions, it is clearly advantageous for a user to execute the natural action
directly, without going through the widget, as long as the gesture recognition system is reliable.

9The "server library" was developed by Jeremy R. Cooperstock.
Choosing gestures for reliable interaction is of crucial importance because a gesture recognizer
is never perfect. Recognition errors always occur, either through confusion with other gestures or
because the current sequence is too dissimilar from the model, in which case no sequence is
recognized at all. In general, gestures must be sufficiently distinguishable from one another, so
that a user does not have to invest a large effort in disambiguating the recognition system. Chosen
gestures should also be distinguishable from non-gestures, so that ordinary movement is not
recognized as a known gesture.
3.7. Conclusion
To conclude the gesture recognition module: hidden Markov models are used to recognize
gestures performed by users when controlling a virtual environment. Most of the work focuses on
iconic and deictic gestures. Iconic gestures are used as triggers for commands that do not
necessarily have a natural mapping to a gesture. An example of a natural mapping would be
drawing a circle around a virtual object in order to select it. An example of a non-natural mapping
would be a gesture performed to texture an object, as users would be unlikely to agree on a
common, appropriate representation for a texturing gesture. Deictic gestures are used to point at
objects and move them around in the virtual space.
It is clear that gestures alone are not the best way to interact with a computer, as indicated
by the fact that in everyday life people use multiple modalities to communicate with each other.
Language (speech and writing) is the most structured form of such communication; it has to be
learned and practiced for a person to become proficient at it. Language is good for describing and
naming things, but is quite limited for spatial description. This is why multimodal systems exist:
gestures are used for operations involving spatial parameters, whereas speech is preferred for
giving commands. Both can be used simultaneously, but the two modalities then need to be
integrated in a consistent way, based on a predefined grammar [96].
Although a more accurate recognizer makes for a better user experience, interaction with a
virtual environment does not rely on the recognizer alone. The software that includes the recognizer
plays a large role in the realism of a user's virtual experience. One requirement is a mapping from
each performed gesture to the resulting action in the virtual environment. In order to map tokens
in an input stream onto virtual objects effectively, a flexible and generic software architecture is
needed to ease the configuration of how the system responds to user actions. The architecture
designed to meet these goals is described in the following chapter.
CHAPTER 4
Software Framework
The generality constraint of the proposed software architecture comes from the following
requirements that users might have. Many input modalities may be exploited by different users,
and several output modalities may be employed in the rendering of a virtual environment. The
desired mapping of gestures or other input modalities to consequences on virtual objects can vary
across users. Finally, additional input or output devices can be deployed at runtime. A software
framework is intended to solve the problems of generality, flexibility and extensibility. The basis
of the architecture is a set of classes that define interfaces generic enough to fit the user's needs,
and provide mechanisms that help a programmer specialize the framework for different
applications. Every user-specific component is placed in a dynamically linked library that is loaded
at runtime, as specified in the software initialization process.
The presented software was written in C++ to take advantage of mechanisms such as
polymorphism and inheritance that render the architecture more flexible. Most of the libraries that
provide functionality to the system were only available in C++, which constitutes another motivation
for implementing the framework in that programming language. In order to be portable across most
operating systems, the software uses the ACE (Adaptive Communication Environment) OS wrapper
library [81] to perform operations, such as network sockets or threads, that are not standard on all
operating system platforms. The Unified Modeling Language (UML) was used to model the software
and visualize the relations between the different software components. Several UML class diagrams
are included in this chapter, showing the class relations as well as the most important attributes
and methods. For a quick reference on the UML graphical notation, see the Object Management
Group (OMG) specifications [43]. XML (eXtensible Markup Language), a text format that employs
the tag/attribute (or markup) metaphor to represent tree-like data structures, is used as a
configuration mechanism to facilitate application development. For the reader who is not familiar
with the associated notation, see Annex A.
In this chapter, the implemented virtual environment software framework is presented with an
exhaustive description of all its constituents and configuration mechanisms.
FIGURE 4.1. UML framework's architecture
4.1. Overview of the Framework's Architecture
The framework's software architecture can be seen in Figure 4.1 in logical UML specification.
The class of type Instance is the interface class that needs to be instantiated by a user's application
program. It actually defines an instance of a virtual world as well as the data pipeline that extends
from input to output modalities, including the different actions that can be applied to the entities
that compose the world.
In Figure 4.1, the different software component managers are shown: the InputManager,
OutputManager, InteractionManager, WorldManager and NetworkManager. Every manager can be
configured using actual code or with an XML file, as described in Section 4.9. The configuration
data corresponding to each manager is passed to an initialization method called init. The role of
each manager is briefly described below:
• InputManager: this manager instantiates and starts input modalities, according to the
user's specifications.
• InteractionManager: this manager links input and output modalities. In the initialization
function, a grammar is configured and corresponding actions are instantiated. At runtime,
this manager spawns a thread that processes data received from input modalities and
influences the virtual world according to the grammar loaded in the initialization method.
• NetworkManager: this manager handles networking operations, namely the transmission
of input events, world coherence among multiple instances and the management of
resources shared among participants. It can instantiate a server or connect to a remote
virtual environment while handling incoming connections and received data.
• OutputManager: this manager instantiates and configures every output modality that
the user needs. Examples of these modalities are 2D or 3D displays, a 3D sound system
or a haptic device.
• WorldManager: this manager instantiates and initializes every world entity of which the
virtual world is composed. It is also used to set the behaviours and add corresponding
properties to world entities, a procedure described in detail in Section 4.5.
As can be seen from the previous descriptions, most of the managers are only used during the
initialization stage or to ensure the coherence and management of the data structures. Nevertheless,
they play a crucial role in the proper operation of the system.
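As a concrete illustration, the initialization flow described above can be sketched in C++. The manager and class names follow the text, but the configuration type, the method bodies and the per-manager configuration keys are simplified stand-ins, not the thesis implementation.

```cpp
#include <map>
#include <string>

// Illustrative configuration type: the real framework passes XML-derived
// data to each manager's init method.
struct Config : std::map<std::string, std::string> {};

// Common manager interface: each manager receives only its own section
// of the configuration in init.
class Manager {
public:
    virtual ~Manager() {}
    virtual bool init(const Config&) { m_ready = true; return true; }
    bool ready() const { return m_ready; }
private:
    bool m_ready = false;
};

class InputManager : public Manager {};
class OutputManager : public Manager {};
class InteractionManager : public Manager {};
class WorldManager : public Manager {};
class NetworkManager : public Manager {};

// Instance owns the five managers; init forwards each manager its own
// configuration subsection (keyed by a hypothetical manager name here).
class Instance {
public:
    bool init(const std::map<std::string, Config>& cfg) {
        return m_input.init(get(cfg, "input"))
            && m_interaction.init(get(cfg, "interaction"))
            && m_network.init(get(cfg, "network"))
            && m_output.init(get(cfg, "output"))
            && m_world.init(get(cfg, "world"));
    }
    const Manager& world() const { return m_world; }
private:
    static Config get(const std::map<std::string, Config>& c,
                      const std::string& k) {
        auto it = c.find(k);
        return it == c.end() ? Config{} : it->second;
    }
    InputManager m_input;
    OutputManager m_output;
    InteractionManager m_interaction;
    WorldManager m_world;
    NetworkManager m_network;
};
```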
Figure 4.2 shows the dynamics of the data pipeline, where clouds are worker threads. Every time
an event occurs on the input side, an input token is emitted and added to the interaction manager's
message queue. The latter call is asynchronous, as it is invoked from an input modality's thread.
When an input token is put in the queue, the interaction manager parses it in the processing thread.
The actions that match criteria, as described in Section 4.6, are then applied to world entities, which
are aggregated in the World class that is itself a WorldEntity object type. In order to obtain feedback
from the virtual world to the real world, output modalities have to be instantiated and entities
rendered. The latter is accomplished by calling the render method on an output modality that will
subsequently call the render method on every entity contained in the world. Each entity knows
how to render itself in a given output modality. The entity's properties are used to set variable
FIGURE 4.2. Detailed view of the data pipeline
parameters. The rendering process is generally initiated with a call coming from the Instance class,
which owns a reference to the output modalities.
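The asynchronous hand-off between an input modality's thread and the interaction manager's processing thread can be sketched with a thread-safe message queue. This is a generic illustration of the pattern described above, with tokens reduced to strings; it is not the framework's actual code.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Thread-safe queue: push (the producer side, called from an input
// modality's thread) is asynchronous; pop blocks the consumer until a
// token is available.
class TokenQueue {
public:
    void push(const std::string& token) {
        { std::lock_guard<std::mutex> lk(m_mx); m_q.push(token); }
        m_cv.notify_one();
    }
    std::string pop() {
        std::unique_lock<std::mutex> lk(m_mx);
        m_cv.wait(lk, [this] { return !m_q.empty(); });
        std::string t = m_q.front();
        m_q.pop();
        return t;
    }
private:
    std::mutex m_mx;
    std::condition_variable m_cv;
    std::queue<std::string> m_q;
};

// Drain n tokens on a worker thread, standing in for the interaction
// manager's parse/apply loop; the main thread plays the input modality.
std::vector<std::string> processTokens(TokenQueue& q, int n) {
    std::vector<std::string> applied;
    std::thread worker([&] {
        for (int i = 0; i < n; ++i) applied.push_back(q.pop());
    });
    q.push("moveCursor");  // emitted from the "input modality"
    q.push("circle");
    worker.join();
    return applied;
}
```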
4.2. AData, Generic Data Container
The AData library provides the fundamental data storage mechanism used throughout this
description, and is a prerequisite for having a framework whose internal data format is universal.1
An AData is a generic data container that allows a user to have a common data type for storing
variables whose type is fixed only at runtime. That is, all specialized data structures inherit from
the AData class. As seen in the UML diagram of Figure 4.3, an AData object holds a pointer to a
class of type ADNode, the concrete data container, which is reference-counted. The AData class and
its specializations contain all the methods needed to access the data pointer; it therefore implements
the bridge design pattern, which separates the interface from its implementation [37]. The AData
class also manages the container's memory allocation, acting as a smart pointer for the concrete
data types.
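The bridge/smart-pointer arrangement can be illustrated with a toy reconstruction. The AData, ADNode and ADNString names come from Figure 4.3, but the member signatures here are simplified guesses, not the real library's interface.

```cpp
#include <string>
#include <utility>

// Concrete, reference-counted data node (the implementation side of the
// bridge).
struct ADNode {
    int refcount = 1;
    virtual ~ADNode() {}
    virtual ADNode* Clone() const = 0;
};

// One concrete specialization: a string container.
struct ADNString : ADNode {
    std::string value;
    explicit ADNString(const std::string& v) : value(v) {}
    ADNode* Clone() const override { return new ADNString(value); }
};

// Interface side of the bridge: AData acts as a smart pointer that
// manages the node's reference count.
class AData {
public:
    explicit AData(ADNode* n = nullptr) : m_ptr(n) {}
    AData(const AData& o) : m_ptr(o.m_ptr) { if (m_ptr) ++m_ptr->refcount; }
    AData& operator=(AData o) { std::swap(m_ptr, o.m_ptr); return *this; }
    ~AData() { Release(); }
    bool IsValid() const { return m_ptr != nullptr; }
    ADNode* ptr() const { return m_ptr; }
    // Decrement the count, deleting the node when it reaches zero.
    int Release() {
        if (!m_ptr) return 0;
        int r = --m_ptr->refcount;
        if (r == 0) delete m_ptr;
        m_ptr = nullptr;
        return r;
    }
private:
    ADNode* m_ptr;
};
```

Copying an AData shares the underlying node and bumps its count; the node is freed only when the last AData referring to it is destroyed.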
There are several formats in which data can be stored. At the moment, two types are supported:
STL (ASTL, from the Standard Template Library) and XML (AXML). The former data classes inherit
from both ADNode and an STL data container class such as map, set, vector or list. There is also a
class template for simple types, namely double, float, integer and string. The XML format, on the
other hand, is used to store data in a document object model (DOM) whose document type is
predefined. It is possible to read XML files into the AXML data format and vice versa. In order for
XML data to be used in the application, converters exist for transforming data from AXML to ASTL.
Similarly, ASTL data can be converted to AXML, which is used for data serialization. The structure
data type does not exist in ASTL, but it is possible to build a structure-like data type with the
template Map<AData, AData>. The first template argument should be of type String, being the
identifier, and the second can be of any type, being the member.

FIGURE 4.3. AData library class structure

1The software library AData, which stands for APIA Data, was developed at the Computer Vision and Systems Laboratory (CVSL), Laval University, as part of the APIA (Actor, Property, Interaction Architecture) project [3].
XML was chosen as the format in which an AData is stored in files because it is a human-readable
text format and it makes it easy to include AData in more complex configuration files, as described
in Section 4.9. XML parsers such as Xerces [35] also allow for data validation, provided a document
type definition (DTD) or a more sophisticated descriptor such as XML Schema, which is an essential
feature for complex data representation.
In addition to being helpful for file input/output, AXML is used as the data format transmitted
over the network. It certainly introduces a significant amount of overhead due to the redundancy
of information in the XML format, but, where necessary, it would be possible to write a converter
to serialize and deserialize packets such that the quantity of data pushed over the network is
reduced.
The abstract factory design pattern is used to instantiate custom converters from dynamically
loaded libraries. Hence, facilities are provided to extend the library for users' needs. The rationale
for using the AData library in the current framework comes from the fact that the information that
passes through the data pipeline is not known at compile time, but is configured by users at
runtime. More details on where AData is used in the current framework are presented in the
following sections.
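A minimal sketch of such a converter factory follows. In the real framework, the makers would be registered by dynamically loaded libraries; here registration is done directly, and the Converter interface and all names are illustrative, not the thesis API.

```cpp
#include <cctype>
#include <functional>
#include <map>
#include <memory>
#include <string>

// Illustrative converter interface (the real converters transform
// between AXML and ASTL representations).
struct Converter {
    virtual ~Converter() {}
    virtual std::string convert(const std::string& in) const = 0;
};

// Factory in the spirit of the abstract factory described above:
// converters are created by name from registered maker functions.
class ConverterFactory {
public:
    using Maker = std::function<std::unique_ptr<Converter>()>;
    void registerMaker(const std::string& name, Maker m) {
        m_makers[name] = std::move(m);
    }
    std::unique_ptr<Converter> create(const std::string& name) const {
        auto it = m_makers.find(name);
        return it == m_makers.end() ? nullptr : it->second();
    }
private:
    std::map<std::string, Maker> m_makers;
};

// A trivial converter used only to exercise the factory.
struct UpperCaseConverter : Converter {
    std::string convert(const std::string& in) const override {
        std::string out = in;
        for (char& c : out)
            c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
        return out;
    }
};
```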
4.3. Modalities
Considerable research effort has been invested in multimodal systems in the past decades [10,
55, 96]. Multimodal interaction allows for rich communication between the user and the system.2
Modalities, whether inputs or outputs, are usually associated with devices (e.g. data glove, video
camera, mouse, video card, sound card), whose interfaces are not trivially adaptable to every
encountered application. This section presents class interfaces that aim at providing an abstraction
such that several input and output modalities share a common data format and pipeline.
4.3.1. Input Modalities. Systems such as VRPN [90] make the differences between several
devices transparent to a user through a common interface, but leave the problems of data pipeline
and virtual world management unresolved. Unlike VRPN, the current software framework integrates
input and output modalities as well as the virtual world in a standard data pipeline that is
configurable at runtime. A VRPN stream could, however, be used as an input modality of the
current framework, given a data converter that transforms data from VRPN data types to the
AData format. In fact, any kind of event-based input modality is technically supported, though
building input tokens introduces an overhead.
Figure 4.4 shows the class structure of a vision-based gesture recognition input modality,
implemented to meet the associated design goal and to demonstrate the flexibility of the proposed
architecture. The DynamicGestures class inherits from the InputModality class and implements the
abstract method emitToken. In order for the input modalities to be as generic as possible, a
recognizer is aggregated to the modality instead of being inherited. The DynamicGestureRecognizer
class also inherits from the ACE_Thread class in order to spawn a worker thread that reads data
coming from devices or from the network. However, the class is not specialized enough to implement
the svc method, which is the ACE_Thread's entry point method. This is an example of the adapter
design pattern [37], which allows classes to work together even if they have different interfaces. In
the present case, it decouples the ACE_Thread interface from InputModality.

FIGURE 4.4. Input modality class structure example

2Multimodal systems however need a modality integrator; this is ongoing work in the SRE laboratory and is planned to be integrated with the current framework.
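The adapter arrangement can be sketched as follows. ThreadBase stands in for ACE_Thread (the real ACE class spawns an OS thread in activate; here svc is simply called directly for brevity), and the other names are illustrative simplifications of the classes in Figure 4.4.

```cpp
#include <string>
#include <vector>

// Stand-in for ACE_Thread: subclasses implement the svc entry point.
struct ThreadBase {
    virtual ~ThreadBase() {}
    virtual int svc() = 0;            // worker-thread entry point
    int activate() { return svc(); }  // real ACE spawns a thread here
};

// Input modality interface, deliberately independent of any threading
// API; it only knows how to emit tokens.
struct InputModality {
    std::vector<std::string> emitted;
    void emitToken(const std::string& id) { emitted.push_back(id); }
};

// Adapter: the recognizer inherits the thread interface and is merely
// aggregated by the modality, so InputModality never depends on
// ThreadBase.
class GestureRecognizer : public ThreadBase {
public:
    explicit GestureRecognizer(InputModality& m) : m_modality(m) {}
    int svc() override {
        // A real implementation would read device or network data here;
        // we just pretend a gesture was recognized.
        m_modality.emitToken("circle");
        return 0;
    }
private:
    InputModality& m_modality;
};
```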
Next in the class hierarchy is the HMMGestureRecognizer, which inherits from ltiHMMOnlineClassifier,
the hidden Markov model classifier implementation of the LTI-Lib library [79]. The svc method is
not implemented in this class either, because it is still hardware independent and not specialized.
It only provides a hidden Markov model interface that can be used by any kind of device that
needs such a recognizer. The addOnlineData method, with its data vector parameter, should be
called in order to initiate the recognition process. More details on the gesture spotting and
recognition algorithm can be found in Section 3.4.2.
When a gesture is recognized, the corresponding input token is emitted by a call to the emitToken
method. Finally, the concrete implementation of the svc method belongs to the
VisionBasedHMMGestureRecognizer class. The thread method starts a server and waits for incoming
data, calling static call-back functions corresponding to the incoming messages. Feature vectors are
calculated in the performRecognition method and then fed back to the associated recognizer in order
to continue the recognition process.
A data logging mechanism was implemented so that users can diagnose problems in the system,
or simply play back data that was recorded using the same logging mechanism. The InputModality
class finally allows for interfacing numerous kinds of devices that can be accessed locally or through
the network, depending on the available hardware configuration.
FIGURE 4.5. Output modality class structure example
4.3.2. Output Modalities. On the other side of the data pipeline are the output modalities,
which are meant to represent virtual data in the real world. In desktop-based computer systems,
the monitor is often the only output channel through which a user receives information from the
virtual world. In virtual reality, three-dimensional display systems are typically used along with
immersive environments in order to render a virtual world as realistically as possible, such that a
user will feel immersed. Several systems limit users' sense of immersion to visual effects [53, 64].
However, vision is not the only human sense through which virtual data can usefully be rendered:
sound is a straightforward way to provide feedback to the user, and haptic devices are becoming
very reliable at realistically rendering tactile effects. It is therefore important for a multimodal
framework to be adaptable to different needs and to provide a sufficiently generic way of passing
data, such that it could theoretically handle any kind of output modality a user desires.
The base class from which concrete output modalities inherit is OutputModality (Figure 4.5).
The interface method to be implemented is renderWorld, which should be specific to every modality
type. The default operation is to call the render method on every entity for which rendering is
needed, since it is possible to disable the rendering of an entity in order to hide it from the real
world. Output modalities own an optional rendering behaviour member attribute, such that only
entities owning this behaviour will be rendered. More details on the behaviour mechanism are
presented in Section 4.5.
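The default rendering logic just described can be sketched as follows. The entity and modality types are reduced to the minimum needed to show the render-flag check and the optional behaviour filter; they do not mirror the real class interfaces.

```cpp
#include <set>
#include <string>
#include <vector>

// Simplified entity: a render flag plus a set of behaviour names.
struct WorldEntity {
    std::string name;
    bool renderEnabled = true;
    std::set<std::string> behaviours;
};

// Default renderWorld behaviour: render every enabled entity, filtered
// by the modality's optional rendering behaviour (an empty string means
// "render everything").
class OutputModality {
public:
    explicit OutputModality(const std::string& behaviour = "")
        : m_renderingBehaviour(behaviour) {}

    std::vector<std::string> renderWorld(const std::vector<WorldEntity>& world) {
        std::vector<std::string> rendered;
        for (const auto& e : world) {
            if (!e.renderEnabled) continue;  // entity hidden from this world
            if (!m_renderingBehaviour.empty()
                && !e.behaviours.count(m_renderingBehaviour)) continue;
            rendered.push_back(e.name);      // a real modality would draw here
        }
        return rendered;
    }
private:
    std::string m_renderingBehaviour;
};
```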
The output modality example shown in Figure 4.6 is an implementation of an OpenGL three-
dimensional display. The implementation of the renderWorld method sets the perspective view
matrix for correctly displaying 3D objects, calls the output modality's renderWorld method and sets
the view matrix back to its original value. Similarly, a 2D OpenGL display will first set the
orthographic view, render the entities that need to be rendered in 2D and restore the original view
matrix.
FIGURE 4.6. Sequence diagram for the rendering call on a three-dimensional OpenGL display output modality
In order to use the already implemented OpenGL output modalities, a user first has to create
an OpenGL rendering context and call the renderWorld method from the thread in which the
rendering context was instantiated. For example, in a Microsoft Windows MFC implementation,
calls to the rendering functions have to be performed in the view's OnDraw method, which originates
from the main loop. Pointers to output modalities are previously retrieved from the output
manager. In a GLUT application, calls to the rendering methods occur from the draw function,
which also originates from the main program's loop.
No other types of output modalities were implemented in the course of this thesis, but all the
building blocks are in place in order to do so. Rendering calls are not specific to particular devices
or software interfaces; thus, several kinds of systems could be adapted to the one provided by the
current framework.
4.4. Input Tokens Principle
It was mentioned earlier that input tokens are emitted from input modalities and subsequently
parsed by the interaction manager. In this section, details are provided on the composition of input
tokens, and their role in the generality of the framework is explained.
FIGURE 4.7. Input tokens UML class representation
The InputToken's UML class specification can be seen in Figure 4.7. A list of the most important
parameters that define an input token's content follows:
• m_tokenID is the string that identifies a token. It is usually meaningful to a user in order
to facilitate debugging.
• m_probability is the probability of occurrence of a given token. It typically ranges from
0 to 1. Log-probability can also be used, in which case the value will be negative.
• m_timeStamp is the time at which a token was emitted. There is no synchronization
between different computers distributed over a network, but it would be possible to
implement a time synchronization service as in VRPN [90] or CORBA's ORB time
service [68].
• m_source identifies the input modality from which a token originates. This attribute is
typically used when tokens are integrated for multimodal interaction. It is a way of
knowing whether a token was emitted by speech, gesture or another input modality.
When a token comes from a remote computer, "remote_" is prepended to the source
string in order to indicate that the token's origin is not the local host.
• m_data is used as a generic parameter container of type AData. Any kind of AData,
which implements most commonly used types, can be stored as a parameter in an input
token. This allows for passing the context of occurrence of an event as well as any other
information that could be of interest to the token parser and the action's invocation
method.
• m_needPublish identifies whether or not a token needs to be published on the network
if there are client connections. By default, a token is not published, because it might not
be relevant for remote virtual environments. The publication variable can, however, be
set with the publish method.
• m_instance is a token instance number that is set in order to keep track of the sequence
in which tokens are created by the different input modalities and modality integrators.
The input token interface allows for serialization and deserialization to and from AData. This
feature is particularly useful when reading tokens from streams such as files or the network, since
AData is convertible to and from XML. Insertion and extraction operators are provided in order to
write a token to a stream or read a token from a stream.
The rationale behind the use of such generic tokens is that the types and members of the data
structures, which have to be specified at runtime, are not known beforehand. Since AData allows
for the composition of most commonly used data structures, the overhead that such a container
brings to the system is an acceptable price for a general knowledge representation.
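A field-for-field sketch of the token described above might look like the following; the fields follow Figure 4.7, but the generic AData payload is approximated by a plain string map, and the method bodies are illustrative rather than the thesis code.

```cpp
#include <map>
#include <string>

// Simplified InputToken mirroring the fields listed above.
struct InputToken {
    std::string tokenID;        // identifies the token, e.g. "circle"
    double probability = 1.0;   // 0..1, or negative if log-probability
    unsigned long timeStamp = 0;
    std::string source;         // originating modality, e.g. "mouseTracker"
    std::map<std::string, std::string> data;  // stands in for m_data (AData)
    bool needPublish = false;   // not published on the network by default
    int instance = 0;           // sequence number across modalities

    void publish() { needPublish = true; }

    // Tokens received from another host get "remote_" prepended to their
    // source so their origin can be distinguished from local tokens.
    void markRemote() { source = "remote_" + source; }
};
```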
4.4.1. Input Token Example. An example of input token use is the following: suppose
a user performs a circular shape with the mouse. The MouseHMMGestureRecognizer class will emit
a "moveCursor" token every time the physical mouse is moved. The data parameter associated
with this token is the current position (a vector of length 2) and the identification string of the
mouse that moved.3 When the "circle" gesture is complete and has been recognized, with the
method described in Chapter 3, as being an actual "circle" gesture, the corresponding token will
be emitted with the center and amplitude of the performed gesture stored as parameters in the
AData container. In the case of a "circle" gesture, the amplitude parameter is the diameter of the
performed circle. The two parameters are represented using double precision numbers, the center
being a two-dimensional vector and the amplitude a single number. These are stored in a Map that
has Strings as keys, respectively "applicationPoint" and "amplitude". After an idle time, typically
half a second,4 a "still" token will be emitted with the current mouse position as a parameter. All
the previously mentioned tokens will have their source attribute set to "mouseTracker" in order
to identify their origin.
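Building the "circle" token from this example might look like the following sketch. The parameter names, token ID and source string come from the text; the AData Map is approximated by a map from strings to vectors of doubles, and the Token type itself is an illustrative stand-in.

```cpp
#include <map>
#include <string>
#include <vector>

// Payload approximation: parameter name -> vector of doubles.
using Params = std::map<std::string, std::vector<double>>;

struct Token {
    std::string id;
    std::string source;
    Params params;
};

// Assemble the "circle" token as described in the example: the gesture's
// center ("applicationPoint", a 2D vector) and its diameter ("amplitude",
// a single number), emitted by the mouse tracker.
Token makeCircleToken(double cx, double cy, double diameter) {
    Token t;
    t.id = "circle";
    t.source = "mouseTracker";
    t.params["applicationPoint"] = {cx, cy};
    t.params["amplitude"] = {diameter};
    return t;
}
```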
Another example, of a token emitted from a speech recognition algorithm, would be the following:
suppose the utterance "Delete the blue chair" is recognized. The speech recognizer is assumed to
be able to separate a sentence into its words and to parse a grammar that has a dictionary
containing word categories. Therefore, the emitted token will be identified as "delete" and the
parameters will have the values "blue" and "chair", corresponding to "colour" and "objectType"
as parsed by the grammar. The rest of the processing is computed in the interaction manager,
which will invoke actions associated with the "delete" token on the corresponding entities.

3The format of the mouse identification string is "mouseX" where X is the mouse identification number.
4Half a second is an empirical, arbitrary value.
4.5. World, World Entities and How to Manage Them
A conceptual model of the virtual world is needed to help the framework's developers understand
how to approach a given problem, such that they can adapt their specific application to the current
system. The proposed world model is inspired by APIA (Actor-Property-Interaction Architecture),
a conceptual model developed at Laval University by Bernier et al. [3].
4.5.1. Components Description. In APIA, Actors are abstract virtual objects (e.g. a
submarine), and do not contain any concrete attributes. Properties implement the actor's attributes
and can be of any type, as specified by the AData (e.g. mass, volume). Interactions are the links
between the actors (e.g. the Archimedean force), where calculations occur in order to modify the
properties further. Other characteristics that define an actor are its characters. A Character is a
group of properties that define the dynamic characteristics an actor might be able to exhibit
(e.g. floatable). Characters also define other relationships between interactions and actors that
allow for more general management of the invocation of interactions.
However, as the APIA architecture is still under development and remains too complex for
the currently intended applications, we devised a simpler alternative for our purposes. Based on
the APIA concepts, a simplified conceptual model for the virtual environment was built, though it
is more restrictive, being oriented toward multimodal-based world management (Figure 4.8). Actors
become the World and WorldEntities, Properties remain the same concept, Interactions become
Actions and Behaviours share some characteristics of APIA's Characters. A detailed description of
the pieces that compose the proposed virtual world model is presented in the enumeration below:
• WorldEntity: a world entity is the world itself and any constituent of the world, as seen
in Figure 4.8. It is something that must be rendered by an output modality, and its
properties modified by actions.
FIGURE 4.8. World, world entities and actions class structure
• Property: a property is a named AData that is aggregated in a world entity in order to provide it with some attributes that will be exploited by the actions, as well as by output modalities in the rendering method. Properties are stored in a Map whose key is the property's name, contained in a character string.
• Behaviour: a behaviour is defined as the way a world entity should react to input tokens.
It is a characteristic that defines the behaviour of an object in the virtual world. One
or more properties are associated with every behaviour, such that an entity with a given
behaviour will necessarily own those properties.
• Actions: an action may be viewed as the implementation of one or many behaviours. It is where the actual calculations and property changes are made. Actions are applied to entities by taking the input token's AData attributes and adjusting the entity's properties according to what the doApply method implements. An action is associated with an input token's identification string, which will trigger its invocation under certain conditions when emitted.
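The structure described in the enumeration above can be illustrated by a minimal sketch. The following C++ fragment is not the framework's source: member names and signatures are assumptions inspired by Figure 4.8, and properties are reduced to plain doubles instead of polymorphic AData values.

```cpp
// Hypothetical sketch of the simplified world model: an entity carries a
// name, a string-keyed Map of properties, and a set of behaviours.
// Member names follow the text; signatures are illustrative assumptions.
#include <map>
#include <set>
#include <string>
#include <utility>

using Property = double;  // stand-in for the polymorphic AData type

class WorldEntity {
public:
    explicit WorldEntity(std::string name) : m_name(std::move(name)) {}

    // Properties are stored in a map whose key is the property's name.
    void addProperty(const std::string& key, Property value) { m_properties[key] = value; }
    Property getProperty(const std::string& key) const { return m_properties.at(key); }

    // A behaviour declares how the entity may react to input tokens.
    void addBehaviour(const std::string& behaviour) { m_behaviours.insert(behaviour); }
    bool hasBehaviour(const std::string& behaviour) const {
        return m_behaviours.count(behaviour) != 0;
    }

private:
    std::string m_name;
    std::map<std::string, Property> m_properties;
    std::set<std::string> m_behaviours;
};
```

An action would then read the entity's properties through this interface and modify them in its doApply method.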
4.5.2. Examples. An example of a configured three-dimensional model world entity can be
seen in Figure 4.9. It owns the behaviours "pickable", "translatable", "deletable" and "rotatable". It should be clear that these behaviours are respectively associated with the actions "pick", "translate", "delete" and "rotate", among others, since the action-to-behaviour mapping is not necessarily one-to-one. Properties are also part of the requirements that allow the previously mentioned actions to be applied to the 3D model. In the current example, three different behaviours rely on the "position" property, which in fact is unique. This property is therefore shared between the different actions.

[Figure content: a "3D model" entity owning the behaviours "pickable", "translatable", "deletable" and "rotatable", with the properties "picked" (Boolean), "position" (Vector3), "rotation" (Vector3) and "type".]

FIGURE 4.9. 3D model configuration example
A concrete example of an action is "rotate" that is activated by the "moveCursor" token when
in "rotating" state. In the doApply method, the mouse position is retrieved, having prior knowledge
that the given parameter is stored under the name "position" in the input token's parameters. The
"rotation" property of the affected world entity is also retrieved, and the correspondence between the position attribute and the rotation property is made, with the appropriate calculations and mappings.
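Such a doApply implementation can be sketched as follows. The helper name doApplyRotate and the linear pixels-to-degrees mapping are invented for illustration; the actual framework operates on AData attributes rather than plain doubles.

```cpp
// Hedged sketch of a "rotate"-style doApply: read the "position" parameter
// from the input token and map it onto the entity's "rotation" property.
// The gain of 0.5 is an invented placeholder; the thesis only states that
// "appropriate calculations and mappings" are made.
#include <map>
#include <string>

using Params = std::map<std::string, double>;

// Applies the (hypothetical) rotation mapping in place.
void doApplyRotate(const Params& tokenParams, Params& entityProps) {
    const double gain = 0.5;                       // assumed pixels-to-degrees factor
    double position = tokenParams.at("position");  // parameter name fixed by convention
    entityProps["rotation"] = position * gain;     // property name fixed by convention
}
```

Note that the parameter name "position" and the property name "rotation" must be agreed upon by the input modality, the action and the entity configuration, as discussed below.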
As noted in the previous paragraph, a considerable amount of information has to be known
before using the proposed architecture. Input modalities, actions, world entities and behaviours
have to be consistent in the way each of their properties' and parameters' names and types match.
Documentation on the currently implemented components is presented in Annex B.
4.6. Interaction Manager
The interaction manager is the heart of the framework because it is where decisions are made
as to whether or not an action will be applied to an entity given an input token. The interaction
manager's principle was briefly discussed in earlier paragraphs, but a more detailed description of
the process by which actions are triggered by input tokens is presented in this section.
FIGURE 4.10. Interaction manager and auxiliary classes
As can be seen in Figure 4.10, the InteractionManager class is linked to the InputToken, WorldEntity and Action classes. The InteractionManager class inherits from ACE_Thread, an object-oriented implementation of a thread from the ACE OS wrapper library [81]. The ACE_Thread class provides, among others, methods for the management of a FIFO data structure in which messages are queued using the putq method and dequeued using the blocking getq method, which unblocks when a new message is put in the queue. The getq calls are invoked from the svc method, which is the thread's entry point. The svc method does not exit its loop until a "NULL" message is put in the queue, which happens when a call to fini is made.
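The putq/getq mechanism can be illustrated with a simplified stand-in built on the C++ standard library rather than ACE; a null message plays the role of the "NULL" shutdown sentinel that fini puts in the queue. This is a sketch of the semantics, not the ACE implementation.

```cpp
// Simplified stand-in for the putq/getq message queue: a thread-safe FIFO
// whose getq blocks until a message is available. A null message signals
// shutdown, mirroring the "NULL" message that terminates the svc loop.
#include <condition_variable>
#include <memory>
#include <mutex>
#include <queue>
#include <string>

class TokenQueue {
public:
    using Message = std::shared_ptr<std::string>;  // nullptr means "shut down"

    void putq(Message m) {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_queue.push(std::move(m));
        }
        m_cond.notify_one();
    }

    // Blocks until a message is available, like the blocking getq call.
    Message getq() {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cond.wait(lock, [this] { return !m_queue.empty(); });
        Message m = std::move(m_queue.front());
        m_queue.pop();
        return m;
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cond;
    std::queue<Message> m_queue;
};
```

A svc-style loop would repeatedly call getq and exit as soon as the null sentinel is dequeued.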
Algorithm 4.1 Interaction manager's token parsing algorithm

  Get a token from the queue (getq)   {an input modality asynchronously puts a token in the queue}
  parseToken:
    for every action do
      if token ID = action activation ID and action activation state = system state then
        active actions list ⇐ action
    for every active action do
      for every world entity do
        if world entity owns all behaviours and behaviour data match then
          potential entities list ⇐ world entity
      valid entities list ⇐ validate the potential entities against the action
      for every valid entity do
        apply the action to the current entity
    Publish the token
As can be seen in the pseudocode of Algorithm 4.1, the getq method returns when an input token is put in the message queue, which consequently executes parseToken. This latter method will
retrieve the token's identification string and compare it with every action's activation string. When an action's activation string matches the token's identification string, the action is put in a list of active actions that could be applied if its invocation state also matches the current system state. The interaction manager's "state" attribute is the mechanism chosen to impose a context on the actions' execution. The state is stored internally as a character string that is set from the configuration file and adjusted every time an action is applied successfully. If the state condition is set to "any", the action is applied regardless of the system's state. An example of a state condition is "translating", which allows actions such as "translation" and "drop" to be applied.
The next step in the validation process is to find the world entities that are possibly affected by
the active actions triggered by the incoming token. The first test is to verify whether the entity owns every behaviour the action needs in order to be applied. An entity should possess all behaviours that an action necessitates; otherwise, errors will occur at run time if properties required by the action are not owned by the world entity, or if their types do not match. Secondly, the data associated with
each of these behaviours and the input token's data must be equal in order to validate the action's
execution. An entity's behaviour data is stored internally in a Map whose key is the behaviour's
character string. A typical example of behaviour data use is when drawing or moving cursors on
the screen. Tokens are sent every time a mouse event occurs with one of their parameters being the
cursor's identification string. The cursor drawing action has a behaviour data that should match the
token's cursor identification data string. This verification ensures that if there are multiple mouse
instances, a virtual cursor will move only when the identification strings match.
A valid entity is additionally one whose properties match the action's prerequisites for invocation. Once all the possibly valid entities have been verified against the input token, they must be confirmed against the action. This process takes place in the validateEntities call of the Action class. The rationale for the validation process is that several actions might have the same criteria as to which entities among several are suitable for modification. It is therefore an obvious method of code reuse and easier error tracking, since the next step of the action execution process is to apply the action separately to every valid entity. The action's invocation is the last step of the data pipeline that ranges from the input modalities to the virtual world. This is where the input token's data members are considered, and used to modify the entities' properties according to what the action actually
implements. Finally, an input token can be published on the network if peers are connected. This last step will be explained in Section 4.8.

[Figure content: the data saving process copies an entity's properties into an action tracker that is pushed on the undo list; the undo process copies the tracker's saved properties back into the entity and moves the tracker to the redo list.]

FIGURE 4.11. Data saving and undo processes
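The token-parsing pipeline of Algorithm 4.1 and the behaviour checks described above can be condensed into the following sketch. All types are simplified (behaviours reduced to string sets, behaviour data and property validation omitted) and the function and field names are illustrative only, not the framework's API.

```cpp
// Minimal sketch of the token-parsing pipeline: select actions whose
// activation token and state condition match, then keep only the entities
// that own every behaviour the action requires.
#include <set>
#include <string>
#include <vector>

struct SimpleAction {
    std::string activationToken;
    std::string stateCondition;        // "any" matches every state
    std::set<std::string> behaviours;  // behaviours a target entity must own
};

struct SimpleEntity {
    std::string name;
    std::set<std::string> behaviours;
};

// Returns names of entities to which some active action may be applied.
std::vector<std::string> parseToken(const std::string& tokenID,
                                    const std::string& systemState,
                                    const std::vector<SimpleAction>& actions,
                                    const std::vector<SimpleEntity>& entities) {
    std::vector<std::string> valid;
    for (const auto& action : actions) {
        if (action.activationToken != tokenID) continue;
        if (action.stateCondition != "any" && action.stateCondition != systemState) continue;
        for (const auto& entity : entities) {
            bool ownsAll = true;  // the entity must own every required behaviour
            for (const auto& b : action.behaviours)
                if (entity.behaviours.count(b) == 0) { ownsAll = false; break; }
            if (ownsAll) valid.push_back(entity.name);
        }
    }
    return valid;
}
```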
The interaction manager class provides other utilities that help with entity management. Entity locking and unlocking methods are provided so that actions requiring exclusive access to an entity are able to lock and unlock it. Locking is done per network instance to ensure data coherence, which
means that requests for locking pass through the network communication process, described in detail
in Section 4.8.
The last feature the interaction manager provides to the system is an undo/redo facility. Since
the current system can be used for human-computer interaction, it is important to provide a way
for users to undo actions that are unwanted or incorrectly performed after recognition. Undoing
actions is the consequence of a basic Hel principle from Nielsen [66J who urges designers to "help
users recognize, diagnose, and recover from errors". Providing undo facilities helps users to recover
from errors since they can go back in the history of applied actions.
As seen in Figure 4.10, the interaction manager contains two lists of pointers to the ActionTracker class. Instances of this class are used to store entities' properties temporarily before an undoable action is applied to the entity. Figure 4.11 shows the process by which properties are saved and stored
in the undo and redo lists aggregated in the interaction manager. A pointer to an ActionTracker
object is pushed on the undo list when a user performs a reversible action. Similarly, the last
ActionTracker object in the undo list is popped, updated and pushed back on the redo list when an
action is undone. The redo list is emptied when a new action tracker is pushed on the undo list.
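The undo/redo bookkeeping just described can be sketched as follows. The snapshot is reduced to a plain string here, whereas the actual ActionTracker stores a map of an entity's saved properties; the class and method names are assumptions.

```cpp
// Sketch of the undo/redo lists: a snapshot is pushed on the undo list
// before an undoable action runs; undoing moves it to the redo list, and
// any new action clears the redo list.
#include <cstddef>
#include <string>
#include <vector>

class UndoRedo {
public:
    // Called before an undoable action is applied to an entity.
    void recordAction(const std::string& snapshot) {
        m_undo.push_back(snapshot);
        m_redo.clear();  // a new action invalidates the redo history
    }
    bool undo() {
        if (m_undo.empty()) return false;
        m_redo.push_back(m_undo.back());
        m_undo.pop_back();
        return true;
    }
    bool redo() {
        if (m_redo.empty()) return false;
        m_undo.push_back(m_redo.back());
        m_redo.pop_back();
        return true;
    }
    std::size_t undoDepth() const { return m_undo.size(); }
    std::size_t redoDepth() const { return m_redo.size(); }

private:
    std::vector<std::string> m_undo;
    std::vector<std::string> m_redo;
};
```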
4.7. Taking Advantage of the Context
Research has been published on the influence of the context on recognition rates and system
performance [16,50,62,65,80]. In fact, the context of an application provides important clues
as to what a user could be doing in the virtual environment at every moment. The context can
be defined as the information set that influences observations. This could include the user's orientation towards objects, the objects' position and state in the virtual world, the last performed action, or any clue that would help the system predict which actions are the most likely to occur
next. Observations, on the other hand, make up the data set that originates from input modalities
in order to come to a decision at a given moment. For example, if it is known that a user selected
a virtual object in the world, it is likely that the next commands will be applied to that object.
Those commands are known since an object has a finite set of applicable actions determined by the
associated behaviours.
Many techniques exist in order to retrieve the context out of a virtual environment. It is possible
to consider the user's state progression and simultaneously examine the objects' state in order to
draw a relation between the two that would define whichever action is likely to happen. For example,
it could be observed that a user's hand targets a defined virtual object just by looking at the hand
position's trajectory in space, and interpolating the target position the user is trying to reach. This
method is interesting because it uses the movement dynamics in order to predict actions that are
likely to happen. However, this technique is only applicable to gesture-based interactions since it
would not be possible to obtain any kind of context from, for instance, raw speech dynamics. Using a grammar can be an interesting way of providing context to a system. There are many types of grammars, among which some are stochastic and others are simply implemented in the style of a deterministic finite state machine (FSM).
As seen in Figure 4.12, the input modalities and the interaction manager take part in the context-grabbing process. It is the input modality's role to call the context grabber's method getEmissibleTokensIDs, which is meant to provide a list of tokens that can be emitted, given the current context. The information collected from the context grabber is typically passed to the recognizers in order to influence the initial probability of known models.

[Figure content: the ContextGrabber class, holding references to the InteractionManager (with its action list) and to the input modalities, exposes a single method, getEmissibleTokensIDs.]

FIGURE 4.12. Context grabber's class interface
The HMMGestureRecognizer is the currently implemented recognizer that takes advantage of
the context. Hidden Markov models of known gestures are stored in a list and identified by their
corresponding emitted tokens. When context information is not used, all models have the same initial probability of occurrence in the hypothesis generation stage. Therefore, if two gesture models are similar enough to confuse the recognizer on a given gesture sequence, recognition errors will arise more often, even though only one of the two gestures would have made sense in the current context. However, when context is provided, constraints are applied on the initial hypotheses in order to restrict the number of models for which an associated gesture can occur. This restriction has two positive effects: first, fewer recognition errors are likely to happen; second, the recognition process computes faster because, instead of generating hypotheses for every model, only those that can occur given the context are considered, thus reducing the number of hypothesis likelihoods to compute.
The question now is how to know which tokens are likely to be recognized, or, in other words, how to build the context. In the current framework, the context grabber's implementation uses a finite state machine (FSM) in order to get the conditions in which events can occur. The state attribute is stored in the interaction manager object as a character string. The default state of the machine is "idle", from which actions can take it to another value while being applied to the world entities. The state condition that an action must meet in order to be part of the context is stored in the Action class as a character string. The state in which the interaction manager is to be set after an action's invocation is also stored in the Action class instances. The latter two attributes are user-defined.
For every call to getEmissibleTokensIDs on the context grabber, the action list is parsed. The activation token of each action whose state condition matches the interaction manager's current state is pushed onto the list of token identification strings that can be emitted at the moment. It should be noted that if an action can be triggered regardless of the interaction manager's state, its activation state should be set to "any". The activation token of an action having "any" as its state condition is always added to the list of tokens that can be emitted, and is hence context independent.
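A minimal version of this walk over the action list might look as follows. The representation of actions as (activation token, state condition) pairs is a simplification of the Action class and is an assumption for illustration.

```cpp
// Sketch of the context grabber: any action whose state condition equals
// the current state, or is "any", contributes its activation token to the
// list of tokens that can be emitted in the current context.
#include <string>
#include <utility>
#include <vector>

// (activation token, state condition) pairs stand in for Action objects.
using ActionEntry = std::pair<std::string, std::string>;

std::vector<std::string> getEmissibleTokensIDs(const std::string& currentState,
                                               const std::vector<ActionEntry>& actionList) {
    std::vector<std::string> emissible;
    for (const auto& [token, condition] : actionList)
        if (condition == "any" || condition == currentState)
            emissible.push_back(token);  // this token may be recognized now
    return emissible;
}
```

The resulting list is what a recognizer would use to constrain its initial hypotheses, as described above.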
4.7.1. Example. A concrete example of the context-grabbing process using the gesture input modality happens when a virtual object has just been picked for translation. The interaction manager's state is immediately set to "translating", allowing the "translate" and "drop" actions to be
applied. The "translate" action's role is to make the virtual object follow the virtual cursor, which is
being displayed by "moveCursor" and "traceCursor" actions. When the "drop" action is triggered,
the interaction manager's state is set back to "idle". While in "translating" state, the virtual cursors
indicating the trackers' position keep moving since the corresponding actions are activated regardless
of the interaction manager's state, their activation condition being set to "any".
It would be possible to build more sophisticated context grabbers using information originating from the virtual world as well as from the user's status. Adding a grammar would also improve the context-grabbing feature of the system, since it would allow for more general relations to exist between utterances [62]. For example, suppose a speech recognition system in which every virtual object had a descriptor word naming it. The speech recognition system would then attribute a larger start probability to nouns corresponding to objects present in the virtual world. Likewise, the initial likelihood would increase for verbs whose corresponding actions can be applied to the virtual world's objects. Additional context information could also be of interest given other modalities such as gaze or eye tracking, so that the system would know where the user is looking at every moment, putting more constraints on the most likely actions to be triggered.
4.8. Network Manager
One of the framework's design goals is to provide facilities to share a virtual world between
several people geographically distributed over the planet. Networked virtual environments (NVEs) are known to offer services that interconnect remote environments and allow users to take part in collaborative or competitive experiences.
FIGURE 4.13. Network manager and surrounding classes
Several systems providing networking services that allow for virtual world sharing and coherence have been developed in the past [17,18,41,93,100]. Most of them aim at providing ways to ensure data coherence between different virtual world instances. Researchers have developed sophisticated synchronization and data caching systems in order to use network resources as efficiently as possible and reduce network latency. In the course of this thesis, entity synchronization and world coherence were implemented, taking as an inspiration the work of MASSIVE-3 [42]. Methods are provided that send events as well as entities' properties over the network in order to notify remote instances of status changes. Entity synchronization is implemented such that two users are not allowed to change an entity's property simultaneously. The proposed class architecture can be seen in Figure 4.13.
NetworkManager is the interface class that the Instance object has access to. It inherits from the
ConnectionManager class, which holds a list of connection handlers. ConnectionHandler is the class
whose object instantiations will receive data from or send data to peers. To initiate a connection, the
network manager creates a ClientConnector that connects to the specified server with the connectTo
method. The connection handler then adds the newly created connection to the list and starts a
receiving thread that waits for incoming data on the socket until the connection is lost, after which it unregisters itself. When a peer sends data, the thread is woken up and the connection manager's handleReceivedData callback method is invoked.
In the present case, the concrete implementation of the connection manager is the network manager, whose data handler method is executed. This latter method rebuilds packets that arrive incomplete due to packet splitting over the network, and processes the valid incoming data. Since the network managers exchange data in raw XML format, the detection of packet ends is effortless. It is also trivial to interpret the XML packets because they own an attribute in their root node that specifies the packet type. Currently supported packet categories are "token", "props", "lock", "unlock" and "IDRequest", which are described below:
• token: contains an input token that was serialized and sent over the network
• props: contains some entity's properties that were serialized and sent over the network
• lock: an entity locking is requested
• unlock: an entity unlocking is requested
• IDRequest: a new client is connected and requests its identification number
It is possible to register other types of packets with a NetworkRequestHandler that knows which packet type it is meant to receive. Two options are available to the caller: wait on an event to be signalled when a proper packet arrives, or register a callback function to be called when the corresponding packet type is received. Such packet type registration is used when waiting for call replies. For example, when a new instance asks for its instance ID, it registers a packet of type "IDRequest_ack". The reply packet contains an AData member handled by the receiver. In the present case, the contained data is the actual requested identification number.
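The callback path of this registration mechanism can be sketched as follows. The class and method names are invented for illustration, and only the callback option is shown, not the blocking-on-event option.

```cpp
// Sketch of packet-type registration: a callback is registered per packet
// type (e.g. "IDRequest_ack") and the dispatcher invokes it when a packet
// of that type arrives; unhandled types are reported to the caller.
#include <functional>
#include <map>
#include <string>

class PacketDispatcher {
public:
    using Handler = std::function<void(const std::string& payload)>;

    void registerHandler(const std::string& packetType, Handler h) {
        m_handlers[packetType] = std::move(h);
    }

    // Returns true if a registered handler consumed the packet.
    bool dispatch(const std::string& packetType, const std::string& payload) {
        auto it = m_handlers.find(packetType);
        if (it == m_handlers.end()) return false;
        it->second(payload);
        return true;
    }

private:
    std::map<std::string, Handler> m_handlers;
};
```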
In order to create a server, a call to the makePublic method of the NetworkManager class has to be invoked, which starts a new thread and listens on a specified network socket port. When a client connects to the port, the server accepts the connection and creates a connection handler that will eventually be used to dispatch received data. Clients and servers are respectively implementations of ACE's pattern classes ACE_Connector and ACE_Acceptor [81]. These classes provide utility methods that manage basic network socket functions as well as handlers for incoming events.
[Sequence diagram content: the interaction manager tries to lock an entity, the request propagating through the servers up to the master server, which grabs the lock; the action is then applied and the token or properties are published, the data being forwarded to the other peers, which parse the token or update their properties; the entity is finally unlocked through an asynchronous call.]
FIGURE 4.14. Network manager's sequence diagram
Given the network structural design described above, an explanation of how the coherence between several virtual environments is managed can be seen in Figure 4.14. The network manager
is actually used to ensure consistency between several replicas of a virtual world shared over a
network. Two types of instances exist, which are clients and servers. Clients can connect to servers
and then become servers themselves, to which other clients will be allowed to connect. The "master
server" is the one that is first instantiated, run, and which does not connect to any other server
thereafter. It is first responsible for assigning each client an identification number that is used to
know from which peer data packets originate. The second utility of the master server is to manage
the entity locking strategy.
Before modifying an entity's properties shared amongst multiple instances, it is necessary for an action to "lock" that entity, such that only the instance that owns the lock will be able to modify the given properties. The locks are managed by the master server, which keeps an internal representation of the world entities' locking status. When a "lock entity" call is invoked on a client, a lock request is sent to the associated server, which requests its own server, and so on until reaching the master server. A message is sent back, notifying the caller whether the entity was locked. A drawback of this locking strategy is the lack of fairness among instances, since entities are locked on a first come, first served basis. Another problem with this locking scheme is the time it could take to get a response from the master server; if the number of peers that a packet needs to traverse to finally reach the master server is too large, the latency might be unacceptable. The interaction manager would then be blocked while waiting for the master server's response, which is unwanted behaviour. Solutions exist, however, as proposed by Singhal and Zyda [84], but were not implemented in the course of this thesis, since the current work is focused not on network performance but rather on the software framework's generality.
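The master server's first come, first served lock table can be sketched as follows. The method names and the use of integer instance identifiers (the IDs assigned at connection time) are assumptions for illustration.

```cpp
// Sketch of the master server's lock table: an entity may be locked by
// exactly one instance at a time, locks are granted first come first
// served, and only the holder may release a lock.
#include <map>
#include <string>

class LockTable {
public:
    // Returns true if the lock is granted (or already held by this instance).
    bool tryLock(const std::string& entity, int instanceID) {
        auto it = m_locks.find(entity);
        if (it == m_locks.end()) { m_locks[entity] = instanceID; return true; }
        return it->second == instanceID;
    }

    // Only the lock holder may unlock; returns true on success.
    bool unlock(const std::string& entity, int instanceID) {
        auto it = m_locks.find(entity);
        if (it == m_locks.end() || it->second != instanceID) return false;
        m_locks.erase(it);
        return true;
    }

private:
    std::map<std::string, int> m_locks;  // entity name -> holding instance ID
};
```

In the actual system this table lives only on the master server, and clients reach it through the chain of intermediate servers described above.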
After an action's execution, the input token or the entity's properties can be published to the other instances, depending on the event that occurred. Generally, when the action does not involve an entity property change, the associated input token is published on the network if it is of interest to other peers. Likewise, when an entity's properties are changed during an action invocation, they are serialized and sent to the connected clients. Once a peer's interaction manager receives an input token coming from a remote location, it parses it as if it were coming directly from an input modality. However, a mechanism is implemented in order to warn the remote interaction manager that the token does not originate from the local instance, by putting an indication in the input token's source attribute. On the other hand, when properties are published, the corresponding entity is directly updated with the new values. It should be noted that property updates and token parsing in the remote instances are asynchronous. There is therefore a possibility of losing synchrony for a short period of time, which is acceptable in most cases. The entity unlocking process follows the same pattern as the locking strategy.
The presented network communication algorithms would benefit from optimization in terms of
quantity of data transmitted over the network. Raw XML format is not the most compact form of
data, which leads to overuse of bandwidth. We performed a rough estimate of the bandwidth needed
for transmitting XML data, and found that around five times more bytes must be transported than
when using raw data, based on the specified format. Preliminary tests have been performed by
running multiple instances of the framework on a local area network, showing that the developed
system is able to maintain the virtual world coherence for multiple distributed instances. More
exhaustive testing of the network communications is left as future work, and should ideally involve
communication over the Internet in order to verify the algorithms described above.
4.9. XML Configuration File
This section presents the XML configuration file that a user builds in order to fit the needs of a
given application. In order to be exploited, the various software components must be initialized by
users either with hard coded values or through an XML configuration file parsed by an appropriate
interpreter. The former method does not offer the same flexibility as the latter, because the application must be recompiled every time a value is changed. The use of a configuration file, however, allows for flexibility and ease of use, such that non-experts would be able to build one, eventually with a GUI.
The XML configuration parser must be invoked in order to read the specified file and create
a DOM representation of the configuration parameters. Each of the file's sections is then analysed, which leads to the instantiation of software modules and objects according to the specified values. The XML
configuration file's author must know beforehand what content to specify in the file, requiring the
available components to be well-documented as to which configuration properties they expect from
a user. The different parts that compose the XML file are as follows:
• Input modalities: contains all the input modalities' specifications
• Output modalities: contains all the output modalities' specifications
• World: contains the world entities' specifications as well as behaviours owned by the world object
• Grammar: contains all the actions' specifications
• Network: specifies the network parameters, for clients or servers
The next sections describe the format of each XML file component by presenting fragments of a concrete example that is currently implemented. The experimental application shows a three-dimensional world in which 3D models are placed and their properties modified using a mouse-based gesture recognizer. For a complete XML configuration file, see Annex C.
4.9.1. Input Modality Node. An "InputModality" node is meant to create an input modality that is added to the input manager. The next XML code sequence shows an example of the creation of a mouse-based dynamic gesture recognizer:
<InputModality name="mouseGestures" type="DynamicGestures">
  <AXML>
    <map name="instance">
      <string name="type" value="MouseBasedHMMGestureRecognizer"/>
      <map name="data">
        <string name="dataFile" value="gestures.ges"/>
        <int name="smoothingBuffer" value="3"/>
        <int name="buffer" value="100"/>
      </map>
    </map>
    <string name="mode" value="events"/>
    <int name="frameRate" value="30"/>
  </AXML>
</InputModality>
The base node has two attributes: one is the input modality's name and the other its type. The "type" attribute is passed to an abstract factory class in order to instantiate an input modality of the corresponding kind. The factory then tries to load a dynamically linked library that has the same name as the requested type, plus the file extension.5
The first child node has "AXML" as its tag value. This node is meant to provide initialization data to the newly created input modality. The XML file parser will create an instance of an AData that will be filled with the data contained between the opening and closing AXML tags. In the present case, the created AData is a Map whose two template arguments are of AData type. The map's keys, which are of concrete type String, are found in each child node's "name" attribute. The "type" child node is the parameter for another factory, this time the dynamic gesture abstract factory. In this case, the factory tries to load a dynamic gesture recognizer of type MouseBasedHMMGestureRecognizer, which is to be found in a dynamically loaded library. The sub-child
5The standard naming convention adds a "d" at the end of the library name if the program is compiled in debug mode.
data map contains initialization data for the mouse-based gesture recognizer, namely the data file
that contains gesture models, and the lengths of different buffers used in the recognition process.
If needed, it is obviously possible to add other data, as long as classes that use it are modified
accordingly.
4.9.2. Output Modality Node. The "OutputModality" node is present in the configuration
file in order to instantiate the output modalities that will be added to the output manager.
Here is a typical example of such a node:
<OutputModality name="userInterface" type="Display2D">
  <AXML>
    <int name="frameRate" value="30"/>
    <string name="behaviour" value="drawOnTop"/>
  </AXML>
</OutputModality>
In the same flavour as the input modality node, the output modality node has two attributes: one
that specifies the name and another that specifies the type of output modality a user wants to
instantiate. Output modalities are also created with a factory that will use a dynamically loaded
library if it cannot find the concrete type of the object in the known objects list. The AXML data
node contains data that is to be passed to the modality's initialization function. In this case, a frame
rate as well as a behaviour are specified, which means that in order to be rendered in this 2D view,
world entities will need to own a behaviour called "drawOnTop".
4.9.3. World Node. The "World" XML node specifies the world's content as well as entity
types and behaviours. Entities are contained in "WorldEntity" nodes, whereas entity behaviours
are stored in the "Behaviour" nodes.
<World>
  <Behaviours>
    <Behaviour name="resetable"/>
    <Behaviour name="placeable"/>
  </Behaviours>
  <WorldEntity name="trajectoryRight" type="mouseTrajectory">
    <AXML>
      <map name="color">
        <int name="r" value="0"/>
        <int name="g" value="255"/>
        <int name="b" value="0"/>
      </map>
      <int name="length" value="100"/>
    </AXML>
    <Behaviours>
      <Behaviour name="mouseTraceable">
        <AXML>
          <string name="ID" value="mouse1"/>
        </AXML>
      </Behaviour>
      <Behaviour name="drawOnTop"/>
    </Behaviours>
  </WorldEntity>
</World>
In the previous XML snippet, behaviours "resetable" and "placeable" belong to the world.
These two behaviours are used in order for actions "reset" and "place" to be executed on the World
object. Actions that do not involve particular entities, or that create new ones, should always be
executed on the world. The declaration of a world entity uses the same principle as an input or
output modality for its creation. Specifying the type invokes a call to a factory that instantiates an
object if the type is known. Data that will belong to the entity is then specified in the AXML node.
Converted to AData, AXML nodes are added to the entity in the form of properties. For instance,
the entity named "trajectoryRight" will, after its creation, own the properties "color" and "length"
that are to be used in the rendering process. The "color" property is in fact the colour the mouse
trajectory will be displayed in when a cursor is moving on the screen, and "length" is the maximum
number of displayed data points.
Behaviours are specified as children of a node called "Behaviours". A behaviour is not a
concrete class, but rather a character string and its optional AData, stored in a Map in every
entity. In the above example, the entity owns the behaviours "mouseTraceable" and
"drawOnTop". One thing to notice about the "mouseTraceable" behaviour is the AData that is associated
with it. This AData member is one that the input token must match in order for an action to
be applied to the entity, as explained in Section 4.6. In the present case, it defines the identification
string to which the incoming mouse cursor identifier must be equal. Since more than one cursor's
position may be sent, this mechanism restricts the action's invocation to the intended cursor.
As for the behaviour "drawOnTop", it means that the "userInterface" output modality will enable
the rendering of the mouse trajectory since the required behaviour matches. The "userInterface" in
the present case is the last to be drawn, which results in drawing the corresponding entities on top
of the others.
4.9.4. Action Node. The execution of actions is the core of the framework since it establishes
the relation between the inputs and the state of the virtual world. The following XML file segment is
the description of an action:
<Action type="pick" activationToken="translate" when="idle" becomes="translating">
  <Behaviour name="translationPickable">
    <AXML>
      <vector name="position"/>
      <bool name="picked"/>
    </AXML>
  </Behaviour>
</Action>
An action is created by the action factory, which instantiates an object whose type is found in
the attribute "type". The attributes also provide the activation token parameter, the state condition
("when") and the new system's state ("becomes"). Behaviours and associated properties are then
specified. In the current example, the action is "pick", which is meant to pick a virtual object.
It is activated by the "translate" token ID when the system is in the "idle" state, and puts it in
the "translating" state. Properties that are associated with the behaviour "translationPickable" are
"position" and "picked".
4.9.5. Network Node. In order to configure how the current instance will behave regarding
its network connections, a "Network" node can be included in the XML file. The syntax is the
following:
<Network>
  <Connection type="server" port="76849"/>
or
  <Connection type="client" serverName="localhost" port="76849"/>
</Network>
The network manager (see Section 4.8) is configured given the "type" attribute of the
"Connection" node. In server mode, the "port" attribute specifies the port number the server socket should
listen on. In client mode, the "serverName" attribute specifies the host name of the server the client
should connect to, and the "port" attribute specifies the associated port number. If the "Network"
node is unspecified in the configuration file, the default behaviour is to start a server that listens on
port 20202.
4.9.6. Discussion. As a conclusion on the XML configuration file, it should be noted that
it is the user's responsibility, when writing the file, to ensure that it is coherent and consistent with
the available resources and classes. It would however be possible to build a graphical user interface
that would allow configuring the framework and verifying data coherence after the composition of an XML file.
4.10. Conclusion
In conclusion, a general and flexible framework for multimodal interaction was presented. The
software framework allows several different components to be loaded at runtime in order to
meet the user's specifications, whether hard-coded or defined in an XML configuration file. A virtual
world model is defined as being composed of world entities, each having their own behaviours and
properties. Event-based input modalities emit input tokens that contain an event specification as
well as the parameters associated with it. The interaction manager applies actions on matching world
entities, given an input token. Context is provided as to which input tokens are the most likely to
occur, given observations on the world and the user. Networking facilities are implemented in order
to share a virtual world between multiple geographically distributed users. An input modality of
dynamic gestures was used in order to demonstrate the framework's flexibility, as well as a basic
application, which will be described in the following chapter.
CHAPTER 5
Results and Discussion
In order to test the decisions made throughout the design process, experiments and performance tests
were conducted on the software framework and the gesture recognizer. Experiments on the gesture
recognition's reliability justify the chosen feature vector of a gesture data stream. A functional
application, implemented in order to demonstrate the framework's utility in supporting general
multimodal applications, is described. Performance tests show that the interaction manager is not
a bottleneck when the virtual world contains a large number of entities, and that adequate
performance is maintained on the output side. A discussion of the framework's extensibility and flexibility is then
presented, based on the experience acquired while developing the application.
5.1. Continuous Dynamic Gesture Recognition under Several Conditions
Continuous dynamic gesture recognition is the core input modality implemented for this thesis.
It is therefore important to test and justify the basic design decisions. The choice of a proper feature
vector is the basis of the gesture recognition system since the features determine which information
is important in the input signal coming from the capturing devices. A mouse-based input modality
was used in order to characterize the various feature vectors considered. Gesture recognition rate
was measured with multiple position capturing devices: a mouse, a P5 data glove and a vision-based
hand tracker, all sharing the same interface. It should be noted that the gestures used in the
experiments were arbitrarily devised. In a final application, users would have the freedom to define
their own gestures.
Place Pick Delete
FIGURE 5.1. Gesture set used for recognition tests
5.1.1. Choice of Feature Vector. The choice of the feature vector is of crucial importance
for a gesture recognition system because it is the only data, from the raw input stream, sent to the
recognizer. Several different feature vectors were taken into account, using a mouse-based gesture
recognizer input modality, in order to determine the most suitable one for the kind of application in
which the framework is intended to be used, that is, iconic and deictic gestures.
Certain feature vectors were immediately rejected because they were not compatible with the
projected framework's applications. For instance, the position vector cannot be used in the current
system because a gesture would always have to be performed at the same location in order to be
recognized. Constraining gestures to a fixed location, however, runs contrary to the natural
character of a gesture-based user interface. A possible solution would be to train the hidden Markov models with very sparse
data in terms of gesture position. This would, however, lead to a poor recognition rate since the
sparseness of data results in a large variance in the gesture models and further, to the spotting
of incorrect sequences. Another rejected feature vector is the difference vector between successive
data points. The small variation in the values of this feature vector during hand movement causes
confusion for dissimilar gestures or random movement.
Experiments, described below, were conducted to determine which of the remaining proposed
feature vectors should be considered in the current framework.
(1) acquire multiple repetitions of each of the three gestures (see Figure 5.1) discretely, while
logging raw mouse input data in a file1
(2) perform the training of hidden Markov models with the previous data for every considered
feature vector
1 We chose twenty repetitions as an arbitrary number that was not too onerous for new users, yet sufficient to achieve reasonable fidelity in the trained models.
Sequence 11
Features                   Place   Pick    Delete   # ins.   # subs.
Angle vector quantized*    1.0     1.0     1.0      1        0
Angle vector               1.0     1.0     1.0      0        0
Delta vector               1.0     0.3     0.8      3        2
Polar coordinates          0.8     1.0     1.0      0        0

Sequence 12
Features                   Place   Pick    Delete   # ins.   # subs.
Angle vector quantized     1.0     1.0     0.6      0        3
Angle vector*              1.0     1.0     0.9      2        1
Delta vector               0.9     0.6     0.9      1        0
Polar coordinates          0.9     1.0     0.9      3        1

Sequence 13
Features                   Place   Pick    Delete   # ins.   # subs.
Angle vector quantized     1.0     1.0     0.9      1        1
Angle vector               1.0     1.0     0.6      0        4
Delta vector               1.0     0.2     1.0      2        0
Polar coordinates*         1.0     1.0     0.8      0        2

Sequence 14
Features                   Place   Pick    Delete   # ins.   # subs.
Angle vector quantized     0.9     1.0     0.8      3        3
Angle vector               0.9     1.0     0.8      4        3
Delta vector*              1.0     1.0     1.0      3        0
Polar coordinates          0.9     1.0     0.9      0        1

Total
Features                   Place   Pick    Delete   # ins.   # subs.
Angle vector quantized     0.975   1.0     0.825    5        7
Angle vector               0.975   1.0     0.825    6        8
Delta vector               0.975   0.525   0.925    5        2
Polar coordinates          0.9     1.0     0.9      3        4

TABLE 5.1. Recognition rate for feature selection, including the number of insertions (# ins.) and substitutions (# subs.)
(3) acquire multiple repetitions of each of the three gestures continuously in a realistic
application context, while logging raw mouse input data in a file2
(4) for every feature vector, perform the recognition process on the realistic sequence and
measure the recognition rate
The selected feature sets are the following: delta positions (dx, dy), the movement vector
in polar coordinates (r, θ), the movement vector angle (θ) and the quantized movement vector
angle (θq), as suggested by Lee [59]. Although each of the four feature vectors was evaluated
2We chose ten repetitions as an arbitrary number that was not too onerous for new users, yet sufficient to obtain adequate recognition rates.
with respect to its recognition accuracy, recognition was also performed on-line during this process
in order to provide the user with visual feedback. Each sequence in Table 5.1 indicates by an
asterisk which vector was used for this purpose. Forty repetitions of each gesture are therefore
considered to determine which feature vector leads to the highest recognition rate. The recognition
rate is calculated as the ratio of recognized gestures over the number of performed gestures,
thus not counting insertions. An insertion occurs when random movement is recognized as a gesture,
whereas a substitution happens when there is confusion between two models. A deletion occurs
when a gesture is performed without being spotted.
As seen in the results of Table 5.1, every considered feature vector offers approximately the
same recognition rate, and the numbers of insertions and substitutions do not differ significantly. This
result could have been predicted, because "polar coordinates" and "delta vector" actually
provide the same data to the recognizer, albeit in different representations. Since the use of the angle
vector results in approximately the same recognition rates, it can be concluded that the vector
magnitude, or movement speed, does not provide further information to the recognizer in most
cases. Therefore, the angle vector is considered the most meaningful feature vector for the rest of
the tests. Quantizing the angle did not offer significantly better results.
It is however possible to recover information related to the gesture's velocity of execution from
raw data. Since usual gesture capturing devices such as a mouse or data glove are sampled at a fixed
rate, the number of data points that compose the gesture depends on the speed of execution.
The faster a gesture is executed, the fewer data points it is composed of, and vice versa. With the
input token data passing system, it is possible for a user who needs information on the speed of
execution to recover it.
The number of insertions and substitutions for the considered feature vector is relatively high
compared to what would be expected from an accurate gesture recognition system. Most of the
confusion in the gesture recognition procedure occurs when the "delete" gesture is performed, which
is recognized as being the "pick" gesture. This confusion is related to the fact that the "delete"
gesture is sometimes executed with less precision, leading to rounded changes of direction. The
performed gesture hence becomes recognized as being more of a circular shape than a back and forth
movement, which leads to confusion with the "pick" gesture. A possible solution to this problem
would be to train the "delete" gesture with non-ideal sequences, similar to the ones performed
during the recognition stage. Achievable ways of implementing the latter solution would be to use
a larger collection of training data from a wide range of users, or to interactively train the HMMs
during the recognition stage [15,102].
On a more qualitative note, the number of insertions and substitutions observed in
Table 5.1 gives a good idea of which feature vector will be the most suitable for novice or
experienced users. The feature vectors that do not have a high recognition rate yield fewer insertions
or substitutions, but more deletions, the error type that directly lowers the recognition rate.
Therefore, in order to be recognized, gestures have to be performed more precisely, similar to the
ones that were used to train the hidden Markov models. This behaviour would be acceptable for
novice users who do not know perfectly how the system works. It would however not be suitable
for experienced users, who perform gestures faster and less accurately. Their tolerance to recognition
errors is likely to be higher because experienced users know how to recover from classification errors.
Sequence   Place   Pick    Delete   # ins.   # subs.
11         1.0     1.0     1.0      0        0
12         1.0     1.0     1.0      1        0
13         1.0     1.0     0.9      0        1
14         0.8     0.9     1.0      0        3
Total      0.925   0.975   0.975    1        4

TABLE 5.2. Mouse-based gesture recognition rate with improved trained HMMs
5.1.2. Mouse-Based Gesture Recognition Rate. In this section, the training set of
gestures was selectively constructed in order to improve the recognizer's efficiency on the same
recognition sequences as in the previous section. Improving a model consists of adding samples
to the training database, performing the recognition process, and then executing another training
pass. This ensures that the gesture model takes into account sequences similar to
the ones that were not recognized in the first recognition round. This procedure is repeated until
satisfactory results are obtained. The enhanced models were the "pick" and "delete" gestures, which
were too often substituted when using the original training sets. The improvements generally lead
to higher recognition rates and fewer insertions and substitutions, as can be seen in Table 5.2. Results
Loop   Square   Cross   Delete   Angle   Infinity   Fish   Triangle   Circle   Wedge
FIGURE 5.2. Large gesture set
show that over 120 performed gestures, an average of 96% were recognized with a database composed
of three gestures.
For a larger number of gestures, it is expected that the recognition rate will be lower since
gestures are easily confused, especially very similar ones. Figure 5.2 shows ten different gesture
models used as a dataset in order to test the recognizer's performance for a large number of potential
hypotheses. In this experiment, the ten gestures are trained using twenty repetitions each and the
recognition tests use ten continuous repetitions of every gesture.
TABLE 5.3. Mouse gestures recognition rate for a large number of possible gestures, including insertion and substitution errors
As seen in Table 5.3, the recognition rate for a large number of possible gestures is satisfying,
with an average of 88%. The most easily confused gesture is "fish", which is often mistaken
for "loop", likely because these two gestures have similar shapes and starting directions, which
confuses the recognizer since the starting point of a gesture sequence is unknown. It should
be noted that the number of insertions reported in Table 5.3 does not take into account insertions
that occur during random mouse movement, but only the ones that occur during the actual gesture
performance. Gestures akin to "fish", "loop" and "circle" are therefore likely to
be recognized whenever the mouse movement is similar to the trained models. Choosing gestures
distinct from free-hand movement is therefore of crucial importance.
5.1.3. Glove-Based Gesture Recognition Rate. Gesture recognition experiments have
also been performed using a P5 data glove [31]. The training set is, as with the mouse, composed of
twenty repetitions for each of the three considered gestures. The recognition stage consisted of
performing about twenty samples of every gesture.
As seen in the "original models" section of Table 5.4, the recognition rate is much lower than
when the mouse is used as an input modality, with an average recognition rate of 75%. The
disappointing outcome is not a problem of sampling precision, since the P5 data glove offers a resolution
of 0.3 cm [31], which is sufficient given that the amplitude of a typical gesture is on the order of tens
of centimetres. Poor training of the gesture models is the cause of all errors encountered during those
tests. The "place" gesture was not trained correctly for this run, since many insertions occurred
when performing the "move" or "delete" gestures. A human factor explains why it is so hard to obtain
an accurate gesture model in the training: fatigue. The training phase of the experiment consisted of
performing twenty repetitions of each gesture, successively. It is however quite tiring for a human to
hold their arm in the air for long periods of time. A second run of the same experiment was therefore
conducted, this time allowing the user some time to rest after every five gestures, in the expectation
of higher recognition rates.
Original models    Place                 Move              Delete
Recognition rate   0.71                  0.79              0.74
Errors             Substitutions with    Insertions of     Lots of insertions
                   "move"                "place"           of "place"

Improved models    Place                 Move              Delete
Recognition rate   0.81                  1.0               0.81
Errors             Deletions             None              Substitutions with
                                                           "place" and "pick"

TABLE 5.4. Recognition rate of glove gestures with original and improved models
As seen in the "improved models" section of Table 5.4, using enhanced models leads to a higher
average recognition rate of 87% for the three considered gestures, as weIl as fewer errors. Two
conclusions can be drawn from this last experiment: first, the quality of training data significantly
influences the expected recognition rate. The more accurate a model is, the easier the gesture will
be recognized at runtime. Secondly, it is important for free-hand gestures to avoid using ones that
need prolonged holding of the arm in the air. These should therefore only be used when they bring
another dimension to the interaction, such as executing virtual manipulation, as opposed to be
invoking every single operation. A well-known use of gestures is to refer to the spatial dimension of
things, which can be unnatural to specify with other modalities. Commands to a system, however,
can easily be issued using speech, which will be discussed in Section 6.2.
5.1.4. Vision-Based Hand Gesture Recognition. The video-based hand position
capture system described in Section 3.5.3 was employed in order to provide input data to the HMM
gesture recognizer. However, due to the following system limitations, no meaningful data could be
obtained from the preliminary experiments: since the video tracker is still under development, its
performance is significantly below what would be expected of an appropriate tracking system. The
frame rate is approximately 16 position updates per second on a Pentium IV 2.6 GHz, which is not
sufficient for prolonged use without perceiving an annoying lag between the actual hand
movement and the sight of the virtual cursor moving on the projection screen. Ware [99] reports
that hand tracking frame rate and lag are critical for decent interaction with a virtual
environment.
Preliminary experiments were conducted in order to show that hidden Markov models can be
trained using a video-based tracking system with the current framework. Simple gestures have also
been recognized. However, as the tracker does not yet offer all the accuracy and precision
that hand gestures need, more meaningful results should be obtained as the tracker's performance
increases in the future. Nevertheless, the software framework supports vision-based gesture
recognition, which is promising for the integration of additional modules.
5.2. The Context Grabber's Influence on the Recognition Rate
As presented in Section 4.7, the context grabber is used to restrict the number of possible
gesture hypotheses at every moment. This restriction takes advantage of the application's
as well as the user's status in order to eliminate gesture candidates. An experiment
showing differences in recognition rates between the inclusion and exclusion of the context grabber
was conducted, using the large gesture set shown in Figure 5.2. The recognition rate is expected
to be similar in both situations since the same gesture models are used. However, more insertions
of incorrect gestures should be observed when the context grabber is not used, especially for gestures
similar to a user's random movement.
Table 5.5 shows that the recognition rate does not vary significantly whether the context grabber
is included or excluded, though it is marginally lower when the context grabber is turned off.
                        Context ON                 Context OFF
Sequence   Measure      Place   Move   Delete     Place   Move   Delete
11         Rate         1.0     1.0    1.0        0.9     0.8    1.0
           Insertions   1       0      0          2       4      1
12         Rate         1.0     0.9    1.0        0.8     0.8    1.0
           Insertions   1       3      0          1       6      0
13         Rate         0.9     0.9    1.0        0.9     0.8    1.0
           Insertions   2       2      0          2       9      2
14         Rate         0.9     1.0    1.0        0.9     1.0    1.0
           Insertions   2       3      0          1       6      0

TABLE 5.5. Recognition results for a large number of gestures with and without the context grabber
However, a significant number of additional gesture insertions occur without the context grabber.
This simple experiment shows two things: first, taking advantage of the context can help
improve overall recognition performance, especially when gestures share similar shapes (e.g. "loop"
and "circle"). Second, the number of possible gestures in a context should be kept to a minimum,
so that the recognizer is not confused and the processing time needed to generate hypotheses,
which is linear in the number of available gesture models, does not become too long. The rest of
the algorithm runs in constant time, bounded by the time needed to process the valid hypotheses.
This simple experiment also shows that not only should the system take advantage of the
context in order to better recognize gestures performed by the user, but users should also choose their
gestures such that they will not be confused with the random movement that occurs between
two actual gestural expressions. A more complex context grabber would further constrain the
number of gestures to be recognized, which will be discussed in Section 6.2.
5.3. The Experimental Application
An experimental application of the framework, which takes advantage of the available gesture
recognition input modalities, was implemented in order to demonstrate and test the validity of the
different concepts presented throughout this thesis. The application allows a user to place and
modify the appearance and state of virtual objects in a three-dimensional (3D) world. It is possible
to place three-dimensional models in the virtual space, or two-dimensional images on the screen
FIGURE 5.3. A typical scene from the experimental application
plane.3 Those entities can then be moved around in the virtual world, using hand gestures.
Three-dimensional models can also be textured and rotated using hand gestures. Entities can be deleted,
and a history of applied actions is kept in order to provide undo and redo facilities.
A typical view of a virtual world might look like the one shown in Figure 5.3. In this scene,
different 3D models (some chairs, a plant and a fridge) were placed, moved, rotated and textured
in order to demonstrate the purpose of the current framework. Concrete actions were programmed
in order to allow the aforementioned operations on objects. The specifications of those actions, and
what they expect from an input token and world entities, can be seen in Annex B.1.
In addition to the mouse-based gesture recognizer, the "clockTick" input modality is used in
order to periodically emit tokens at specified time intervals. In the present case, a clock tick is sent
every 100 milliseconds, which is an arbitrary value set by the user. These periodic events trigger an
action that displays the system's information in a 3D text entity. This information gives a clue to
3Three-dimensional models are 3D Studio Max files.
[State diagram: from the "Idle" state, the gestures "Place Model3D" (Chair, Fridge, Plant), "Pick for rotation", "Pick for translation" and "Texture" (Brown, Grey, Red) lead to the "Placing", "Rotating", "Translating" and "Texturing" states respectively; "Place Image", "Undo", "Redo" and "Delete" are performed from "Idle".]
FIGURE 5.4. Experimental application's gesture dialogue
the user as to which state the interaction manager is in at every moment; in Figure 5.3 the system
is in "translating" mode. Providing this information helps users recover from errors, since there
would otherwise be no way of knowing whether a gesture was recognized correctly or whether an
incorrect gesture was inserted.
A gesture dialogue is proposed in order for a user to reuse known gestures for invoking many
actions. This dialogue is managed with gestures that adjust the interaction manager's state, as seen
in Figure 5.4, where new states are in italics. For instance, in order to place a new three-dimensional
model, a user needs to perform the "placeModel3D" gesture, which notifies the interaction manager
that the system now needs to recognize the gestures that can be performed in "placing" mode.
In order to exit the "placing" mode, a user executes a gesture that brings the interaction
manager back to the "idle" mode, or whatever was specified in the configuration. The same scheme
is employed for texturing, rotating and translating 3D objects. Since the interaction manager needs
to be in a specific state in order for actions to be applied to certain entities, gestures can be reused
for invoking multiple actions. For instance, placing a "chair" and texturing a model "brown" can
both be invoked using the same gesture without any conflict or misinterpretation. In addition, the
context grabber is used to reduce the number of gestures to be recognized at every moment. It
therefore allows users to define gestures that are similar for two actions applied in two
different interaction manager contexts.
For a complete reference on the XML configuration file used to configure the experimental
application, see Annex C.
5.4. Framework's Performance With a Large Number of Entities
The experimental application was used to test the performance of the entire software framework. In this particular application, it is of crucial importance that the framework keep a decent refresh rate for the OpenGL output modality [99], even if the virtual world is cluttered with a large number of world entities. Two restrictions must be taken into account when using world entities displayed on the screen in a three-dimensional OpenGL environment: the maximum number of polygons that can be rendered by the graphics card, and the maximum number of world entities that can be processed by the interaction manager every time an input token is emitted. Experiments were conducted to observe the effect of adding world entities to the virtual world on the OpenGL display frame rate and on the interaction manager's processing time.
[Figure: plot of frame rate versus number of entities (0-1200), with curves for "idle", "moving cursor" and "moving entity".]
FIGURE 5.5. Frame rate as a function of the number of entities (debug version)
The software was run on a Pentium M 1.8 GHz with 512 MB RAM and an ATI Mobile FireGL T2 with 128 MB RAM, running Windows XP Pro SP2, compiled with Microsoft Visual Studio .NET 2003 in debug and release versions, using an ACE High_Res_Timer [81] to measure the processing time. A Win32 timer triggers the rendering at a maximum frame rate of 50 Hz for both versions, as seen in Figures 5.5 and 5.6 when the number of entities is low. The "idle" curves of these two figures show
[Figure: plot of frame rate versus number of entities (0-1200), with curves for "idle", "moving cursor" and "moving entity".]
FIGURE 5.6. Frame rate as a function of the number of entities (release version)
the frame rate when the cursor is not moving, which means that the processing time is entirely spent on drawing the scene. In debug mode, the "idle" curve shows a significantly higher frame rate than when the cursor is moving. In the latter condition, the interaction manager has to process "moveCursor" input tokens, which entails finding out whether an action has to be applied to every entity. In addition, the gesture recognizer has to process incoming data to determine whether the sequence of incoming positions corresponds to a known gesture. This leads to a decreased frame rate, as observed on the "moving cursor" curve for the debug version of the program. When an entity has been picked for translation and is being moved around in the virtual space, an additional step has to be performed: validating the entities that may be moved. In fact, when an entity is picked, the interaction manager's state becomes "translating", which allows the "translate" action to be applied to "translatable" entities. In the present case, every entity owns the "translatable" behaviour; hence, they all need to be validated by the action. This supplementary "validateEntities" step takes a significant amount of time in debug mode, and keeps only the entity whose "picked" property is set to "true". This is why a lower frame rate can be observed on the "moving entity" curve. The debug version of the software is interesting because it shows which parts
of the interaction manager would benefit the most from optimization, namely, in the present case, the action's entity-validation method.
Unlike in the debug version, the three curves displaying the release version's performance in Figure 5.6 do not show a large difference between the three aforementioned cursor states. In fact, the performance drop caused by having picked an entity is negligible compared to the time needed to render the scene. The large difference between the debug and release versions of the software can be explained by the fact that the release version uses an optimized version of the C++ library, which includes the STL. Since the STL is used extensively in the software, performance increases accordingly compared to the debug version. The release version of the software keeps a frame rate above 24 Hz, below which humans notice choppiness, for up to approximately 850 entities. The entities used in the experiment are two-dimensional images made of 128x128-pixel textures displayed in an orthographic view.
[Figure: plot of processing time (0-70 ms) versus number of entities (0-1200), with curves for the debug and release versions.]
FIGURE 5.7. Interaction manager's processing time as a function of the number of entities
Figure 5.7 shows a plot of the processing time needed to parse a "moveCursor" token when an
entity is picked, as a function of the number of entities. The curve showing the performance of the
debug version appears to be linear with a slope of approximately 6 ms per 100 entities, the algorithm
being linear in terms of entities for a given number of actions. The release version of the software shows a curve that also tends to be linear, but with a much lower slope, far below 1 ms per 100 entities. This plot therefore shows that the software's limitation is not a matter of the number of actions and entities, but rather of the number of polygons to render, which could be improved with optimized OpenGL commands and textures [83].
While performing this experiment, it was noticed that the priorities of the different software threads play an important role in the visual user feedback. When using a mouse-based gesture recognizer and an OpenGL display, three concurrent threads run in the software: one that sends input tokens to the interaction manager (the input modality's thread), one that takes the input tokens, parses them and applies actions to the world entities (the interaction manager's thread), and another that renders the world in OpenGL displays (the main thread). The main display thread is triggered every 20 milliseconds by a Win32 timer, which aims to meet the 24 frames per second requirement. However, the interaction manager's thread needs a higher priority than the display thread in order to invoke the actions corresponding to the input events first, and only then draw the results. The opposite could have the effect of drawing non-updated entities on the screen while tokens wait in the queue to be processed. Likewise, the mouse thread needs a higher priority than the display thread, but a lower one than the interaction manager's, since it is more important to process input tokens than to collect new data.
This empirical adjustment of thread priorities removes the lag in the mouse cursor's position that could be observed when all threads had the same priority and a large number of entities was displayed. Instead of lag on the screen, the effect of too many input tokens being sent at the same time is that points are missing from the mouse trajectory. This is notably because mouse events originate from the main thread, which does not have the highest priority and drops mouse positions when too many of them accumulate in the queue.
5.5. General Discussion and Limitations
The continuous gesture recognition algorithm and the implemented software framework are obviously not perfect and have several limitations, which are described in this section. The gesture spotting algorithm succeeds at finding simple gestures in the input data stream, but more complex ones are hard to recognize, especially when they are composed of known simpler gestures. This
limitation prevents long and complex gestures from being recognized. However, natural gestures are almost never complex, so this limitation might not be that crucial for such an interface [70].
Another problem is that the gesture spotting algorithm occasionally introduces errors in the detection of the gesture starting point. This is notably due to the often incorrect assumption that the best hypothesis should be the longest gesture. The gesture spotting algorithm should also be more constrained in its acceptance of the best hypothesis. Preliminary user testing shows that users find insertions more annoying than deletions from the gesture stream. These are only qualitative results that should be revised and confirmed in the future with real experimental data. More feedback should also be provided to the user, for instance by showing which gestures are associated with which actions, or by showing the anticipated beginning of a spotted gesture.
As for the software framework, no critical performance limitation was observed while implementing the small application presented earlier in this chapter, nor for a version of the program that uses OpenSceneGraph [69] as an output modality. The author of the framework, however, wrote the additional actions and world entities himself, which may bias the evaluation in favour of the framework. With future use, more conclusions will be drawn on the flexibility and extensibility that the software framework offers.
Additionally, the "clockTick" input modality allows for time-stepped actions, which is interesting for animations or any operation that needs periodic updates. One problem that could be encountered is the lack of a scheduler in the interaction manager, as input tokens are parsed with a first-in-first-out (FIFO) strategy. A scheduler would allow selected input tokens to be parsed before others, as needed.
With larger applications comes the problem of a large XML configuration file. The software framework would benefit from a simple graphical user interface (GUI) allowing every component to be configured before starting the software. Such a GUI would also benefit from utilities for configuring the entities and every manager used in the application at runtime.
CHAPTER 6
Conclusion and Future Work
6.1. Conclusions
In this thesis, the problem of recognizing gestures using various input modalities was addressed. Standard hidden Markov models were used to recognize temporal sequences of gesture features. HMMs are extensively used in speech recognizers and are adaptable to gesture recognition, given an appropriate choice of feature vector. The most advantageous feature chosen is the angle of the vector describing the hand movement. LTI-Lib was used as an implementation of hidden Markov model data structures and algorithms. The training algorithm is the segmental K-means, which optimizes the HMM's parameters for the most likely state sequence only, rather than optimizing the model over every state sequence as the Baum-Welch algorithm does. The recognition algorithm is composed of three parts: hypothesis generation, hypothesis scoring (Viterbi algorithm) and pruning, and gesture spotting. Gesture spotting is necessary because recognition is continuous, meaning that both the start and end points of a gesture are unknown. A gesture is considered spotted when the most likely hypothesis fits criteria that take advantage of the HMM's state structure.
In order to allow multiple input and output modalities to be used in the general and flexible context of a virtual world model, a software framework was developed. This framework uses a generic data container to represent knowledge in all the modules of the data pipeline. The data flow is initiated by input tokens emitted every time an event occurs in an input modality. Input tokens contain all the information one needs to correctly analyze what happened in the real world and to execute the associated operations on the virtual world. In order to apply actions to the virtual world, input tokens are parsed by an interaction manager that, given several different constraints, decides which actions need to be applied to the corresponding world entities. These constraints are defined by a set of behaviours belonging to the world entities, which must match the associated action's behaviours and data for the action to be triggered. To be interactive, every world entity has associated behaviours that furnish properties, which are modified by actions and retrieved in order to affect the rendering process. This latter operation is performed by the output modalities, which call the rendering method of every world entity; each entity knows how to render itself for a specific output modality type. The rendering process is generic, thus allowing not only visual output but any kind of output modality to share the same data pipeline.
An experimental application was implemented to demonstrate the various concepts developed throughout the thesis. Several gesture input modalities (mouse, glove, vision) were also implemented in order to experiment with gesture interaction in virtual worlds. Basic experiments were conducted to test recognition rates under several conditions; they show that continuous gesture recognition is possible at a reasonable recognition rate, with on the order of 10 HMMs recognized at the same time.
6.2. Future Work
In terms of future work, it would be interesting to conduct user testing in order to establish whether gestures are usable as the sole modality when interacting with a virtual world. A speech input modality would also be interesting to integrate into the framework, to let users say a given command instead of performing a gesture. Such a multimodal system would use multimodal commands in order to take advantage of both modalities. In addition, a modality integrator would be necessary to manage and merge information coming from multiple input sources. Such work is currently being pursued in the SRE laboratory and is planned to be integrated in the near future.
In terms of gesture interaction, a richer gesture syntax would provide supplementary parameterization possibilities. The position and curvature of the fingers could be used to give extra information and context to the system, allowing more complex grammars to be used. Some work has already been done in that direction in the SRE laboratory, providing algorithms to segment the hand and detect fingertips. The remaining step is the integration of these algorithms into the current
framework, as well as an adaptation of the gesture recognizer so that it can recognize gestures parameterized with different finger positions. A more complex feature vector, and possibly recognition algorithm, would then be needed to adapt to the new input stream. Additional gesture-based input modalities could also be implemented. For example, systems such as those developed by Polhemus could allow for accurate three-dimensional position and rotation, in contrast with the current vision-based system, which has not yet reached the desired level of accuracy and speed.
Online gesture training could also be implemented in order to keep the gesture models up to date while the user is performing them. This additional training phase would probably improve the overall recognition rate and decrease the number of insertions. There should, however, be a way of indicating to the system when a gesture was not correctly recognized, so that the training gestures are only those confirmed to have been accurately spotted. Garbage models could also be trained, taking advantage of known incorrect gestures to discriminate wrong gestural expressions more easily.
Improved context grabbing could also be implemented in order to retrieve more data about the user's state, which would put additional constraints on the possible actions that can occur at any point in time. For example, a system using eye tracking would know where the user is looking, and would be able to restrict the actions to those associated with a given "target" object.
Enhanced network support could also be implemented using a middleware such as CORBA in order to manage objects remotely, transparently to the user. ORB services such as event channels or timing services could also be used, allowing synchronization between multiple virtual environment instances.
Finally, it is worth mentioning that the goal of this thesis was to prove that gestures can be used to control a virtual world, but there are several drawbacks to using only gestures. This is why the presented software framework was designed and implemented with the idea that one day it would be used for modalities other than gesture inputs. The next step in the research is to incorporate other input modalities such as speech, and to build an output modality that could be used in a CAVE immersive environment like the one owned by the SRE laboratory. This would be a step forward toward immersive computing.
REFERENCES
[1] Marcell Assan and Kirsti Grobel, Video-based sign language recognition using hidden Markov models, International Gesture Workshop on Gesture and Sign Language in Human-Computer Interaction, Springer, 1998, pp. 97-109.
[2] Thomas Baudel and Michel Beaudouin-Lafon, CHARADE: remote control of objects using free-hand gestures, Commun. ACM 36 (1993), no. 7, 28-35, ACM Press.
[3] François Bernier, Denis Poussart, Denis Laurendeau, and Martin Simoneau, Interaction-centric modelling for interactive virtual worlds: The APIA approach, 16th International Conference on Pattern Recognition (ICPR'02) Volume 3, IEEE Computer Society, 2002.
[4] Allen Bierbaum, Christopher Just, Patrick Hartling, Kevin Meinert, Albert Baker, and Carolina Cruz-Neira, VR Juggler: A virtual platform for virtual reality application development, Virtual Reality 2001 Conference (VR'01), IEEE Computer Society, 2001, p. 89.
[5] Jeff Bilmes, What HMMs can do, Tech. report, University of Washington, 2002.
[6] Henrik Birk and Thomas Baltzer Moeslund, Recognizing gestures from the hand alphabet using principal component analysis, Master's thesis, Aalborg University, Denmark, 1996.
[7] Michael J. Black and Allan D. Jepson, Recognizing temporal trajectories using the CONDENSATION algorithm, 3rd International Conference on Face & Gesture Recognition, 1998, pp. 16-21.
[8] Aaron F. Bobick and James W. Davis, The recognition of human movement using temporal templates, IEEE Trans. Pattern Anal. Mach. Intell. 23 (2001), no. 3, 257-267, IEEE Computer Society.
[9] Aaron F. Bobick and Yuri A. Ivanov, Action recognition using probabilistic parsing, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, 1998, pp. 196-202.
[10] Richard A. Bolt, "Put-That-There": Voice and gesture at the graphics interface, SIGGRAPH '80, 7th annual conference on Computer graphics and interactive techniques, ACM Press, 1980, Seattle, Washington, United States, pp. 262-270.
[11] Yves Boussemart, François Rioux, Frank Rudzicz, Mike Wozniewski, and Jeremy R. Cooperstock, A framework for 3d visualization and manipulation in an immersive space using an untethered bimanual gestural interface, ACM Symposium on Virtual Reality Software and Technology, ACM Press, 2004.
[12] Matthew Brand, Nuria Oliver, and Alex Pentland, Coupled hidden Markov models for complex action recognition, Conference on Computer Vision and Pattern Recognition (CVPR '97), IEEE Computer Society, 1997, pp. 994-999.
[13] Peter Bull, State of the art: Nonverbal communication, The Psychologist 14 (2001), 644-647.
[14] Lee W. Campbell, David A. Becker, Ali Azarbayejani, Aaron F. Bobick, and Alex Pentland, Invariant features for 3-d gesture recognition, Automatic Face and Gesture Recognition, 1996, pp. 157-163.
[15] Xiang Cao, An exploration of gesture-based interaction, Master's thesis, Department of Computer Science, University of Toronto, 2004.
[16] Xiang Cao and Ravin Balakrishnan, Evaluation of an online adaptive gesture interface with command prediction, Graphics Interface Conference, 2005, pp. 187-194.
[17] Michael Capps, Don McGregor, Don Brutzman, and Michael Zyda, NPSNET-V: A new beginning for dynamically extensible virtual environments, IEEE Comput. Graph. Appl. 20 (2000), no. 5, 12-15.
[18] Christer Carlsson and Olof Hagsand, DIVE - a multi user virtual reality system, IEEE Virtual Reality Annual International Symposium, 1993, pp. 394-400.
[19] Jeremy R. Cooperstock, Interacting in shared reality, HCI International, Conference on Human-Computer Interaction (Las Vegas), 2005 (to appear), http://www.cim.mcgill.ca/sre/publications/hci05.pdf.
[20] Immersion Corp., CyberGlove, http://www.immersion.com/3d/products/cyber_glove.php.
[21] ___, CyberGrasp, http://www.immersion.com/3d/products/cyber_grasp.php.
[22] Microsoft Corp., Raw input, http://msdn.microsoft.com/library/default.asp?url=/library/en-us/winui/winui/windowsuserinterface/userinput/rawinput.asp.
[23] Ascension Technology Corporation, Flock of birds, http://www.ascension-tech.com/products/flockofbirds.php.
[24] Carolina Cruz-Neira, Daniel J. Sandin, Thomas A. DeFanti, Robert V. Kenyon, and John C. Hart, The CAVE: audio visual experience automatic virtual environment, Commun. ACM 35 (1992), no. 6, 64-72, ACM Press.
[25] Ross Cutler and Matthew Turk, View-based interpretation of real-time optical flow for gesture recognition, Automatic Face and Gesture Recognition, 1998, pp. 416-421.
[26] Marek Czernuszenko, Dave Pape, Daniel Sandin, Tom DeFanti, Gregory L. Dawe, and Maxine D. Brown, The ImmersaDesk and infinity wall projection-based virtual reality displays, Computer Graphics 31 (1997), no. 2, 46-49.
[27] Andries van Dam, Post-WIMP user interfaces, Commun. ACM 40 (1997), no. 2, 63-67.
[28] Trevor Darrell and Alex P. Pentland, Space-time gestures, Conference on Computer Vision and Pattern Recognition, 1993, pp. 335-340.
[29] Konstantinos G. Derpanis, A review of vision-based hand gestures, Tech. report, York University, Toronto, Canada, 2004.
[30] Konstantinos G. Derpanis, Richard P. Wildes, and John K. Tsotsos, Hand gesture recognition within a linguistics-based framework, ECCV04, Springer, 2004, 3021, pp. 282-296.
[31] dimensionline, P5 glove, http://www.p5glove.com.
[32] Irfan A. Essa and Alex P. Pentland, Facial expression recognition using a dynamic model and motion energy, Fifth International Conference on Computer Vision, IEEE Computer Society, 1995, pp. 360-367.
[33] Andrew Fischer and Judy M. Vance, PHANToM haptic device implemented in a projection screen virtual environment, Workshop on Virtual environments 2003, ACM Press, 2003, Zurich, Switzerland, pp. 225-229.
[34] G. David Forney Jr., The Viterbi algorithm, Proc. IEEE 61 (1973), 268-278.
[35] Apache Software Foundation, Xerces-C++, http://xml.apache.org/xerces-c.
[36] Jean-Marc François, Jahmm, 2005, http://www.run.montefiore.ulg.ac.be/~francois/software/jahmm/.
[37] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides, Design patterns: elements of reusable object-oriented software, Addison-Wesley Longman Publishing Co., Inc., 1995.
[38] Zoubin Ghahramani, An introduction to hidden Markov models and Bayesian networks, Hidden Markov models: applications in computer vision, World Scientific Publishing Co., Inc., 2002, pp. 9-42.
[39] GHMM, 2004, http://www.ghmm.org/.
[40] Benjamin A. Goldstein, Tandem: A component-based framework for interactive, collaborative virtual reality, Master's thesis, University of Illinois, Chicago, USA, 2000.
[41] Chris Greenhalgh and Steve Benford, MASSIVE: a distributed virtual reality system incorporating spatial trading, 15th International Conference on Distributed Computing Systems (ICDCS'95), IEEE Computer Society, 1995, pp. 27-34.
[42] Chris Greenhalgh, Jim Purbrick, and Dave Snowdon, Inside MASSIVE-3: flexible support for data consistency and world structuring, Third international conference on Collaborative virtual environments (San Francisco, California, United States), ACM Press, 2000.
[43] Object Management Group, Unified modeling language, http://www.uml.org.
[44] Yves Guiard, Asymmetric division of labor in human skilled bimanual action: the kinematic chain as a model, Journal of motor behavior 19 (1987), no. 4, 486-517.
[45] Martin Hachet, Pascal Guitton, and Patrick Reuter, The CAT for efficient 2d and 3d interaction as an alternative to mouse adaptations, ACM symposium on Virtual reality software and technology (Osaka, Japan), ACM Press, 2003.
[46] Patrick Hartling, Allen Bierbaum, and Carolina Cruz-Neira, Tweek: Merging 2d and 3d interaction in immersive environments, 6th World Multiconference on Systemics, Cybernetics, and Informatics (Orlando, Florida), 2002.
[47] Ken Hinckley, Patrick Baudisch, Gonzalo Ramos, and François Guimbretière, Design and analysis of delimiters for selection-action pen gesture phrases in scriboli, SIGCHI conference on Human factors in computing systems (Portland, Oregon, USA), ACM Press, 2005.
[48] IGN Entertainment Inc., Planet Black and White, http://www.planetblackandwhite.com.
[49] Hiroshi Ishii and Brygg Ullmer, Tangible bits: towards seamless interfaces between people, bits and atoms, SIGCHI conference on Human factors in computing systems, ACM Press, 1997, Atlanta, Georgia, United States, pp. 234-241.
[50] Yoshio Iwai, Hiroaki Shimizu, and Masahiko Yachida, Real-time context-based gesture recognition using HMM and automaton, International Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems, IEEE Computer Society, 1999, pp. 127-134.
[51] Yoshio Iwai, Ken Watanabe, Yasushi Yagi, and Masahiko Yachida, Gesture recognition using colored gloves, International Conference on Pattern Recognition (ICPR '96) Volume I, IEEE Computer Society, 1996, pp. 662-666.
[52] Biing-Hwang Juang and Lawrence R. Rabiner, The segmental k-means algorithm for estimating parameters of hidden Markov models, IEEE Transactions on Acoustics, Speech and Signal Processing 38 (1990), no. 9, 1639-1641.
[53] John Kelso, Steven G. Satterfield, Lance E. Arsenault, Peter M. Ketchan, and Ronald D. Kriz, DIVERSE: a framework for building extensible and reconfigurable device-independent virtual environments and distributed asynchronous simulations, Presence: Teleoper. Virtual Environ. 12 (2003), no. 1, 19-36, MIT Press.
[54] Adam Kendon, Current issues in the study of gesture, The biological foundations of gestures: motor and semiotic aspects (1986), 23-47.
[55] Nils Krahnstover, Sanshzar Kettebekov, Mohammed Yeasin, and Rajeev Sharma, A real-time framework for natural multimodal interaction with large screen displays, ICMI, 2002, pp. 349-354.
[56] Robert M. Krauss, Yihsiu Chen, and Purnima Chawla, Nonverbal behavior and nonverbal communication: What do conversational hand gestures tell us?, Advances in experimental social psychology (M. Zanna, ed.), Tampa: Academic Press, 1996, pp. 389-450.
[57] Takeshi Kurata, Takashi Okuma, Masakatsu Kourogi, and Katsuhiko Sakaue, The hand mouse: GMM hand-color classification and mean shift tracking, IEEE ICCV Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems (RATFG-RTS'01), IEEE Computer Society, 2001.
[58] Marcus Vinicius Lamar, Hand gesture recognition using T-CombNET - a neural network model dedicated to temporal information processing, Ph.D. thesis, Nagoya Institute of Technology, 2001.
[59] Hyeon-Kyu Lee and Jin H. Kim, An HMM-based threshold model approach for gesture recognition, IEEE Trans. Pattern Anal. Mach. Intell. 21 (1999), no. 10, 961-973, IEEE Computer Society.
[60] David McNeill, Language and gesture, Cambridge University Press, Cambridge, 2000.
[61] Marielle Mokhtari, François Bernier, François Lemieux, Hugues Martel, Jean-Marc Schwartz, Denis Laurendeau, and Alexandra Branzan-Albu, Virtual environment and sensori-motor activities: Haptic, auditory and olfactory devices, The 12th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision (WSCG2004), vol. 1-3, Feb. 2-6, 2004, UNION Agency - Science Press, 2004, pp. 109-112.
[62] Darnell Janssen Moore, Vision-based recognition of actions using context, Ph.D. thesis, Georgia Institute of Technology, Atlanta, GA, 2000.
[63] Mozilla, Mouse gestures, http://optimoz.mozdev.org/gestures/.
[64] Martin Naef, Edouard Lamboray, Oliver Staadt, and Markus Gross, The blue-c distributed scene graph, Workshop on Virtual environments 2003 (Zurich, Switzerland), ACM Press, 2003.
[65] Chan Wah Ng and Surendra Ranganath, Real-time gesture recognition system and application, Image Vision Comput. 20 (2002), no. 13-14, 993-1007.
[66] Jakob Nielsen, Heuristic evaluation, Usability inspection methods, John Wiley & Sons, Inc., 1994, pp. 25-62.
[67] Kenji Oka, Yoichi Sato, and Hideki Koike, Real-time fingertip tracking and gesture recognition, IEEE Comput. Graph. Appl. 22 (2002), no. 6, 64-71.
[68] OMG, CORBA, http://www.corba.org.
[69] OpenSceneGraph, http://www.openscenegraph.org.
[70] Vladimir I. Pavlović, Rajeev Sharma, and Thomas S. Huang, Visual interpretation of hand gestures for human-computer interaction: A review, IEEE Trans. Pattern Anal. Mach. Intell. 19 (1997), no. 7, 677-695, IEEE Computer Society.
[71] Vicon Peak, Motion capture systems, http://www.vicon.com.
[72] Polhemus, Tracking systems, http://www.polhemus.com.
[73] Francis K. H. Quek, Eyes in the interface, IVC 13 (1995), no. 6, 511-525.
[74] ___, Unencumbered gestural interaction, IEEE MultiMedia 3 (1996), no. 4, 36-47, IEEE Computer Society Press.
[75] Francis K. H. Quek, Xin-Feng Ma, and Robert Bryll, A parallel algorithm for dynamic gesture tracking, International Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems, IEEE Computer Society, 1999, pp. 64-69.
[76] Lawrence R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE, vol. 77, 1989, pp. 257-286.
[77] Lawrence R. Rabiner and Biing-Hwang Juang, Introduction to hidden Markov models, IEEE ASSP 3 (1986), no. 1, 4-16.
[78] Gerhard Rigoll, Andreas Kosmala, and Stefan Eickeler, High performance real-time gesture recognition using hidden Markov models, International Gesture Workshop on Gesture and Sign Language in Human-Computer Interaction, Springer-Verlag, 1997, pp. 69-80.
[79] RWTH-Aachen, LTI-Lib, 2005, http://ltilib.sourceforge.net/doc/homepage/index.shtml.
[80] Kingsley Sage, A. Jonathan Howell, and Hilary Buxton, Developing context sensitive HMM
gesture recognition, Gesture Workshop, 2003, pp. 277-287.
[81] Douglas C. Schmidt, ACE adaptive
http://www.cs.wustl.edu/''-'schmidt/ACE.html.
communication environ ment,
[82] Atid Shamaie and Alistair Sutherland, Accurate recognition of large number of hand gestures,
2nd Iranian Conference on Machine Vision and Image Processing, 2003.
[83] Dave Shreiner, Bob Kuehne, Thomas True, and Brad Grantham, Performance OpenGL:
Platform-independent techniques, SIGGRAPH '04 Course, 2004.
[84] Sandeep Singhal and Michael Zyda, Networked virtual environments: design and implemen
tation, ACM Pressj Addison-Wesley Publishing Co., 1999.
[85] Opera Software, Mouse gestures in Opera, http:j jwww.opera.comjfeaturesjmousej.
[86] Thad Starner and Alex Pentland, Real-time American Sign Language recognition from video
using hidden Markov models, International Symposium on Computer Vision, IEEE Com
puter Society, 1995, pp. 265-270.
[87] William C. Stokoe, Sign language structure: an outline of the visual communication systems
of the American deaf, Linstock Press, 1960.
[88] Josephine Sullivan and Stefan Carlsson, Recognizing and tracking human action, 7th European
Conference on Computer Vision, Springer-Verlag, 2002, pp. 629-644.
[89] Donald Tanguay, Hidden Markov models for gesture recognition, Master's thesis, MIT, 1995.
[90] Russell M. Taylor, Thomas C. Hudson, Adam Seeger, Hans Weber, Jeffrey Juliano, and
Aron T. Helser, VRPN: a device-independent, network-transparent VR peripheral system,
VRST, 2001, pp. 55-61.
[91] HTK Team, HTK speech recognition toolkit, 2004, http://htk.eng.cam.ac.uk/.
[92] SensAble Technologies, Haptic devices, http://www.sensable.com/.
[93] Henrik Tramberend, Avocado: A distributed virtual reality framework, IEEE Virtual Reality,
IEEE Computer Society, 1999, pp. 14-21.
[94] Matthew Turk, Perceptual user interfaces, Frontiers of human-centred computing, online
communities and virtual environments, Springer-Verlag, London, UK, 2001, pp. 39-51.
[95] ___ , Gesture recognition, Handbook of virtual environments: Design, implementation,
and applications (K. M. Stanney, ed.), Lawrence Erlbaum Associates, 2002, pp. 223-238.
[96] Minh Tue Vo, A framework and toolkit for the construction of multimodal learning interfaces,
Ph.D. thesis, Carnegie Mellon University, 1998.
[97] Christian Vogler and Dimitris Metaxas, A framework for recognizing the simultaneous aspects
of American Sign Language, Comput. Vis. Image Underst. 81 (2001), no. 3, 358-384.
[98] Willie Walker, Paul Lamere, Philip Kwok, Bhiksha Raj, Rita Singh, Evandro Gouvea, Peter
Wolf, and Joe Woelfel, Sphinx-4: A flexible open source framework for speech recognition,
Tech. report, Sun Microsystems, 2004.
[99] Colin Ware and Ravin Balakrishnan, Reaching for objects in VR displays: lag and frame
rate, ACM Trans. Comput.-Hum. Interact. 1 (1994), no. 4, 331-356.
[100] Kent Watsen and Michael Zyda, Bamboo - a portable system for dynamically extensible,
real-time, networked, virtual environments, Virtual Reality Annual International Symposium,
IEEE Computer Society, 1998.
[101] Alan Wexelblat, Research challenges in gesture: Open issues and unsolved problems, International
Gesture Workshop on Gesture and Sign Language in Human-Computer Interaction,
vol. 1371, Springer-Verlag, 1997, pp. 1-11.
[102] Andrew Wilson, Adaptive models for gesture recognition, Ph.D. thesis, MIT, 2000.
[103] Yaser Yacoob and Michael J. Black, Parameterized modeling and recognition of activities,
Comput. Vis. Image Underst. 73 (1999), no. 2, 232-247, Elsevier Science Inc.
[104] Ming Hsuan Yang, Narendra Ahuja, and Mark Tabb, Extraction of 2d motion trajectories
and its application to hand gesture recognition, IEEE Trans. Pattern Anal. Mach. Intell. 24
(2002), no. 8, 1061-1074.
[105] ZeroC, Object oriented middleware, http://www.zeroc.com/.
[106] Jörg Zieren, Nils Unger, and Suat Akyol, Hands tracking from frontal view for vision-based
gesture recognition, 24th DAGM Symposium Pattern Recognition, Lecture Notes in Computer
Science, vol. 2449, Springer, 2002, pp. 531-539.
APPENDIX A
XML Notation
XML (eXtensible Markup Language) is a text format that employs the tag/attribute (or markup)
metaphor in order to represent tree-like data structures. Unlike HTML, tags are not imposed, but
defined by users. Document type definition (DTD) or XML Schema (XSD) specifications define how
data should be structured in a file. With modern parsers such as Xerces [35], it is possible to validate
an XML file, given a definition file, in order to ensure consistency of the data representation and
report semantic errors in the validated file.
In the framework's implementation, Xerces is used as an XML parser, while the document object
model (DOM) stores the XML data internally. DOM converts XML data into a tree-like structure
in which each XML data quantum is a DOM Node. The nodes have children and parents, as well as
attributes. To clarify the notation, here is an example of a node and its attributes in XML format:
<nodeName attribute1="value 1" attribute2="value 2">
<childNode>parsed character data</childNode>
</nodeName>
In the previous example, a node with the tag name "nodeName" has two attributes whose values
are in string format. In fact, every XML node, whatever its data type, is represented as a character
string; an XML document is therefore readable by a human. The "childNode" tag has "nodeName"
as its parent, and its parsed character data can be retrieved by the user. XML is obviously not the
most compact data format, but it offers much more flexibility since the format is known and no
deserialization is needed in order for data to be extracted from a stream.
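As a rough illustration of the tree structure the DOM exposes, the following C++ sketch hand-builds the node of the example above. The Node type is a simplified stand-in written for this sketch, not the actual Xerces DOM API.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Minimal stand-in for a DOM node: a tag name, parsed character data,
// a map of attributes and a list of children. Every value is stored as
// a character string, whatever its logical type.
struct Node {
    std::string tag;
    std::string text;                         // parsed character data
    std::map<std::string, std::string> attrs;
    std::vector<Node> children;
};

// Builds the tree corresponding to the XML example above.
Node buildExample() {
    Node root;
    root.tag = "nodeName";
    root.attrs["attribute1"] = "value 1";
    root.attrs["attribute2"] = "value 2";

    Node child;
    child.tag = "childNode";
    child.text = "parsed character data";
    root.children.push_back(child);
    return root;
}
```

A validating parser such as Xerces produces an equivalent tree automatically from the text form.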
APPENDIX B
Implemented Components
B.l. Actions
B.1.1. Action "moveCursor". Assigns the input token's position parameter to the world
entity's property.
• expected properties in the world entity: "position": Vector<AData>
• expected parameter in the input token: "position": Vector<AData>
B.1.2. Action "reset". Sets the "render" attribute to its opposite value, for entities that
have the "deletable" behaviour.
• expected properties in the world entity: needs to be the "World" object
• expected parameters in the input token: none
B.1.3. Action "traceCursor". Pushes the "position" vector retrieved from the input
token onto the world entity's position list and pops the front item if the list's size is larger than the
"length" property value.
• expected properties in the world entity: "positionList": List<Vector<AData> >,
"length": 1 nteger
• expected parameters in the input token: "position": Vector<AData>
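The bounded-trail update described above can be sketched as follows; the free-standing function and the use of plain double vectors in place of Vector<AData> are illustrative assumptions, not the framework's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

typedef std::vector<double> Position;

// Sketch of the "traceCursor" update: append the newest cursor position
// to the trail, then drop the oldest one once the trail grows beyond the
// "length" property value.
void traceCursor(std::list<Position> &positionList,
                 const Position &position, std::size_t length) {
    positionList.push_back(position);
    if (positionList.size() > length)
        positionList.pop_front();
}
```

Feeding positions in faster than they are dropped keeps the trail at a fixed length, which is what produces the fading-trail effect of the mouse trajectory entity.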
B.1.4. Action "delete". Sets the "render" attribute of the entity to "false".
• expected properties in the world entity: none
• expected parameters in the input token: none
B.1.5. Action "placeImage2D". Creates a new world entity of type "image2D" using a
factory, then sets its name given an instance number. Adds to the newly created entity the behaviours
"deletable", "translationPickable" and "translatable". Properties "ID", "fileName", "position" and
"picked" are added to the new entity, where the "fileName" originates from the action data map
and "position" from the input token.
• expected properties in the world entity: needs to be the "World" object
• expected parameters in the input token: "name": St ri ng (optional), "applicationPoints":
Map<AData, AData>
B.1.6. Action "pick". Sets the validated entity's "picked" property to true.
• expected properties in the world entity: "picked": Boolean, "position": Vector<AData>
• expected parameters in the input token: "applicationPoints": Map<AData, AData>
• validation: performs an OpenGL picking to determine which entity the cursor is on, given
the application points. Tries to grab the lock on the picked entity
B.1.7. Action "drop". Sets the entity's "picked" property to "false" if it was "true" and
releases the lock on the picked entity.
• expected properties in the world entity: "picked": Boolean
• expected parameters in the input token: none
B.1.8. Action "translate". Assigns the input token's "position" attribute to the world
entity's "position" property, which may be two- or three-dimensional.
• expected properties in the world entity: "picked": Boolean, "position": Vector<AData>
• expected parameters in the input token: "position": Vector<AData>
• validation: an entity is valid if the "picked" property is set to "true"
B.1.9. Action "rotate". Assigns the input token's "position" attribute to the world en-
tity's "rotation" property with a given mapping.
• expected properties in the world entity: "picked": Boolean, "rotation": Vector<AData>
• expected parameters in the input token: "position": Vector<AData>
• validation: an entity is valid if the "picked" property is set to "true"
B.1.10. Action "undo". Calls the "undoLastAction" method on the interaction manager
object.
• expected properties in the world entity: none
• expected parameters in the input token: none
B.1.11. Action "redo". Calls the "redoLastAction" method on the interaction manager
object.
• expected properties in the world entity: none
• expected parameters in the input token: none
B.1.12. Action "showSystemInformation". Assigns the interaction manager's state to
the world entity's "text" property.
• expected properties in the world entity: "text": String
• expected parameters in the input token: none
B.1.13. Action "stateChange". Does not perform anything; it is only used for a state
change in the interaction manager.
B.1.14. Action "placeModel3D". Creates a new world entity of type "model3D" using
a factory, then sets its name given an instance number. Adds to the newly created entity the
behaviours "deletable", "translationPickable", "rotationPickable", "translatable", "rotatable" and
"texturable". Properties "ID", "fileName", "position", "rotation", "textures" and "picked" are
added to the new entity, where the "fileName" originates from the action data map and "position"
from the input token.
• expected properties in the world entity: needs to be the "World" object
• expected parameters in the input token: "name": String (optional), "applicationPoints":
Map<AData, AData>
B.1.15. Action "putTexture". Given the picking results, sets the object's texture to the
file name found in the action's data map and unlocks the entity.
• expected properties in the world entity: "textures": Vector<String> , "picked":
Boolean, "position": Vector<AData>
• expected parameters in the input token: "applicationPoints": Map<AData, AData>
• validation: performs an OpenGL picking to determine which entity the cursor is on, given
the application points. Tries to grab the lock on the picked entity
B.2. World Entities
B.2.1. Image 2D. Displays a two-dimensional image in an orthographic OpenGL view
("Display2D"). The dimensions of the texture (height and width) should be a power of two. The
rendering method first loads the texture whose file name is specified in the "fileName" property of type
String. It also places the loaded texture at the specified "position" property of type Vector<AData>.
B.2.2. Model 3D. Displays a 3D Studio Max model in a three-dimensional perspective
view ("Display3D"). The rendering method tries to load the 3D model whose file name is specified
in the "fileName" property of type String. The "position": Vector<AData> property is used to place
the 3D model in the 3D space, whereas the "rotation": Vector<AData> is used to set the rotation
angles, specified in degrees around each of the three axes. The "selected": Boolean property is used
to draw a bounding box around the 3D model, if set to "true". The "textures": Vector<String> is
used to apply textures on the sub-objects of the 3ds model. The OpenSceneGraph version acts the
same way, but displays any kind of supported model in a 3D OpenSceneGraph output modality.
B.2.3. Mouse Cursor. Displays a virtual mouse cursor in an orthographic view ("Dis-
play2D"). The rendering method places the cursor on the screen according to the "position":
Vector<AData> property. It also assigns the cursor a color set by the "color": Vector<AData>
property. The shape of the cursor is a filled circle. There is also an OpenSceneGraph version of the
mouse cursor.
B.2.4. Mouse Trajectory. Displays a mouse fading trail in an orthographic view ("Dis-
play2D"). Lines are drawn from the mouse cursor according to the "positionList":
Vector<Vector<AData> > with an alpha parameter decreasing to zero for the last segment. The
"color": Vector<AData> property specifies the color in which the fading trail is displayed. There is
also an OpenSceneGraph version of the mouse trajectory.
B.2.5. Text 3D. Displays a three-dimensional string of text in a "Display3D". The character
string is set by the property "text": String, whereas the font file name is specified by the property
"fontName": String. The location of the text is set by the "position": Vector<AData> property.
There is also an OpenSceneGraph version of the 3D text.
B.3. Input Modalities
B.3.1. Glove-based Gesture Recognition. Interfaces a P5 glove whose input data can
be recorded or played back.
Emitted tokens:
• "moveCursor", which contains the "position": Vector<AData> parameter
• "still", emitted after half a second of absence of movement with a threshold specified in
the configuration file. The token parameters contain the gesture application point
B.3.2. Mouse-based Gesture Recognition. Interfaces a mouse with the RAWInput
API, whose input data can be recorded and played back.
Emitted tokens:
• "moveCursor", which contains the "position": Vector<AData> parameter
• "still", emitted after half a second of absence of movement. The token parameters contain
the gesture application point
B.3.3. Vision-based Gesture Recognition. Interfaces the video tracker which sends
hand positions through the network. Those positions can be recorded and played back.
Emitted tokens:
• "moveCursor", which contains the "position": Vector<AData> parameter
• "still", emitted after half a second of absence of movement with a threshold specified in
the configuration file. The token parameters contain the gesture application point
B.4. Output Modalities
B.4.1. Display 2D. Sets an orthographic view in screen coordinates.
B.4.2. Display 3D. Sets a perspective view with a viewing angle of 30 degrees. Translates
the entire world, such that the (0, 0) coordinate is at the center of the screen.
B.4.3. Open Scene Graph Display 2D. Adds an orthographic projection branch to the
rendering graph in order to add world entities that should be displayed on the screen plane.
B.4.4. Open Scene Graph Display 3D. Adds a position-attitude transform node to the
rendering graph.
APPENDIX C
Sample XML Configuration File
<MMF>
<InputModalities>
<InputModality name="mouseGestures" type="DynamicGestures">
<AXML>
<map name="instance">
<string name="type" value="MouseBasedHMMGestureRecognizer"/>
<map name="data">
<string name="dataFile" value="mouseGesture_large.1"/>
<int name="smoothingBuffer" value="3"/>
<int name="buffer" value="100"/>
<bool name="useContext" value="true"/>
<string name="logFile_" value="mouseGesture_recognitionTest1.log.13"/>
<string name="logDirection" value="input"/>
<bool name="logNeedSleep" value="true"/>
<bool name="training" value="true"/>
</map>
</map>
<string name="mode" value="events"/>
<int name="frameRate" value="30"/>
<string name="logFile_" value="test.log"/>
</AXML>
</InputModality>
</InputModalities>
<OutputModalities>
<OutputModality name="canvas" type="Display2D">
<AXML>
<int name="frameRate" value="30"/>
</AXML>
</OutputModality>
<OutputModality name="userInterface" type="Display2D">
<AXML>
<int name="frameRate" value="30"/>
<string name="behaviour" value="drawOnTop"/>
</AXML>
</OutputModality>
<OutputModality name="3DEnvironment" type="Display3D">
<AXML>
<int name="frameRate" value="30"/>
</AXML>
</OutputModality>
</OutputModalities>
<World>
<Hook name="Grid">
<AXML>
<int name="lineNumber" value="50"/>
</AXML>
</Hook>
<Behaviours>
<Behaviour name="resetable"/>
<Behaviour name="placeable"/>
<Behaviour name="undoable"/>
<Behaviour name="redoable"/>
<Behaviour name="textureEnabled"/>
</Behaviours>
<WorldEntity name="text" type="text3D">
<AXML>
<string name="fontName" value="Teen.ttf"/>
<string name="text" value="State information"/>
<vector name="position">
<int name="x" value="-100"/>
<int name="y" value="100"/>
<int name="z" value="-400"/>
</vector>
</AXML>
<Behaviours>
<Behaviour name="informable"/>
</Behaviours>
</WorldEntity>
<WorldEntity name="chair" type="model3D">
<AXML>
<string name="fileName" value="..\Data\Models\chair.3ds"/>
<vector name="position">
<int name="x" value="0"/>
<int name="y" value="200"/>
<int name="z" value="0"/>
</vector>
<vector name="rotation">
<int name="x" value="0"/>
<int name="y" value="0"/>
<int name="z" value="0"/>
</vector>
</AXML>
<Behaviours>
<Behaviour name="deletable"/>
<Behaviour name="selectable">
<AXML>
<bool name="selected" value="false"/>
</AXML>
</Behaviour>
<Behaviour name="translationPickable"/>
<Behaviour name="translatable">
<AXML>
<string name="ID" value="mouse0"/>
</AXML>
</Behaviour>
<Behaviour name="rotationPickable"/>
<Behaviour name="rotatable">
<AXML>
<string name="ID" value="mouse0"/>
</AXML>
</Behaviour>
<Behaviour name="texturable"/>
</Behaviours>
</WorldEntity>
<WorldEntity name="image" type="image2D">
<AXML>
<string name="fileName" value="..\Data\test.bmp"/>
<vector name="position">
<int name="x" value="150"/>
<int name="y" value="150"/>
</vector>
</AXML>
<Behaviours>
<Behaviour name="selectable"/>
<Behaviour name="translatable">
<AXML>
<string name="ID" value="mouse0"/>
</AXML>
</Behaviour>
<Behaviour name="translationPickable"/>
<Behaviour name="scalable"/>
<Behaviour name="deletable"/>
</Behaviours>
</WorldEntity>
<WorldEntity name="cursorLeft" type="mouseCursor">
<Behaviours>
<Behaviour name="mouseMovable">
<AXML>
<string name="ID" value="mouse0"/>
</AXML>
</Behaviour>
<Behaviour name="drawOnTop"/>
</Behaviours>
<AXML>
<map name="color">
<int name="r" value="255"/>
<int name="g" value="255"/>
<int name="b" value="0"/>
</map>
</AXML>
</WorldEntity>
<WorldEntity name="trajectoryLeft" type="mouseTrajectory">
<AXML>
<map name="color">
<int name="r" value="255"/>
<int name="g" value="255"/>
<int name="b" value="0"/>
</map>
<int name="length" value="100"/>
</AXML>
<Behaviours>
<Behaviour name="mouseTraceable">
<AXML>
<string name="ID" value="mouse0"/>
</AXML>
</Behaviour>
<Behaviour name="drawOnTop"/>
</Behaviours>
</WorldEntity>
<WorldEntity name="trajectoryRight" type="mouseTrajectory">
<AXML>
<map name="color">
<int name="r" value="0"/>
<int name="g" value="255"/>
<int name="b" value="0"/>
</map>
<int name="length" value="100"/>
</AXML>
<Behaviours>
<Behaviour name="mouseTraceable">
<AXML>
<string name="ID" value="mouse1"/>
</AXML>
</Behaviour>
<Behaviour name="drawOnTop"/>
</Behaviours>
</WorldEntity>
<WorldEntity name="cursorRight" type="mouseCursor">
<AXML>
<map name="color">
<int name="r" value="0"/>
<int name="g" value="255"/>
<int name="b" value="0"/>
</map>
</AXML>
<Behaviours>
<Behaviour name="mouseMovable">
<AXML>
<string name="ID" value="mouse1"/>
</AXML>
</Behaviour>
<Behaviour name="drawOnTop"/>
</Behaviours>
</WorldEntity>
</World>
<Grammar>
<Action type="reset" activationToken="omega" when="idle">
<Behaviour name="resetable"/>
</Action>
<Action type="placeMarker2D_" activationToken="placePoint" when="idle">
<Behaviour name="placeable"/>
</Action>
<Action type="stateChange" activationToken="coeur" when="idle" becomes="placing">
<Behaviour name="placeable"/>
</Action>
<Action type="placeModel3D" activationToken="croix" when="placing">
<AXML>
<string name="ID" value="mouse0"/>
<string name="fileName" value="..\Data\Models\chair.3ds"/>
</AXML>
<Behaviour name="placeable"/>
</Action>
<Action type="placeModel3D" activationToken="triangle" when="placing">
<AXML>
<string name="ID" value="mouse0"/>
<string name="fileName" value="..\Data\Models\fridge.3ds"/>
</AXML>
<Behaviour name="placeable"/>
</Action>
<Action type="placeModel3D" activationToken="carre" when="placing">
<AXML>
<string name="ID" value="mouse0"/>
<string name="fileName" value="..\Data\Models\plant01.3ds"/>
</AXML>
<Behaviour name="placeable"/>
</Action>
<Action type="stateChange" activationToken="coeur" when="placing" becomes="idle">
<Behaviour name="placeable"/>
</Action>
<Action type="placeImage2D" activationToken="croix" when="idle">
<AXML>
<string name="ID" value="mouse0"/>
<string name="fileName" value="..\Data\test.bmp"/>
</AXML>
<Behaviour name="placeable"/>
</Action>
<Action type="delete" activationToken="delete" when="idle">
<Behaviour name="deletable">
<AXML>
<vector name="position"/>
</AXML>
</Behaviour>
</Action>
<Action type="moveCursor" activationToken="moveCursor" when="any">
<Behaviour name="mouseMovable">
<AXML>
<vector name="position">
<int name="x"/>
<int name="y"/>
</vector>
</AXML>
</Behaviour>
</Action>
<Action type="traceCursor" activationToken="moveCursor" when="any">
<Behaviour name="mouseTraceable">
<AXML>
<list name="positionList">
<vector name="position">
<int name="x"/>
<int name="y"/>
</vector>
</list>
</AXML>
</Behaviour>
</Action>
<Action type="translate" activationToken="moveCursor" when="translating">
<Behaviour name="translatable">
<AXML>
<vector name="position"/>
</AXML>
</Behaviour>
</Action>
<Action type="rotate" activationToken="moveCursor" when="rotating">
<Behaviour name="rotatable">
<AXML>
<vector name="rotation"/>
</AXML>
</Behaviour>
</Action>
<Action type="pick" activationToken="white" when="idle" becomes="translating">
<Behaviour name="translationPickable">
<AXML>
<vector name="position"/>
<bool name="picked"/>
</AXML>
</Behaviour>
</Action>
<Action type="pick" activationToken="carre" when="idle" becomes="rotating">
<Behaviour name="rotationPickable">
<AXML>
<vector name="position"/>
<bool name="picked"/>
</AXML>
</Behaviour>
</Action>
<Action type="stateChange" activationToken="texture" when="idle" becomes="texturing">
<Behaviour name="textureEnabled"/>
</Action>
<Action type="stateChange" activationToken="texture" when="texturing" becomes="idle">
<Behaviour name="textureEnabled"/>
</Action>
<Action type="putTexture" activationToken="croix" when="texturing">
<AXML>
<string name="textureName" value="..\Data\Models\03700447.bmp"/>
</AXML>
<Behaviour name="texturable">
<AXML>
<vector name="textures"/>
</AXML>
</Behaviour>
</Action>
<Action type="putTexture" activationToken="start" when="texturing">
<AXML>
<string name="textureName" value="..\Data\Models\dchrfab.bmp"/>
</AXML>
<Behaviour name="texturable">
<AXML>
<vector name="textures"/>
</AXML>
</Behaviour>
</Action>
<Action type="putTexture" activationToken="stop" when="texturing">
<AXML>
<string name="textureName" value="..\Data\Models\couch.bmp"/>
</AXML>
<Behaviour name="texturable">
<AXML>
<vector name="textures"/>
</AXML>
</Behaviour>
</Action>
<Action type="drop" activationToken="still" when="translating" becomes="idle">
<Behaviour name="translationPickable">
<AXML>
<vector name="position"/>
<bool name="picked"/>
</AXML>
</Behaviour>
</Action>
<Action type="drop" activationToken="still" when="rotating" becomes="idle">
<Behaviour name="rotationPickable">
<AXML>
<vector name="position"/>
<bool name="picked"/>
</AXML>
</Behaviour>
</Action>
<Action type="drop" activationToken="drop" when="any" becomes="idle">
<!-- This action resets the current action status of the entity: it
sets its status to idle and does nothing more... -->
</Action>
<Action type="undo" activationToken="pointe" when="idle" becomes="idle">
<Behaviour name="undoable"/>
</Action>
<Action type="redo" activationToken="pointe_inv" when="idle" becomes="idle">
<Behaviour name="redoable"/>
</Action>
<Action type="showSystemInformation" activationToken="systemInfo" when="any">
<Behaviour name="informable">
<AXML>
<string name="text"/>
</AXML>
</Behaviour>
</Action>
<Action type="stateChange" activationToken="white" when="placing">
<Behaviour name="placeable"/>
<!-- Just a dummy action that acts in fact as a garbage model... -->
</Action>
<Action type="stateChange" activationToken="white" when="texturing">
<Behaviour name="textureEnabled"/>
<!-- Just a dummy action that acts in fact as a garbage model... -->
</Action>
</Grammar>
<Network>
<Connection type="server" port="76849"/>
</Network>
</MMF>
APPENDIX D
User Manual
D.l. Introduction
The current software framework is intended to provide multimodal applications with a virtual
world model and standard interfaces for input and output modalities. Several different modalities
were implemented, namely mouse, glove and vision-based gesture recognition systems on the input
side, and OpenGL and OpenSceneGraph on the output side. This manual first describes the
prerequisites and installation instructions. Directions on how to adapt the framework for specific needs
are then presented, while stating concrete examples throughout the description.
D.2. Prerequisites
The software framework depends on several freely available software libraries, which provide
classes that implement mechanisms for improved generality. The following list states those
prerequisites:
• ACE OS Wrapper: C++ library that acts as an operating system wrapper in order for
users to write OS-independent code. ACE provides classes for threads, network sockets
as well as many other functions that would not otherwise be standard on every operating
system
• Xerces-C++: portable XML reader that provides a DOM representation of a file, and
utilities in order to retrieve the different parameters and attributes
• LTI-Lib: portable C++ library that implements mathematical operations commonly used
in computer vision and artificial intelligence
• OpenGL: portable environment for developing 2D and 3D graphics applications
• FTGL: portable library used to display fonts in a three-dimensional OpenGL window
• AData: portable C++ library that acts as a common generic data format in the entire
framework
• Server (optional): C library employed for communication between the vision-based tracker
available in the SRE and the corresponding input modality implemented in the current
framework
• OpenThreads, Producer and OpenSceneGraph (optional on Windows): set of portable
C++ libraries that implement a scene graph and display system in order to represent data
that is to be rendered by OpenGL. It also provides utilities needed to manage the mouse
and keyboard interfaces
D.3. Installation
The software framework was tested on Windows XP and the Fedora Core 3 Linux distribution.
The general components are compiled and linked in a shared library in order to be used by your
application, but the compilation process that will be described in the following sections is different
for the two platforms.
D.3.1. Microsoft Windows Installation. Prerequisites: Download, compile and install
the aforementioned libraries. Be sure to add to the "PATH" environment variable every directory
in which the binary files of each library are located.
Software framework:
(1) Get the source code, project and data files from the CVS repository
(2) Open the solution "MultimodalFramework.sln" with Microsoft Visual Studio .NET or
later1
(3) Choose "Build", "Batch build", select every project and click "Build"
(4) No compilation or linking errors should be encountered; if any occur, check that the
"include" and "library" paths include the directories associated with the different
prerequisites. Those settings can be added in the menu: "Tools", "Options",
"Projects", "VC++ Directories"
1 Visual Studio 6.0 will not compile the LTI-Lib because of its heavy use of templates, which are not correctly supported in this old version.
(5) Two applications are available that can be set as the start-up project: an MFC-based
user interface (Project "MultimodalFramework") or an OpenSceneGraph-based software
(Project "OSGFramework")
(6) Press "F5" for debugging the application or "CTRL-F5" to run the program
D.3.2. Linux Installation. Prerequisites: Download, compile and install the aforemen-
tioned libraries. If the installed shared libraries are not located in standard directories, be sure to
add their paths to the "LD_LIBRARY_PATH" environment variable, or to the "/etc/ld.so.conf" file
and run the "ldconfig" command subsequently.
Software framework:
(1) Get the source code, Makefiles and data files from the CVS repository
(2) Type "make all" in order to build every implemented shared and dynamic library as well
as an OpenSceneGraph-based application
(3) If there are compilation or linking errors, check the include and library paths that should
be composed of the prerequisite libraries' paths
(4) In order to run the application, change the directory to "OSGFramework" and run
"./OSGFramework"
D.4. How to Extend the Framework?
You have to adapt several software components, summarized in Table D.1, for your specific
applications.
D.4.1. Input Modality. In order to initiate the data pipeline, you have to instantiate input
modalities. The currently implemented input modality is a continuous dynamic gesture recognizer
that interfaces a mouse, a data glove and a video-based tracker. Other examples of input modalities
are a static gesture recognizer, a speech recognizer, a gaze tracker or any other device that would
provide information on the user's status.
In order to specialize an input modality, the following methods are available for overloading:
• init(const Data::AData &data): provides initialization data that you specify either from an
XML file or hard coded values
• fini(): acts as a termination method in order to clean up dynamically allocated objects
Component | Purpose | Overloaded methods (italic = abstract)
Input modality | Get data from logical entities, being in the real or virtual world | init, emitToken, fini, start, stop
Output modality | The output side of the stream, instantiated in order to render virtual data in the real world | init, fini, renderWorld
World entity | Logical entity representing a virtual object that owns properties and shows behaviours | init, fini, render
Action | Logical entity that acts on world entities by changing their properties according to the action's specifications | doApply, validateEntities
World hook | Contains data that is rendered by output modalities, on which input modalities do not have any influence | init, render
TABLE D.1. Summary of software components that need to be specialized
• start(): typically activates the input modality's thread
• stop(): typically terminates the input modality's thread
• emitToken(InputToken *token): normally called to set modality-specific parameters in the
input token and add it to the interaction manager's queue. The input token is created in
the modality's thread, from which the emitToken call originates
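The shape of such a specialization can be sketched as follows. The base class and the token type below are simplified stand-ins written for this sketch; the framework's real InputModality takes Data::AData arguments, runs its own thread, and pushes tokens to the interaction manager's queue.

```cpp
#include <cassert>
#include <queue>
#include <string>

// Stand-in for the framework's InputToken (the real one carries AData
// parameters rather than just a name).
struct InputToken { std::string name; };

// Stand-in base class listing the five overloadable methods above.
class InputModality {
public:
    virtual ~InputModality() {}
    virtual void init(const std::string &data) = 0;  // initialization data
    virtual void fini() = 0;                          // cleanup
    virtual void start() = 0;                         // activate the thread
    virtual void stop() = 0;                          // terminate the thread
    virtual void emitToken(InputToken *token) = 0;    // queue a token
};

// A toy specialization that "emits" tokens into a local queue instead of
// the interaction manager's queue, and only while started.
class ToyRecognizer : public InputModality {
public:
    std::queue<InputToken> emitted;
    bool running;
    ToyRecognizer() : running(false) {}
    void init(const std::string &) {}
    void fini() {}
    void start() { running = true; }
    void stop()  { running = false; }
    void emitToken(InputToken *token) {
        if (running) emitted.push(*token);
    }
};
```

In the real framework the thread started by start() is the one that creates the token and calls emitToken.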
D.4.2. Output Modality. In order to render virtual objects in the real world, you have to
instantiate output modalities. The currently implemented output modalities are 2D and 3D views,
using either pure OpenGL primitives or OpenSceneGraph for data structure and rendering man-
agement. Other possible output modalities are sound output, haptics devices or more sophisticated
projection systems.
In order to specialize an output modality, the following methods are available for overloading:
• init(const Data::AData &data): provides initialization data that you specify either from an
XML file or hard coded values
• fini(): acts as a termination method in order to clean up dynamically allocated objects
• renderWorld(): typically invoked to perform entity-independent operations that must be
performed before calling the render method on every world entity, which is the default
behaviour of this method
D.4.3. World Entity. A virtual world is composed of world entities that are to be added
during the initialization stage or at runtime by corresponding actions. An entity is defined as a
logical representation of an object. It contains named properties in the form of AData, which are
generic data containers, removing the type constraint. Typical world entities can be a 3D model, a
2D image, a mouse cursor, a 3D sound object or a bumpy virtual surface.
In order to specialize a world entity, the following methods are available for overloading:
• init(const Data::AData &data): provides initialization data that you specify either from an
XML file or hard coded values
• fini(): used as a termination method in order to clean up dynamically allocated objects
• render(OutputModality *modality): performs the function calls needed to render the world
entity in the given output modality. A dynamic_cast operation is typically performed on
the modality pointer in order to discover in which modality type the rendering calls are to
be made
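A minimal sketch of such an entity follows. The types here are simplified stand-ins for the framework's headers, and ImageEntity and View2D are hypothetical examples used to illustrate the dynamic_cast convention:

```cpp
#include <map>
#include <string>

// Simplified stand-ins for the framework's generic property container and modalities.
namespace Data { using AData = std::map<std::string, std::string>; }
struct OutputModality { virtual ~OutputModality() = default; };
struct View2D : OutputModality { std::string lastDrawn; };   // hypothetical 2D view

// Hypothetical specialization: a 2D image with named, untyped properties.
class ImageEntity {
public:
    void init(const Data::AData &data) { properties_ = data; }   // properties from XML or code
    void fini() {}                                               // release resources here
    void render(OutputModality *modality) {
        // Discover which concrete modality the rendering calls should target.
        if (auto *view = dynamic_cast<View2D*>(modality))
            view->lastDrawn = properties_["file"];               // 2D-specific drawing
        // A 3D view, a sound output, etc. would be handled by further casts.
    }
private:
    Data::AData properties_;
};
```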
D.4.4. Action. Actions are instantiated in order for data originating from input modalities
to affect the world entities such that they will influence properties, given data packed in an input
token. Typical examples of actions can be placing a 3D model, translating an entity, rotating an
entity and changing the color or texture of a model.
In order to specialize an action, the following methods are available for overloading:
• doApply(WorldEntity *entity, InputToken *token): modifies the entity's properties according
to data located in the input token. The doApply method is called if the entity's behaviours
correspond to the action's requirements, which is verified in the interaction manager. A
more detailed description of this process is beyond the scope of this user manual and can
be found in Section 4.6
• validateEntities(std::list&lt;WorldEntity*&gt; &entitiesList, InputToken *token): used to select
amongst a list of possible entities the ones to which the action has to be applied, given
an input token. A validation method call typically occurs when entities from the virtual
world have to be picked by a user pointing to a specific location
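The two methods might be specialized as in the sketch below. The InputToken and WorldEntity definitions are simplified stand-ins, and TranslateAction, along with its dx/dy fields and selectable flag, is a hypothetical example:

```cpp
#include <list>

// Simplified stand-ins for the framework's token and entity types.
struct InputToken { double dx = 0, dy = 0; };
struct WorldEntity { double x = 0, y = 0; bool selectable = true; };

// Hypothetical specialization: an action that translates an entity by a token's delta.
class TranslateAction {
public:
    int doApply(WorldEntity *entity, InputToken *token) {
        // Invoked once the interaction manager has verified that the entity's
        // behaviours correspond to the action's requirements.
        entity->x += token->dx;
        entity->y += token->dy;
        return 0;                                    // 0 = success, by the framework's convention
    }
    void validateEntities(std::list<WorldEntity*> &entitiesList, InputToken * /*token*/) {
        // Keep only the entities the action may be applied to, given the token.
        entitiesList.remove_if([](WorldEntity *e) { return !e->selectable; });
    }
};
```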
D.4.5. World Hook. A virtual world might contain static data, which is to be rendered
in the real world. A typical example of a world hook could be the virtual room in which you are
located, or a background sound over which you do not have any influence.
In order to specialize a world hook, the following methods are available for overloading:
• init(const Data::AData &data): provides initialization data to the hook that you specify
either from an XML file or hard coded values
• render(OutputModality *modality): renders the hook in the specified output modality. A
dynamic_cast operation on the modality parameter is typically performed in order to ensure
that the hook can be rendered
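As a short sketch, a background-sound hook might look as follows; the types are simplified stand-ins, and BackgroundSoundHook and SoundOutput are hypothetical examples:

```cpp
#include <map>
#include <string>

// Simplified stand-ins for the framework's types.
namespace Data { using AData = std::map<std::string, std::string>; }
struct OutputModality { virtual ~OutputModality() = default; };
struct SoundOutput : OutputModality { std::string playing; };  // hypothetical sound modality

// Hypothetical specialization: a background sound the user has no influence over.
class BackgroundSoundHook {
public:
    void init(const Data::AData &data) { file_ = data.at("file"); }  // from XML or hard-coded
    void render(OutputModality *modality) {
        // Ensure the hook can actually be rendered in this modality.
        if (auto *sound = dynamic_cast<SoundOutput*>(modality))
            sound->playing = file_;
    }
private:
    std::string file_;
};
```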
D.5. Putting it All Together
In order to use the framework effectively, the specialized components detailed above have to be
instantiated, initialized and started. In order to facilitate this process, you can write an XML file
that specifies the type and parameters of every software component. The only object that you need
to instantiate explicitly is one of type Instance. A call to the init method with an XML file path
as an argument then creates and initializes the components. A subsequent call to the start method
activates the input modalities, hence making the data flow through the pipeline until reaching the
output modalities. There is however one subtle technicality with the latter class objects. In the
majority of display software, the rendering methods are always called from the main loop. Therefore,
you must first retrieve the instantiated output modalities from the OutputManager object and then
call the renderWorld method on each of them, in the main program loop.
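The overall flow described above can be sketched as follows. The Instance, OutputManager and OutputModality definitions here are minimal stand-ins (including the member names used to reach the modality list); in the real framework these objects are created and wired from the XML configuration file:

```cpp
#include <string>
#include <vector>

// Minimal stand-ins for the framework's classes and their assumed member names.
struct OutputModality {
    bool rendered = false;
    void renderWorld() { rendered = true; }         // would draw the world in this modality
};
struct OutputManager {
    std::vector<OutputModality*> modalities;
};
struct Instance {
    OutputManager outputManager;
    int init(const std::string &xmlPath) {          // creates and initializes the components
        (void)xmlPath;                              // the real init parses the XML file
        outputManager.modalities = { new OutputModality, new OutputModality };
        return 0;                                   // 0 = success by convention
    }
    int start() { return 0; }                       // activates the input modalities
};

// One iteration of the main program loop: render every output modality.
int runFrame(Instance &instance) {
    for (OutputModality *m : instance.outputManager.modalities)
        m->renderWorld();                           // must be called from the main loop
    return 0;
}
```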
Object instantiation relies on factories, which know how to create concrete instances. The
specific factories have to overload the following methods: createInputModality, createOutputModality,
createWorldEntity and createAction, each taking a character string as an argument and returning
a pointer to the newly created object or the NULL pointer. A user-defined factory is registered
automatically in the factory manager as soon as one instance of it is created.
For increased flexibility, you can build shared libraries that will be loaded dynamically at
runtime. It is particularly interesting to employ that scheme in order to limit the dependencies
between the different software modules. The object factories try to open a shared library that has
the same name as the requested object type.² When the library loads successfully, the factory looks
for the ObjectFactory symbol that you must define using the GENERIC_FACTORY_MACRO(CHILD,
PARENT) macro, where CHILD is the child class type and PARENT the parent type, which is to be returned
by the ObjectFactory function.
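The factory scheme can be illustrated with the following sketch. The registry and class names here are simplified stand-ins, not the framework's actual factory manager; what it demonstrates is the convention above: a type-name string maps to a creator, the creator returns a parent-class pointer (or NULL on an unknown type), and a user-defined factory registers itself as soon as one instance is created:

```cpp
#include <map>
#include <string>

// Simplified stand-ins: a parent type and one hypothetical concrete child.
struct WorldEntity { virtual ~WorldEntity() = default; };
struct ImageEntity : WorldEntity {};

using EntityCreator = WorldEntity *(*)();
std::map<std::string, EntityCreator> g_factoryManager;   // stands in for the factory manager

// Hypothetical user-defined factory: registration happens in the constructor,
// so creating one instance is enough to register it.
struct MyFactory {
    MyFactory() {
        g_factoryManager["ImageEntity"] = []() -> WorldEntity * { return new ImageEntity; };
    }
    // Corresponds to overloading createWorldEntity: return the new object, or NULL.
    WorldEntity *createWorldEntity(const std::string &type) {
        auto it = g_factoryManager.find(type);
        return it == g_factoryManager.end() ? nullptr : it->second();
    }
};
```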
D.6. How to Use the Gesture Recognizer?
A continuous gesture recognizer was implemented as a proof of concept for the current framework.
Since it is the most important feature that has to be controlled interactively, this section
shows you how to train gesture models and how to obtain reasonable recognition results afterwards.
D.6.1. Training Procedure. In order to interact with the software, you use the keyboard
as an input to the gesture trainer. The available commands are summarized in Table D.2.
Command  Parameters            Purpose
f        File name (optional)  When a parameter is specified, sets the gesture database file name to the provided parameter; otherwise, prints the current file name
l        none                  Lists the currently available models and IDs
m        Model ID (optional)   When a parameter is specified, sets the current model ID to the provided parameter; otherwise, prints the current model ID
o        none                  Opens the file whose name was specified by the file name command and loads the gesture models
r        none                  Resets the current model training data
R        none                  Resets all models and training data
s        none                  Saves the models in the file whose name is specified by the set file name command
t        on/off (optional)     Toggles the training/recognition status or sets it to the value specified as a parameter
?        none                  Displays a help message describing all the available commands
return   none                  Starts or stops the training of a gesture whose model ID was specified by the corresponding command
TABLE D.2. Summary of the gesture trainer keyboard interface commands
The training stage that you must perform in order to obtain gesture models is managed
through the preceding commands. You add a new gesture with the "m" command, start capturing
data by pressing the <RETURN> key, and press <RETURN> again to stop the capture and perform a training
² A "d" is concatenated at the end of the class name if the _DEBUG flag is set.
pass. Normally, you should perform on the order of twenty gestures to acquire reliable models. The
output console shows the average score that results from performing the recognition on all
the training sequences provided, along with the number of gesture samples involved in training the current
model. To ensure a properly trained model, check that the score is not stuck in a local minimum, and
continue adding training sequences until it converges to a reasonable value. You should also make
sure that you train the models with sufficiently sparse data, such that a minimal number of errors
will occur during the recognition stage. The following describes the steps that should be followed in
order to achieve a complete training procedure:
• Type the command "R" in order to reset the current models
• Repeat until every gesture model is trained:
• Type "m <gestureID>" in order to set the current gesture ID
• Repeat until the current model has converged:
• Press <RETURN> to start capturing positions
• Perform the actual gesture with the position capturing device
• Press <RETURN> again to stop capturing positions
• Type "f <filename>" to set the file name in which the models will be saved
• Type "s" to save the models in the file
When the models have already been saved in a file and you want to load them in your gesture
database, you have to follow this procedure:
(1) If you do not want to keep the gestures that are already in your database, type the
command "R", which resets and removes every model
(2) Type "f <filename>" to set the file name that you want to load
(3) Type "o", which will open the file and load the gesture models in your database
D.6.2. Recognition Procedure. After having trained your gesture models, you are now
ready to perform recognition with positions that you continuously provide. The recognition
algorithm will spot gestures over time and emit tokens when appropriate conditions are met. You should
issue the command "t off" in order to initiate the recognition algorithm.
In order to obtain the best recognition results from your sequences:
• choose gestures that are dissimilar for multiple actions that can be applied concurrently
to one object.
• when a deletion error occurs, repeat the gesture several times. If it is still not being
spotted, move away from the target point and return to repeat the gesture. Moving away
should reset the wrong hypotheses that were confusing the recognizer.
• when a substitution or insertion error occurs and results in an incorrect action invocation,
use the undo feature.
• when the environment is cluttered with several objects, ensure that the gesture is performed
precisely on top of the object to which you want to apply the action. Since gesture spotting
is performed automatically, the starting point is sometimes falsely identified, which results
in the wrong entity being selected.
D.7. Troubleshooting and Advice
Several runtime errors can occur when an instance is started. The best way to retrieve these
is to track them in the output console, which displays several messages as to where an error could
have occurred. There is also a convention in the return value of several interface methods that you
should observe carefully. Most of the interface methods return an integer value that contains an
error code. A value of "0" means that the method completed correctly. A value of "-1" means
that an error occurred and that you should abort the current procedure and fix it. Other values are
reserved for additional error codes, for which a convention should be applied.
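The convention can be sketched as follows; the function names are hypothetical and stand in for any framework interface method that follows the 0 / -1 return-value rule:

```cpp
// Hypothetical calls following the framework's return-value convention:
// 0 means success, -1 means an error occurred and the procedure should abort.
int initModality(bool deviceAvailable) { return deviceAvailable ? 0 : -1; }
int startModality() { return 0; }

// Typical checking pattern: stop at the first -1 and propagate it upward,
// so the caller can abort the current procedure and fix the problem.
int setUp(bool deviceAvailable) {
    if (initModality(deviceAvailable) == -1) return -1;
    if (startModality() == -1) return -1;
    return 0;
}
```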
Logging facilities are also provided in order to output data coming from input modalities to a
file. Reading back the file and using the data as input provides a way to capture data only once and
to tweak the different parameters and grammar thereafter.
Finally, when you create new classes that extend the framework, be sure to respect the
standards that were set by the author. Review the code that implements input and output modalities,
actions, world entities and world hooks, and adopt the same coding style. The source code is amply
commented, which should guide you in writing applications that are based on the current framework.