Sign Recognition by Hand Tracking


University of Victoria

Faculty of Engineering

Tracking hand motion with a color dotted glove for sign language recognition

ASM SaeedulAlam

Electrical Engineering


September 4, 2012


TABLE OF CONTENTS

List of Tables and Figures
Summary
1.0 Introduction
2.0 American Sign Language (ASL)
3.0 Hand Tracking Technologies
3.1 Tracking with interface
3.1.1 Optical Tracking
3.1.1.1 Marker Systems
3.1.1.2 Silhouette Analysis
3.1.2 Magnetic tracking
3.1.3 Acoustic tracking
3.2 Glove Tracking
4.0 Interpreting sign language
4.1 Using a recurrent neural network
4.2 Using hidden Markov model
5.0 Proposed Methodology
5.1 Glove Design
5.1.1 Glove vs. Bare hand tracking
5.2 Rasterizing the frame
5.3 Color and pixel correcting
5.4 Indexing the library database
5.5 Matching and tracking
5.6 Pose estimation and finding the nearest neighbor
5.7 Blending nearest neighbor
5.8 Recognizing ASL
6.0 Discussion
7.0 Conclusion
8.0 Recommendation
References


LIST OF TABLES AND FIGURES

FIGURES

Figure 2.0: Usage of ASL
Figure 3.1.1.2: Krueger’s manipulation of graphics by hand
Figure 4.1.1: Sign language word recognition system by using recurrent neural network
Figure 4.1.2: Recurrent Neural Networks
Figure 4.2: The four states HMM used for recognition
Figure 5.1: Glove Design
Figure 5.1.1: Bare hand estimation and edge detection
Figure 5.6: Hausdorff-like image distance
Figure 5.8: Interpretation of sign language alphabets

TABLES

Table 5.3: Estimating neighboring color
Table 6.1: Technical details comparison
Table 6.2: Comparison of advantage and limitations


Summary

Sign language is an essential communication tool for deaf people. Sign language use is not associated with a specific ethnicity, location, or even household. Rather, people learn American Sign Language (ASL) because they are deaf, hearing impaired or, less commonly, speech impaired, or because they have family or friends who sign. Because sign language is not practiced in all walks of life, a person who relies on it faces difficulties in everyday conversations. To address this problem, hand and finger gestures can be tracked, recognized as sign language by a computer system, and then translated through a voice output device.

For the last two decades, hand-tracking systems have been widely used in industrial applications, virtual reality and medicine, but their expense and complexity have limited their deployment to regular consumers. The purpose of this report is to narrow the communication gap between deaf and hearing people by developing an inexpensive and simple hand tracking system that can be used for interpreting and translating sign language. This report focuses on ASL because it is practiced around the world. Different hand tracking technologies and two possible strategies for interpreting ASL are discussed and compared in terms of their advantages and limitations. Based on that comparison, the report proposes a consumer hand tracking system that uses a real-time, data-driven pose estimation technique with only a webcam and a polymer glove carrying a specific pattern. The glove design and simple algorithms make it possible to employ a nearest-neighbor approach to track hands at interactive rates. The tracked motion can then be interpreted and translated using a sign language recognition library.


Glossary

Silhouette: The image of a person, an object or a scene represented as a solid shape of a single color, usually black, with its edges matching the outline of the subject.

Virtual Reality: Virtual reality (VR) is a term that applies to computer-simulated environments that can simulate physical presence in places in the real world, as well as in imaginary worlds.

Neural networks: The term neural network was traditionally used to refer to a network or circuit of biological neurons. The modern usage of the term often refers to artificial neural networks, which are composed of artificial neurons or nodes.

Rasterisation: The task of taking an image described in a vector graphics format (shapes) and converting it into a raster image (pixels or dots) for output on a video display or printer, or for storage in a bitmap file format.

Degree of Freedom (DOF): In a hand model, the number of degrees of freedom is the number of independent parameters (global position and orientation plus joint angles) needed to specify a pose.

Hausdorff distance: The Hausdorff distance measures how far two subsets of a metric space are from each other. It turns the set of non-empty compact subsets of a metric space into a metric space in its own right. It is named after Felix Hausdorff.

Gaussian radial basis kernel: A real-valued function whose value depends only on the distance from a center point and falls off as a Gaussian of that distance.


1.0 Introduction

Conveying meaning using hand shapes, body language and expression, otherwise known as sign language, is an essential tool for visual, hearing and speech impaired people to communicate with others. Because of its varied patterns and complexity, a person who excels at sign language often cannot communicate with a listener who does not know it. Because sign language is not practiced in all walks of life, a person who relies on it faces difficulties in everyday conversations. Speech synthesizers or generators (e.g. text to speech) are widely used by people with visual impairment or reading disabilities, but communicating with a listener through this type of device is rarely fluent and requires a significant amount of time. Computer recognition of sign language can enable communication with hearing, visual or speech impaired people. Articulated finger tracking systems have been widely used in professional and scientific settings, but they are rarely developed for consumer applications because of their price and complexity.

So far, hand tracking systems have been developed using different technologies, e.g. optical tracking of LED or infrared-reflecting markers, image-based visual tracking, magnetic tracking, acoustic tracking, LED gloves and digital data entry gloves. This report discusses different methods of finger tracking that can be used for computer recognition of sign language. Based on facts and research, it then proposes a simple but effective system for real-time tracking of hand motion that requires only a webcam and a glove with color markers placed in a custom pattern. A database library is created for reference, and an estimated pose is deduced from it to confirm the track.

2.0 American Sign Language (ASL)

American Sign Language (ASL) is the language of choice for most deaf people in the United States, Canada and Africa [1]. ASL is a sign language in which the hands, arms, head, facial expression and body language are used to speak without sound. ASL features an entirely different grammar and vocabulary from spoken languages such as English [2]. Although the number of ASL users is unknown, there were 2.5 million deaf people in the United States in the year 2000 who were dependent on ASL [1]. The availability of ASL throughout the world is shown in Figure 2.0.

ASL's grammar allows more flexibility in word order than English and sometimes uses redundancy for emphasis. ASL uses approximately 6000 gestures for common words and communicates obscure words or proper nouns through finger spelling [2]. Because of ASL’s availability and ease of use, this report adopts ASL as its target sign language.

[Map legend: ASL is the national sign language; ASL is used alongside other sign languages; insignificant use of ASL]

Figure 2.0: Usage of ASL [2]

3.0 Hand Tracking Technologies

The tracking system is developed with a focus on the interaction between the user and data. The objective is to establish communication between human and computer and to synchronize the virtual data with the hand gestures. Different types of tracking technologies have been used so far to track the 3D position of the hand and to capture the finger configuration. The history of hand tracking goes back to the post-WWII development of master-slave manipulator arms and, earlier still, to the development of the pantograph during the Renaissance [3]. This section discusses different types of tracking technologies, with emphasis on their advantages and limitations. These technologies can be divided into interface tracking and glove technologies. Tracking with interface uses optical, magnetic, or acoustic sensing to determine the 3-space position of the hand. Glove technologies use an electromechanical device fitted over the hand and fingers to determine hand shape [4].

3.1 Tracking with interface

In this system, the hand is tracked by following the orientation of the hand and the configuration of the fingers. Position tracking can be done using the following three technologies [5]:

Optical tracking, using a single camera or multiple cameras from a certain distance.

Magnetic tracking, radiating a magnetic pulse from a fixed source.

Acoustic tracking, using triangulation of ultrasonic waves to locate the hand.

3.1.1 Optical Tracking

In optical tracking, small markers are placed on the major bone segments of the body. The markers may be infrared-emitting LEDs or reflective dots. A single camera or multiple cameras capture the motion of the subject along with the markers. The software locates each marker in 2D image coordinates and triangulates across views to calculate the 3D position of each marker [6]. Another method uses a single camera to capture the silhouette image of the subject, which is analyzed to determine the positions of the various parts of the body and the user's gestures.
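The triangulation step can be illustrated with two cameras. The following is a minimal sketch, assuming each camera has already converted a marker's 2D image position into a viewing ray (camera center plus direction); the function name `triangulate_midpoint` and the sample geometry are hypothetical, and commercial systems typically rely on calibrated projection matrices and more than two views.

```python
def triangulate_midpoint(c1, d1, c2, d2):
    """Estimate a 3D marker position from two camera rays.

    Each ray is given by a camera center c and a direction d (3-tuples).
    Returns the midpoint of the shortest segment joining the two rays.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    w = tuple(p - q for p, q in zip(c1, c2))   # vector from c2 to c1
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b                       # zero when the rays are parallel
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; the marker cannot be triangulated")
    t1 = (b * e - c * d) / denom                # parameter along ray 1
    t2 = (a * e - b * d) / denom                # parameter along ray 2
    p1 = tuple(p + t1 * q for p, q in zip(c1, d1))
    p2 = tuple(p + t2 * q for p, q in zip(c2, d2))
    return tuple((x + y) / 2 for x, y in zip(p1, p2))
```

With noisy measurements the two rays rarely intersect exactly, which is why the sketch returns the midpoint of their closest approach rather than an intersection point.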

3.1.1.1 Marker Systems

Flashing infrared LEDs have been widely used as markers in the medical and entertainment industries. On each hand, markers are placed on each “operative” finger. Cameras capture each marker and measure its position. Two types of marker system have been developed to record the motion of the limbs:

Infrared LED systems such as Selspot® [7], Op-Eye®, and Optotrak® [8].

Reflective marker systems such as Elite® and Vicon Avalon® [9].

Limitations

1. High processing time is required to analyze several camera images and to determine each marker's 3D position [6].

2. A complex algorithm is needed to infer the pose [7].

3. Multiple cameras are needed to resolve ambiguities when markers coincide in the visual field.

4. The inability to resolve such ambiguities restricts the use of marker systems for tracking fingers in interactive applications.

3.1.1.2 Silhouette Analysis

Silhouette analysis of an image can easily distinguish body parts such as the head, legs, arms and fingers. Myron Krueger successfully analyzed complex motions in real time by processing silhouette images with custom hardware. Based on this technique, he developed a wide collection of interactions and games that required no gloves or goggles. The movements and actions were integrated into his system, called Videoplace® [10]. Inspired by Krueger’s work, Pierre Wellner developed DigitalDesk® [11]. The idea behind DigitalDesk is to mount a video camera above an ordinary physical desk, pointing down at the work surface. By processing the camera output, the system can determine when the user points (using an LED-tipped pen) or gestures above a real or projected object. This allows the user to run and edit a projected text file or a calculator by making gestures (figure).

Figure 3.1.1.2: Krueger’s manipulation of graphics by hand, fingertips controlling a spline curve [10].

Limitations

1. Consumer-grade cameras usually have low frame rates (24 fps to 60 fps), which makes it difficult to track rapidly moving fingers.

2. The poor resolution (less than 300 dpi) of consumer-grade cameras makes it difficult to locate the fingers as they occlude each other and are occluded by the hand [5].

3. Complex algorithms and techniques are needed to interpret complex real-time motions.

3.1.2 Magnetic tracking

Magnetic tracking technology is quite robust and widely used for tracking one or both hands [12]. Magnetic tracking uses a source element radiating a magnetic field and a small sensor that reports its position and orientation with respect to the source. Unlike optical and acoustic systems, magnetic systems do not rely on line-of-sight observation. However, metallic objects in the environment distort the magnetic field and produce erroneous readings. Magnetic systems also require a cable attachment to a central device (as do LED and acoustic systems) [5]. Polhemus FASTRAK® and Ascension Technology's trakSTAR® provide various multi-source, multi-sensor magnetic systems that track a number of points at up to 100 Hz over ranges from 3 to 20 feet [13], [14].

3.1.3 Acoustic tracking

Acoustic tracking uses high-frequency sound to triangulate a source within the work area. Most systems, such as those from Logitech [15] and the Mattel Power Glove [16], send out pings from the source (usually mounted on the hand) that are received by microphones in the environment. Precise placement of the microphones allows the system to locate the source in space to within a few millimeters. These systems rely on line-of-sight between the source and the microphones, and can suffer from acoustic reflections if surrounded by hard walls or other acoustically reflective surfaces. Multiple acoustic trackers must operate at non-conflicting frequencies, a strategy also used in magnetic tracking [5].

3.2 Glove Tracking

Motion tracking with gloves that are instrumented with sensors, or that emit or reflect infrared light, gives accurate results [12]. These techniques also give real-time results but are expensive and may constrain the possible hand movements. Inspired by Rich Sayre’s work on the world’s first data glove [17], Thomas et al. developed an inexpensive, lightweight glove using flexible tubes with a light source at one end and a photocell at the other. The voltage from each photocell was correlated with the finger configuration [5]. In 1983, Gary Grimes developed the Digital Data Entry Glove, the first glove designed to recognize sign language. He used a cloth glove with numerous sensors sewn in at specific positions to track finger movement [18]. Thomas Zimmerman’s DataGlove® [19], the Dexterous HandMaster® (DHM) [20] and the VPL® DataGlove [19] are prominent examples of modern glove tracking technologies. The advantages of glove technologies are fast response time, minimal environmental restrictions, availability in industry and minimal data loss once occlusion is identified. On the other hand, reliance on software for data resolution and high expense are the main limitations [6].

4.0 Interpreting sign language

Sign language recognition from static and dynamic hand gestures has been an active area of research for the last two decades. While there are many different types of gestures, the most structured sets belong to the sign languages. In sign language, where each gesture already has an assigned meaning, strong rules of context and grammar can be applied to make recognition tractable. To date, most work on sign language recognition has employed expensive “datagloves” that tether the user to a stationary machine [21] or computer vision systems limited to a calibrated area [22]. Current successful gesture recognition systems are based on computer vision technology and Virtual Reality (VR) [23]. VR glove-based gesture recognition systems use a VR glove to extract a sequence of 3D hand configuration sets that contain finger orientation angles, and use various neural network structures [24] or Hidden Markov Models (HMM) [25] to recognize the 3D motion data as gestures.

4.1 Using a recurrent neural network

An artificial neural network (ANN) can be defined as a massively parallel distributed processor made up of simple processing units (Figure 4.1), which has a natural propensity for storing experiential knowledge and making it available for use [24]. An ANN consists of many interconnected processing elements (figure) [25] and is used for identification and control of gestures, game playing and decision making, pattern recognition and medical diagnosis [26]. An ANN is also capable of adaptive self-organization [25].

Manar [27] used two recurrent neural network architectures to recognize static hand gestures in Arabic Sign Language (ArSL): Elman recurrent neural networks and fully recurrent neural networks (Figure 4.1.2). A digital camera and a colored glove were used to obtain the input image data. RGB color classification was used to segment the video frames. Thirty features of the segmented hand image were then extracted and grouped to represent a single image. Angles and distances were measured between the fingertips and the wrist. 900 colored images were used as the training set and 300 colored images for testing. The results showed that the fully recurrent neural network (recognition rate 95.11%) performed better than the Elman neural network (89.67%) [27].
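To make the Elman architecture more concrete, the following is a minimal sketch of a single forward step, in which the hidden layer is fed back as context for the next step. It is an illustration under stated assumptions, not the network from [27]: the function name `elman_step`, the `tanh` activation and the weight layout are hypothetical, and training is omitted.

```python
import math


def elman_step(x, h_prev, w_in, w_rec, w_out):
    """One time step of an Elman network.

    x      : input feature vector
    h_prev : hidden (context) activations from the previous step
    w_in   : hidden-by-input weight matrix
    w_rec  : hidden-by-hidden weight matrix (context feedback)
    w_out  : output-by-hidden weight matrix
    Returns (h, y): the new hidden state and the output scores.
    """
    h = [
        math.tanh(
            sum(w * xi for w, xi in zip(w_in[j], x))
            + sum(w * hi for w, hi in zip(w_rec[j], h_prev))
        )
        for j in range(len(w_in))
    ]
    y = [sum(w * hj for w, hj in zip(row, h)) for row in w_out]
    return h, y
```

In a recognizer of the kind described above, `x` would hold the extracted hand features and the largest entry of `y` would indicate the predicted sign; a fully recurrent network differs in that feedback connections are not limited to the hidden layer.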

Figure 4.1.1: Sign language word recognition system by using recurrent neural network [25]

[Flowchart boxes: Start; Data Glove; Verifying the start point (neural network for posture recognition); Sign language recognition (recurrent neural network); Verifying the sampling endpoint (history); Result; End]

Figure 4.1.2: Recurrent Neural Networks (panels: fully recurrent neural network; Elman recurrent neural network)

4.2 Using Hidden Markov Model

A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov chain [Figure] with unobserved (hidden) states. A Hidden Markov Model can be defined by [28]:

A set of states $C$ that includes an initial state $C_1$ and a final state $C_2$.

The transition probability matrix $M = \{m_{ij}\}$, where $m_{ij}$ is the probability of taking the transition from state $i$ to state $j$.

The output probability matrix $N = \{n_j(A_k)\}$ for a discrete HMM, or $N = \{n_j(x)\}$ for a continuous HMM, where $A_k$ stands for a discrete observation symbol and $x$ stands for continuous observations of k-dimensional random vectors.

For a discrete HMM, $m_{ij}$ and $n_j(A_k)$ have the following properties:

$m_{ij} \ge 0, \quad n_j(A_k) \ge 0$

$\sum_j m_{ij} = 1$

$\sum_k n_j(A_k) = 1$

If the initial state distribution is $T = \{T_i\}$, an HMM can be written in a compact notation that represents the complete parameter set of the model:

$\lambda = (M, N, T)$

HMMs are widely used in speech recognition, gesture recognition and signal processing systems. HMMs provide a framework for modeling the dynamic dependencies and correlations between 3-D measurements. The dynamic dependencies are modeled implicitly by a Markov chain with a specified number of hidden states. The initial topology for an HMM can be determined by estimating how many different states are involved in specifying a sign. While better results might be obtained by tailoring a different topology to each sign, a four-state HMM with one skip transition was determined to be sufficient for this task [29] (Figure 4.2).
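As an illustration of how a discrete HMM with parameters (M, N, T) scores an observation sequence, the sketch below implements the standard forward algorithm and sets up a four-state left-to-right model with one skip transition. The probability values, the two-symbol alphabet and the position of the skip (state 0 to state 2) are made-up assumptions for illustration and are not taken from [29].

```python
def forward_likelihood(M, N, T, observations):
    """Probability of a discrete observation sequence under (M, N, T).

    M[i][j] : transition probability from state i to state j
    N[j][k] : probability that state j emits symbol k
    T[i]    : initial state distribution
    """
    n_states = len(T)
    alpha = [T[i] * N[i][observations[0]] for i in range(n_states)]
    for symbol in observations[1:]:
        alpha = [
            sum(alpha[i] * M[i][j] for i in range(n_states)) * N[j][symbol]
            for j in range(n_states)
        ]
    return sum(alpha)


# Four-state left-to-right topology with one skip transition (0 -> 2).
# All numbers below are illustrative assumptions.
M = [
    [0.5, 0.3, 0.2, 0.0],
    [0.0, 0.6, 0.4, 0.0],
    [0.0, 0.0, 0.7, 0.3],
    [0.0, 0.0, 0.0, 1.0],
]
N = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
    [0.1, 0.9],
]
T = [1.0, 0.0, 0.0, 0.0]
```

For recognition, one such model would be kept per sign, and an observed sequence would be assigned to the sign whose model gives the highest likelihood.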

Figure 4.2: The four-state HMM used for recognition [29].

Schlenzig et al. [31] used hidden Markov models to recognize “hello,” “good-bye,” and “rotate” in sign language. Wilson and Bobick [32] explored incorporating multiple representations in HMM frameworks, and Campbell et al. [33] used an HMM-based gesture system to recognize 18 T'ai Chi gestures with 98% accuracy.

5.0 Proposed Methodology

This report proposes an inexpensive and lightweight tracking device influenced by the work of B. Dorner [34] and Robert [35]. An experiment was set up to validate the facts and data in this report. The principal method is to infer a pose from a single still frame of the hand wearing a color-dotted glove. The glove is designed so that this inference reduces to searching for that pose in a library database. The library database is generated by sampling recordings of natural hand poses and is indexed by rasterized images of the poses. A (noisy) input image from the camera is first transformed into a normalized query. It is then compared to each entry in the database according to a robust distance metric. An evaluation of the data-driven pose estimation algorithm would be expected to show a steady increase in retrieval accuracy with the size of the database [33].

5.1 Glove Design

The glove design is distinctive enough that the pose of a hand can be inferred from a single frame captured by a consumer-grade camera. The glove is made of a simple transparent polymer. It has 16 bright orange patches (hexadecimal code #FFA500) on the back, 16 lime patches (hexadecimal code #00FF00) on the front and 5 magenta patches (hexadecimal code #FF00FF) at the fingertips. The system looks only for these three fully saturated colors (#FFA500, #00FF00, #FF00FF) to distinguish the front, the back and the fingertips of the hand; they are referred to as the master colors. The color patches and their pattern on the glove enable quicker and more robust pose estimation with less complex color identification algorithms [36]. Orange and lime patches meet along the side of each finger, which makes the side of the finger easy to distinguish. The 3D hand model has 26 degrees of freedom (DOF): 6 DOFs for the global transformation and 4 DOFs for each of the five fingers (6 + 4 × 5 = 26).

Figure 5.1: Glove Design (panels: back view of glove, front view of glove, side view of glove)

5.1.1 Glove vs. Bare Hand tracking

In bare-hand pose estimation, two very different poses can map to very similar images. This is a difficult challenge that requires slower and more complex inference algorithms to address. An extra step (edge detection) is needed to obtain the skin data for good results [37]. With a gloved hand, very different poses always map to very different images (see Figure 3). This makes a simple image lookup approach possible.

Figure 5.1.1: Bare hand estimation and edge detection [17]

5.2 Rasterizing the frame

A typical consumer webcam has a 30 Hz to 60 Hz refresh rate and captures 24 to 30 frames per second. In the experiment, a Sony® Visual Communication 2.0 web camera (30 fps at a 60 Hz refresh rate) was used. The video was captured with iPi Recorder®. The captured video is a sequence of frames (Ψ). A bilateral filter is used to reduce noise and smooth each frame. Each frame of Ψ is then rasterized using Adobe Image Processor®.

5.3 Color and pixel correcting

The rasterized frames are then converted into the primary pixel set Ω, where Ω is a set of 100x100 pixel images. Using color pixel classification, the system recognizes only three distinct colors: magenta, lime and orange. Because of ambient light, the sensitivity of the webcam sensor, quality loss during image format conversion, hue shifts and shadows, a captured frame may lose a significant number of color pixels. To solve this problem, every color close to magenta, lime or orange is classified as a glove pixel (Table 5.3). The system rejects any other color in the frame, including the background color. For best results, these three colors should not appear anywhere in the camera's field of view other than on the glove.

Table 5.3: Estimating neighboring color

Master color | Neighboring colors (accept all colors in the range)
#FFA500 (RGB decimal 255, 165, 0) | RGB (205~255, 130~180, 0)
#00FF00 (RGB decimal 0, 255, 0) | RGB (0~90, 170~255, 0~90)
#FF00FF (RGB decimal 255, 0, 255) | RGB (170~255, 0~80, 170~255)
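The classification rule in Table 5.3 can be sketched as a simple range test. This is a minimal illustration: the names `GLOVE_COLOR_RANGES` and `classify_pixel` are hypothetical, the ranges are copied from the table as written (including the fixed blue value of 0 for orange), and pixels are assumed to be 8-bit RGB tuples.

```python
# Accepted RGB ranges copied from Table 5.3: (low, high) per channel, inclusive.
GLOVE_COLOR_RANGES = {
    "orange": ((205, 255), (130, 180), (0, 0)),
    "lime": ((0, 90), (170, 255), (0, 90)),
    "magenta": ((170, 255), (0, 80), (170, 255)),
}


def classify_pixel(rgb):
    """Return the master color name for a glove pixel, or None for non-glove."""
    for name, ranges in GLOVE_COLOR_RANGES.items():
        if all(low <= value <= high for value, (low, high) in zip(rgb, ranges)):
            return name
    return None
```

Applying `classify_pixel` to every pixel of a frame yields a label grid in which `None` marks non-glove pixels, which is the form assumed by the later steps.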

After color pixel classification, only two classes of pixel remain: glove pixels and non-glove pixels. The glove pixels are cropped and reduced to 40x40 pixel micro images. Let µ denote the micro image set, which is classified as the hand region. Once the hand region is acquired, it is queried against the library database for a positive match. Reducing the number of pixels to micro images speeds up the query for a positive match.
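The cropping and reduction step might look like the sketch below, which crops a classified frame to the bounding box of its glove pixels and resamples it to 40x40. The function name `crop_and_shrink` and the choice of nearest-neighbor resampling are assumptions made for illustration; the report does not specify the resampling method.

```python
def crop_and_shrink(labels, size=40):
    """Crop a label grid to the glove's bounding box and resample it.

    labels : 2D list where None marks a non-glove pixel and any other
             value is a glove color label
    size   : side length of the square micro image
    Returns a size-by-size grid, or None when no glove pixel is present.
    """
    rows = [r for r, row in enumerate(labels) if any(v is not None for v in row)]
    cols = [c for c in range(len(labels[0]))
            if any(row[c] is not None for row in labels)]
    if not rows:
        return None
    top, left = rows[0], cols[0]
    height = rows[-1] - top + 1
    width = cols[-1] - left + 1
    # Nearest-neighbor resampling keeps the labels discrete (no color mixing).
    return [
        [labels[top + (i * height) // size][left + (j * width) // size]
         for j in range(size)]
        for i in range(size)
    ]
```

Cropping to the bounding box also normalizes the query for the hand's position and apparent size in the frame, so that only the pattern of colors matters during lookup.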

5.4 Indexing the library database

The library database is produced by sampling 40x40 pixel images of natural hand configurations, sign language alphabet signs and common hand gestures. It serves as the reference database. A rich database that covers all natural hand gestures improves the system's retrieval accuracy [34]. In the experiment, a set D of 1000 finger configurations was sampled using the iPiMoCap® system.

The members of D are denoted d1, d2, d3, ..., dn (n is a natural number), and the distance metric between dm and dn is denoted s(dm, dn). Low-dispersion sampling was used to draw a uniform set of samples D from an overcomplete collection of finger configurations Ω. At each iteration, the sampling algorithm [35] reduces dispersion by selecting the configuration that is furthest from the samples chosen so far:

$m_{i+1} = \arg\max_{m \in \Omega} \min_{n \in D_i} s(d_m, d_n)$

where $D_i$ is the set of samples selected after iteration i.
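The selection rule above is a greedy furthest-point strategy, which can be sketched as follows. The function name `low_dispersion_sample`, the use of the first candidate as the seed and the one-dimensional toy configurations in the usage note are assumptions for illustration; real finger configurations would be vectors of joint angles compared with the metric s.

```python
def low_dispersion_sample(candidates, distance, count):
    """Greedily pick `count` well-spread items from a non-empty list.

    Each new sample is the candidate whose distance to its closest
    already-chosen sample is largest (the arg-max-min rule).
    """
    chosen = [candidates[0]]          # arbitrary seed; an assumption
    # Distance from every candidate to its nearest chosen sample so far.
    nearest = [distance(c, chosen[0]) for c in candidates]
    while len(chosen) < min(count, len(candidates)):
        best = max(range(len(candidates)), key=nearest.__getitem__)
        chosen.append(candidates[best])
        nearest = [min(old, distance(c, candidates[best]))
                   for old, c in zip(nearest, candidates)]
    return chosen
```

For example, `low_dispersion_sample([0.0, 1.0, 2.0, 9.0, 10.0], lambda a, b: abs(a - b), 3)` picks the seed 0.0, then 10.0 (furthest from the seed), then 2.0 (furthest from both), so clustered candidates are not over-represented.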

5.5 Matching and Tracking

Tracking is performed between frames. The centroid of each visible colored patch in each frame of the rasterized sequence (Ψ) is calculated, and the system identifies the closest vertex to each centroid. The displacement of each centroid caused by the moving hand is then calculated as the difference between two consecutive frames. Correspondence is then established between the centroids of each frame.
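A simplified version of the centroid and displacement calculation is sketched below. As a stated simplification, it computes one centroid per glove color rather than one per individual patch (separating patches of the same color would additionally require connected-component labeling), and the function names are hypothetical.

```python
def patch_centroids(labels):
    """Centroid (row, col) of the pixels of each glove color in one frame."""
    sums = {}
    for r, row in enumerate(labels):
        for c, color in enumerate(row):
            if color is not None:
                total = sums.setdefault(color, [0, 0, 0])
                total[0] += r
                total[1] += c
                total[2] += 1
    return {color: (sr / n, sc / n) for color, (sr, sc, n) in sums.items()}


def centroid_displacements(prev_labels, next_labels):
    """Per-color centroid motion between two consecutive frames."""
    before = patch_centroids(prev_labels)
    after = patch_centroids(next_labels)
    return {
        color: (after[color][0] - before[color][0],
                after[color][1] - before[color][1])
        for color in before.keys() & after.keys()
    }
```

Colors that are visible in only one of the two frames are skipped, since no correspondence can be established for them.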

5.6 Pose estimation and finding nearest neighbor

The nearest neighbor is found by calculating a distance metric between two micro images. A Hausdorff-like distance between pixel locations is used to measure the divergence. Each micro image is searched against the library database for the closest match. To do so, the micro image is compared with each database image, and the divergence is calculated both from the database image to the query and from the query to the database image. For each non-background pixel in one image, the distance to the closest pixel of the same color in the other image is penalized.

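Given the distance metric [33], the divergence from micro image $\mu_1$ to micro image $\mu_2$ is

$\tilde{s}(\mu_1, \mu_2) = \frac{1}{|A_1|} \sum_{(x,y) \in A_1} \min_{(u,v) \in U_{xy}} \left[ (u - x)^2 + (v - y)^2 \right]$

$U_{xy} = \{ (u, v) \mid \mu_1(x, y) = \mu_2(u, v) \}$

$A_1 = \{ (x, y) \mid \mu_1(x, y) \neq \text{background pixel} \}$

and the symmetric distance is the sum of the two divergences:

$s(\mu_1, \mu_2) = \tilde{s}(\mu_1, \mu_2) + \tilde{s}(\mu_2, \mu_1)$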

Figure 5.6: Hausdorff-like image distance. A database image and a query image are compared

by computing the divergence from the database to the query and from the query to the

database. [33]
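The two-way comparison described in this section can be sketched as follows. The sketch uses the squared pixel offset as the per-pixel penalty; the function names, the label-grid representation (with `None` as background) and the penalty applied when a color is absent from the other image are assumptions made for illustration.

```python
def directed_divergence(a, b):
    """Mean squared distance from each glove pixel of `a` to the nearest
    pixel of the same color in `b` (equal-size 2D label grids, with None
    as background)."""
    by_color = {}
    for v, row in enumerate(b):
        for u, color in enumerate(row):
            if color is not None:
                by_color.setdefault(color, []).append((u, v))
    # Assumption: a color missing from `b` costs the squared image diagonal.
    missing = len(b) ** 2 + len(b[0]) ** 2
    total, count = 0.0, 0
    for y, row in enumerate(a):
        for x, color in enumerate(row):
            if color is None:
                continue
            count += 1
            matches = by_color.get(color)
            total += (min((u - x) ** 2 + (v - y) ** 2 for u, v in matches)
                      if matches else missing)
    return total / count if count else 0.0


def image_distance(a, b):
    """Symmetric distance: divergence a->b plus divergence b->a."""
    return directed_divergence(a, b) + directed_divergence(b, a)
```

The nearest neighbor of a query is then the database micro image with the smallest `image_distance`; summing both directions prevents an image with only a few matching pixels from scoring as a close match.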

5.7 Blending nearest neighbor

To make tracking smoother, the n closest database micro images are blended rather than relying on a single nearest neighbor. Blending a number of neighbors in addition to pose estimation helps to refine the distance and thus to calculate the motion more accurately. Let ϒ be the set of the ten closest micro images. The blended estimate is calculated with a Gaussian radial basis kernel [33]:

$d_p(\mu) = \frac{\sum_{i \in \Upsilon} d_i \exp\left(-s(\mu, \mu_i) / \sigma^2\right)}{\sum_{i \in \Upsilon} \exp\left(-s(\mu, \mu_i) / \sigma^2\right)}$

where $d_i$ is the database configuration associated with neighbor $\mu_i$ and $\sigma$ is chosen to be the average distance to the neighbors.
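The blending step amounts to a weighted average in which closer neighbors receive larger weights. The sketch below is one possible reading of it: the function name `blend_poses`, the representation of a pose as a list of numbers (for example joint angles) and the handling of the all-exact-match case are assumptions for illustration.

```python
import math


def blend_poses(neighbors, sigma=None):
    """Blend neighbor poses with Gaussian radial basis weights.

    neighbors : list of (distance, pose) pairs, where pose is a list of
                numbers and distance is the image distance to the query
    sigma     : kernel width; defaults to the mean neighbor distance
    """
    if sigma is None:
        sigma = sum(s for s, _ in neighbors) / len(neighbors)
    if sigma == 0:                      # every neighbor is an exact match
        weights = [1.0] * len(neighbors)
    else:
        weights = [math.exp(-s / sigma ** 2) for s, _ in neighbors]
    total = sum(weights)
    dims = len(neighbors[0][1])
    return [
        sum(w * pose[k] for w, (_, pose) in zip(weights, neighbors)) / total
        for k in range(dims)
    ]
```

Two equally distant neighbors contribute equally, so `blend_poses([(1.0, [0.0, 10.0]), (1.0, [2.0, 20.0])])` returns a pose halfway between them, while a much closer neighbor dominates the result.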

5.8 Recognizing ASL

Once the hand region is determined, pose estimation is completed and the nearest neighbors are blended, the system queries the library database for a positive match. The results of the experiment are shown in Figure 5.8: hand tracking makes it possible to recognize the sign language alphabet (J and Z are not shown).

[Figure 5.8 column headings, repeated for each group of signs: Sign alphabet; Captured frame; Image de-noising and normalization; Database candidate match (micro image)]

Figure 5.8 Interpretation of sign language alphabets.

6.0 Discussion

The proposed hand tracking system combines optical and glove tracking. Although LED markers, magnetic sensors, silhouette analysis and acoustic tracking offer robust and smooth tracking and are widely used in the automation, medical and entertainment industries, the proposed design avoids these technologies because they require more sophisticated algorithms, expense and time. This keeps the system affordable for consumers. A detailed comparison of the technical data of the available hand tracking systems and the proposed prototype is given in Table 6.1.

Table 6.1: Technical details comparison [7], [8], [9], [10], [11], [5].

Device/System name | Tracking system | Retail price/manufacturing expense | Camera/sensor used/DOF | Speed | Weight | Resolution/area of coverage
trakSTAR® | Magnetic tracking | $50,000.00 | 6 DOF | 1000 fps | 2 kg | 6 megapixel
Vicon® optical system | Optical and marker tracking | $30k-150k | 10-24 camera | 1000 fps | 6 kg | 1280x1024 pixel
Optotrak Certus® | Marker system, optical tracking | $70,000.00+ | 540 camera, 8 sensor positions | 900 | 18 kg + 3.4 kg (system control) | Capture region 3x4 meter, marker 512
Ascension (Polhemus®) | Magnetic tracking | $50,000.00 | 18 sensors, six DOFs | 120 fps | 1.8 kg | 3 m capture area
Exoskeleton®s (joint sensors plus a gyroscope) mechanical system | Electromagnetic tracking | $40,000.00 | 180 sensors | 500 samples/sec | 5 kg | No range limit, wired system
Mattel Power Glove | Acoustic tracking | $10,000.00 | 6 DOF | 120 samples/sec | 1.1 kg | 2 m radial area
Digital Desk [5] | Optical tracking, silhouette analysis | $5,000.00 | 2 camera | 30 fps | - | Less than 1 m
Videoplace | Optical tracking, silhouette analysis | $2000.00 | 1 camera | 24-60 fps | - | Less than 1 m
MIT LED glove | LED glove technology | - | 16 DOF | 100~120 samples/sec | 0.8 kg | -
Cyber Glove | 22 thin foil strain gauges sewn into the fabric glove to track, electromagnetic | $5000.00+ | 22 thin foil | 300 samples/sec | 0.5 kg | -
Boujou silver bullet | Optical tracking | $40,000.00 | 10 camera | Full frame 120 fps | 2 kg | 16 megapixel
Proposed prototype: Color dotted glove | Marker and optical tracking | $105.00 (glove + webcam + capturing software) | 1 camera, hand model has 26 DOF | 30 fps at 60 Hz refresh rate | Glove weighs 100 grams | 0.1 meter at 640x360 pixel

A detailed comparison of advantages and limitations is given in Table 6.2.

Table 6.2: Comparison of advantage and limitations[7], [8], [9], [10], [11], [5].

| Device/System name | Advantages | Limitations |
|---|---|---|
| trakSTAR® | High rate data, highly available in industry | Expense, occlusion |
| Vicon optical system (Motion analysis) | High rate data, highly available in industry | Expense, occlusion, relies on software for data resolution |
| Optotrak Certus® | Minimum data loss after occlusion identifies, no environment restriction | Low capture rate, small region |
| Ascension (Polhemus®) | No occlusion, orientation information recorded | Environment restriction, can be bulky |
| Exoskeletons (joint sensors plus a gyroscope), mechanical system | Fits a rigid body skeleton well, high data rate | Not accurate in body location |
| VPL Data Glove | Reasonable cost | Slow capture speed |
| Sayre Glove | Effective for multi-functional control | Fewer gestures |
| Digital Data Entry Glove | First ASL recognizer | Slow processing time |
| Cyber Glove (Virtual Technologies) | Comfortable, easy to use, with accuracy and precision well suited for complex gestural work or fine manipulations | - |
| Vicon MX | Precise and accurate tracking | - |
| Proposed prototype: Color dotted glove | Light weight, comfortable, faster than using an HMM or a recurrent neural network because it queries the database for a positive match rather than processing topologies | Slow estimation process time, limited accuracy due to inadequate library database |

Although the proposed system has a slow estimation response time because it accesses the database for every frame, it proved credible in terms of expense and complexity.
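The per-frame database query described above can be sketched as a nearest-neighbor lookup. This is only a minimal illustration under stated assumptions, not the prototype's implementation: the tiny 3x3 frames, the color labels, the pose names, and the helper names `frame_distance` and `nearest_pose` are all hypothetical, and a simple Hamming distance stands in for whatever distance metric the real system uses.

```python
from typing import Dict, List, Tuple

# A rasterized frame flattened to a list of color labels (0 = background).
Frame = List[int]


def frame_distance(a: Frame, b: Frame) -> int:
    """Count the pixels whose color labels differ (Hamming distance)."""
    if len(a) != len(b):
        raise ValueError("frames must have the same number of pixels")
    return sum(1 for x, y in zip(a, b) if x != y)


def nearest_pose(query: Frame, library: Dict[str, Frame]) -> Tuple[str, int]:
    """Linear scan of the library; return (pose name, distance) of the best match."""
    if not library:
        raise ValueError("library is empty")
    best = min(library, key=lambda name: frame_distance(query, library[name]))
    return best, frame_distance(query, library[best])


# Hypothetical library of three 3x3 rasterized poses (assumed data).
LIBRARY: Dict[str, Frame] = {
    "A": [0, 1, 0, 0, 1, 0, 0, 2, 0],
    "B": [1, 1, 1, 0, 2, 0, 0, 2, 0],
    "C": [0, 0, 0, 3, 3, 3, 0, 2, 0],
}

if __name__ == "__main__":
    # A frame of pose "B" with one misclassified pixel.
    noisy_b = [1, 1, 0, 0, 2, 0, 0, 2, 0]
    print(nearest_pose(noisy_b, LIBRARY))
```

Because every frame is compared against every library entry, the cost of this scan grows linearly with the size of the library, which is consistent with the slow estimation time noted above.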

7.0 Conclusion

This report introduced a hand-tracking user-input device composed of a single camera and a polymer glove. The report shows that the system can work effectively without using an HMM or a recurrent neural network. The system is logically balanced and should work effectively in 3-D manipulation and pose recognition tasks. It could be improved by installing fine sensors and inverse kinematics algorithms, but that would undermine its cost-effectiveness, because the primary purpose of this report is to deliver a robust, low-cost user-input device.

8.0 Recommendation

The proposed system allows several possible extensions. More cameras can be installed for greater accuracy, as long as the hands do not occlude each other. Hand movement and finger configuration can be replaced with LED pens or multi-touch interfaces for ease of use.

Inverse kinematics [25] and optimal smoothness [33] can be applied to improve the accuracy of the detection and tracking system. The camera calibration process can be improved with better sensor alignment and resolution. Alongside sign language recognition, the system can also be used in virtual surgery, virtual games and sports.

References:

1. Judith Holt, Sue Hotto and Kevin Cole, Demographic Aspects of Hearing Impairment: Questions and Answers, Third Edition, 1994.

2. Karen Nakamura, About ASL, Deaf Resource Library, http://www.deaflibrary.org.

3. Robert A. Heinlein, "Science fiction: its nature, faults and virtues", The Science Fiction Novel, Chicago: Advent, 1959.

4. G.J. Grimes, "Digital Data Entry Glove Interface Device", Bell Telephone Laboratories, Murray Hill, NJ, US Patent 4,414,537, Nov. 8, 1983.

5. D.J. Sturman, D. Zeltzer, "A survey of glove-based input", IEEE Computer Graphics and Applications, January 1994.

6. J. Rehg, T. Kanade, DigitEyes: Vision-Based Human Hand-Tracking, School of Computer Science Technical Report CMU-CS-93-220, December 1993.

7. Herman J. and Woltering, Optotrak, Selspot, Gait Measurement in Two- and Three-Dimensional Space: A Preliminary Report, Cleveland, 1994.

8. Optotrak Certus® Motion Capture System, available at: http://www.ndigital.com/lifesciences/certus-techspecs.php.

9. Vicon MX®, available at: http://www.vicon.com/products/sensors.html.

10. Myron Krueger, Artificial Reality 2, Addison-Wesley Professional, 1991.

11. Pierre Wellner, Interacting with paper on the DigitalDesk, Rank Xerox EuroPARC, Cambridge, UK, Volume 36, Issue 7, pp. 87-96, July 1993.

12. Jannick P. Rolland, Yohan Baillot, and Alexei A. Goon, A Survey of Tracking Technology for Virtual Environments, Center for Research and Education in Optics and Lasers (CREOL), University of Central Florida.

13. Polhemus FASTRAK® official website, available at: http://www.polhemus.com/?page=Motion_Fastrak

14. Ascension Technologies trakSTAR® official website, available at: http://www.ascension-tech.com/medical/trakSTAR.php

15. Logitech® video technologies, http://www.logitech.com/en-us/488/455

16. A.G.E. Tech, Abrams Gentile Entertainment, 2009.

17. Vitor F. Pamplona, Leandro A. F. Fernandes, João Prauchner, Luciana P. Nedel and Manuel M. Oliveira, The Image-Based Data Glove, Proceedings of X Symposium on Virtual Reality (SVR'2008), João Pessoa, 2008. Anais do SVR 2008, Porto Alegre: SBC, 2008, pp. 204-211.

18. Dr. G. Grimes, Digital Data Entry Glove, US Patent 4,414,537, patented Nov. 8, 1983.

19. Tom Zimmermann et al., Dataglove: A hand gesture interface device, 1985.

20. Ken Pimentel, Kevin Teixeira, "Virtual Reality: through the new looking glass", Intel/Windcrest/McGraw Hill, 1993.

21. L. Campbell, D. Becker, A. Azarbayejani, A. Bobick, and A. Pentland, "Invariant features for 3-D gesture recognition," Intl. Conf. on Face and Gesture Recogn., pp. 157-162, 1996.

22. Y. Cui and J. Weng, "Learning-based hand sign recognition," Intl. Work. Auto. Face Gest. Recog. (IWAFGR), pp. 201-206, 1995.

23. Thad Starner, Joshua Weaver, and Alex Pentland, A Wearable Computer Based American Sign Language Recognizer, The Media Laboratory, Massachusetts Institute of Technology, 2001.

24. Marcus Vinicius Lamar, "Hand Gesture Recognition using T-CombNET A Neural Network Model dedicated to Temporal Information Processing," Doctoral Thesis, Institute of Technology, Japan, 2001.

25. Ankit Chaudhary, J. L. Raheja, Karen Das, and Sonia Raheja, "Intelligent Approaches to interact with Machines using Hand Gesture Recognition in Natural way A Survey," International Journal of Computer Science & Engineering Survey (IJCSES), vol. 2(1), Feb. 2011.

26. Jian-kang Wu, "Neural networks and Simulation methods," Marcel Dekker, Inc., USA, 1994. Available at: http://books.google.co.in/books/about/Neural_networks_and_simulation_methods.html?id=95iQOxLDdK4C&redir_esc=y

27. Manar Maraqa, Raed Abu-Zaiter, "Recognition of Arabic Sign Language (ArSL) Using Recurrent Neural Networks," IEEE First International Conference on the Applications of Digital Information and Web Technologies, p. 478-48, 2008.

28. Tie Yang, Yangsheng Xu, Hidden Markov Model for Gesture Recognition, May 1994.

29. Thad Eugene Starner, Visual Recognition of American Sign Language Using Hidden Markov Models, 1999.

30. J. Schlenzig, E. Hunter, and R. Jain, "Recursive identification of gesture using hidden Markov models," Proc. Second Ann. Conf. on Appl. of Comp. Vision, pp. 187-194, 1994.

31. A. Wilson and A. Bobick, "Learning visual behavior for gesture analysis," Proc. IEEE Int'l. Symp. on Comp. Vis., Nov. 1995.

32. C.Y. Suen, M. Berthod, and S. Mori, "Automatic recognition of handprinted characters: the state of the art," Proceedings of the IEEE, Vol. 68, No. 4, pp. 469-487, 1980.

33. B. Dorner, Chasing the colour glove: visual hand tracking, 1994.

34. Robert Y. Wang, Jovan Popovic, Real-Time Hand-Tracking with a Color Glove, 2009.

35. R. White, K. Crane, and D. A. Forsyth, Capturing and animating occluded cloth, ACM Transactions on Graphics, 2008.

36. M. Yuan, F. Farbiz, C.M. Manders, T.K. Yin, "Robust hand tracking using a simple color classification technique", The International Journal of Virtual Reality, 8(2), 2009.
