
A Gesture Recognition System using Smartphones

Luís Tarrataca
Instituto Superior Técnico
Av. Prof. Dr. Aníbal Cavaco Silva
Lisbon, Portugal

[email protected]

ABSTRACT
The need to improve communication between humans and computers has been instrumental in defining new modalities of communication, and accordingly new ways of interacting with machines. Humans use gestures as a means of communication. The best example of communication through gestures is given by sign languages. With the latest generation of smartphones boasting powerful processors and built-in video cameras it becomes possible to develop complex and computationally expensive applications such as a gesture recognition system.

We present a gesture recognition prototype system for smartphones. We describe two segmentation processes, based on the RGB and HSI color spaces, used to distinguish between foreground and background objects, and conclude that the RGB-based rules outperform those of HSI. A model-based representation and a template-based technique were developed for hand posture classification; a comparison between the two methods shows that the latter achieves superior recognition rates. We employed hidden Markov models to model the sequences of hand postures that form gestures. Hidden Markov models proved to be an efficient tool for dealing with uncertainty due to their probabilistic nature.

We trained a set of three gestures and, based on user interaction, obtained an average recognition rate of 83.3%. We concluded that the latest smartphone generation is capable of executing complex image processing applications, with the most penalizing factor being camera performance regarding capture rates.

Keywords: Development cycle, hidden Markov models, pattern recognition, skin detection, smartphone performance, system architecture.

1. INTRODUCTION
1.1 Motivation
The need to improve communication between humans and computers has been instrumental in defining new modalities of communication, and accordingly new ways of interacting with machines. Sign language is a frequently used tool when the use of voice is impossible, or when the action of writing and typing is difficult, but the possibility of vision exists. Moreover, sign language is the most natural and expressive way for the hearing impaired. Gestures are thus suited to convey information for which other modalities are not efficient.

When it comes to gesture recognition and human-computer interaction, a considerable body of scientific work has been developed, but before proceeding it is important to distinguish between gesture and posture. This distinction, as well as the definition of both terms, can be found in [Davis and Shah, 1994], namely:

• Posture - A posture is a specific configuration of hand flexion observed at some time instance;

• Gesture - A gesture is a sequence of postures connected by motions over a short time span. Usually a gesture consists of one or more postures sequentially occurring on the time axis.

With the latest generation of smartphones boasting powerful processors and built-in video cameras, the development of complex and computationally expensive applications becomes achievable. Moreover, developing gesture recognition software for smartphones allows one to study the feasibility of such an interaction process.

1.2 Goals
The main goals of this paper focus on the areas of computer vision, pattern recognition and their applicability to embedded systems such as smartphones. These goals can be stated as follows:

• Develop algorithms that allow the recognition of hand postures, by utilizing a predefined set of hand postures;

• Develop algorithms that, based on the recognized hand postures, allow the recognition of a simple gesture-based language;

• Implement the selected algorithms on a smartphone with a built-in camera;


• Study and analyze the capabilities of smartphones to execute computationally intensive algorithms;

• Analyze the performance of the algorithms according to their efficiency and processing time;

• Study and analyze the capabilities of smartphones as image acquisition and processing units;

• Study the potential for code optimizations in order to speed up the performance of the algorithms.

1.3 Problem Statement
The recent proliferation of latest generation smartphones featuring advanced capabilities (such as the ability to run operating systems, to add new applications that better serve their users, and multimedia features such as video and image capture) has allowed for the development of a compact and mobile system with many features that a few years ago belonged to the realm of a normal computer.

Yet due to their size and battery requirements even today's most evolved models have constraints not usually found on a regular desktop computer. This is especially true when considering that such devices are usually resource constrained in both memory and CPU performance. Since sign language is gesticulated fluently and interactively like other spoken languages, a gesture recognizer must be able to recognize continuous sign vocabularies and attempt to do so in an efficient manner.

A vision based approach to gesture recognition can be realized by using the cameras that are included in most of today's smartphones and converting them into smart cameras. Smart cameras are cameras that include image processing algorithms. These algorithms might, for example, extract some features of images. In this case the data structures representing feature information are transmitted to the computation system's core, in order to be properly analyzed. One of the tasks of a smart camera might be the recognition of gestures identifying a command to be executed and properly notifying the system's core.

By developing an approach considering the real-time requirements associated with gesture recognition and the respective constraints associated with embedded systems, such as smartphones, it becomes possible to study any potential issues that may arise.

1.4 Organization
The remaining sections of this work are organized in accordance with the overall architecture of this gesture recognition system.

Section 2 presents the different stages relating to the smartphone component, namely:

• Section 2.1 presents the segmentation process, denominated "Skin Detection". In this process a thresholding operation is realized in order to allow for differentiation of background and foreground. This way it becomes possible to label sections of an image as belonging or not to a hand posture.

• Section 2.2 focuses on feature extraction algorithms that allow the determination of points of interest in a posture. Once the features describing a posture are obtained, they can be compared by a classification algorithm against a labeled database containing features of a sample posture population. The approaches taken toward feature extraction as well as posture classification and respective issues are presented in this section.

• Section 2.3 focuses on the methods developed and respective issues surrounding the use of hidden Markov models applied to gesture recognition. Once a posture has been labeled it needs to be contextualized with the remaining postures of a gesture in order to provide for semantic meaning in a probabilistic fashion; this is where hidden Markov models come into play.

Section 3 discusses the approach taken towards system implementation, namely the development cycle employed, as well as core components of the system architecture.

Section 4 focuses on the main experimental results affecting the gesture recognition system developed, such as overall smartphone and system performance.

2. GESTURE RECOGNITION
2.1 Skin Detection
Skin color has proven to be a useful and robust cue for gesture recognition, face detection, localization and tracking. Color allows for fast processing, and experience suggests that human skin has a characteristic color which is easily recognized by humans. In the scientific literature several authors provide an excellent introductory base for the topic of skin detection and respective issues; see [Jones and Rehg, 2002], [Kovac et al., 2003] and [Vezhnevets et al., 2003].

When color is used as a feature for skin detection it is necessary to choose from a wide range of possible color spaces, including RGB, normalized RGB, HSI and YCrCb, among others.

2.1.1 RGB-based rules
Due to its huge popularity and wide adoption in the electronics industry, namely in digital cameras and computer monitors, the RGB color space presents itself as an adequate candidate for color-based skin detection. A detailed analysis of the properties of the RGB color space is presented in [Yamaguchi, 1984].

Using the RGB model, a simple pixel based heuristic is presented in [Kovac et al., 2003] that determines whether a certain pixel of an input image corresponds to the skin color. With the algorithmic rules of Figure 1, it becomes possible to label the pixels of an image as being skin or not. The obvious and main advantage of this method is the simplicity of the set of skin detection rules, which leads to the construction of a fast classifier. On the other hand, the main difficulty in achieving high recognition rates with this method is the need to find both a good color space and adequate decision rules empirically. These rules are also highly influenced by the skin samples used and by the lighting conditions under which they were captured.


Algorithm SKIN-DETECTION (I)
Input: I = an array containing the RGB encodings for all pixels
Output: O = an array containing the classification for each pixel of I
1. let m be the size of I
2. let R be the red component of a pixel
3. let G be the green component of a pixel
4. let B be the blue component of a pixel
5. for i ← 1 to m
6.    do if R > 95 AND G > 40 AND B > 20 AND
7.          MAX(R,G,B) - MIN(R,G,B) > 15 AND
8.          |R − G| > 15 AND
9.          R > G AND R > B
10.      then WHITE( O[i] )
11.      else BLACK( O[i] )
12. return O

Figure 1: Rules describing the skin cluster in the RGB color space at uniform daylight illumination.
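For illustration, the rules of Figure 1 translate directly into Java, the language used throughout this work. The following is a minimal sketch, assuming pixels packed as 0x00RRGGBB (the layout produced by Image.getRGB, ignoring the alpha channel); the class and method names are illustrative only.

// Minimal sketch of the RGB skin rules of Figure 1 ([Kovac et al., 2003]).
public final class RgbSkinDetector {

    /** Returns a boolean mask marking the pixels classified as skin. */
    public static boolean[] detect(int[] pixels) {
        boolean[] skin = new boolean[pixels.length];
        for (int i = 0; i < pixels.length; i++) {
            int r = (pixels[i] >> 16) & 0xFF;   // extract color components
            int g = (pixels[i] >> 8) & 0xFF;
            int b = pixels[i] & 0xFF;
            int max = Math.max(r, Math.max(g, b));
            int min = Math.min(r, Math.min(g, b));
            skin[i] = r > 95 && g > 40 && b > 20
                    && (max - min) > 15
                    && Math.abs(r - g) > 15
                    && r > g && r > b;
        }
        return skin;
    }
}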

There are, however, according to [Vezhnevets et al., 2003], some aspects of the RGB color space that do not favor its use, namely the high correlation between components and the mixing of chrominance and luminance data. These factors are important when considering the changing nature of color under different illumination conditions. The RGB color space has been used by numerous authors to study pixel-level skin detection, such as [Jones and Rehg, 2002] and [Brand and Mason, 2000].

2.1.2 HSI-based rules
Due to the different perception of color under uncontrolled light conditions it becomes useful to consider other color spaces. Hue Saturation Intensity (HSI) is a color model based on human color perception with explicit discrimination between luminance and chrominance. This explicit discrimination between luminance and chrominance was a strong factor in the choice of a color space, as can be seen in the work related to skin color segmentation presented in [Zarit et al., 1999] and [Mckenna et al., 1998]. However, [Poynton, 1995] points out several undesirable features of this family of color spaces, including hue discontinuities and a computation of brightness that conflicts with the properties of color vision.

Several authors have presented work (see [Lin and Chen, 1991, Zhang and Wang, 2000]) regarding the use of HSI in color image segmentation. In this work the segmentation technique used, which is presented in Figure 2, only takes into account the Hue component of the HSI color space (as in [Bonato et al., 2005]), allowing for a faster computation and also limiting the amount of computational resources required.

In this case H(i, j) represents the Hue value of a certain pixel and T1 and T2 describe inferior and superior thresholds. The threshold values are required for determining when a pixel belongs to a region of interest and they should take into consideration the skin locus in the HSI color space. The resulting image contains a binary representation. An RGB-HSI conversion mechanism is presented in [Gonzalez and Woods, 1992], with expression 1 illustrating how to calculate the hue value.

H = \cos^{-1}\left[\frac{\frac{2}{3}\left(r-\frac{1}{3}\right)-\frac{1}{3}\left(b-\frac{1}{3}\right)-\frac{1}{3}\left(g-\frac{1}{3}\right)}{\sqrt{\frac{2}{3}\left[\left(r-\frac{1}{3}\right)^{2}+\left(b-\frac{1}{3}\right)^{2}+\left(g-\frac{1}{3}\right)^{2}\right]}}\right] \qquad (1)

Algorithm SKIN-DETECTION (I)
Input: I = an array containing the RGB encodings for all pixels
Output: O = an array containing the classification for each pixel of I
1. let m be the size of I
2. let H be the hue value of a pixel
3. let T1 be a minimum threshold value
4. let T2 be a maximum threshold value
5. for i ← 1 to m
6.    do H ← CONVERT-TO-HUE( I[i] )
7.       if T1 <= H <= T2
8.          then WHITE( O[i] )
9.          else BLACK( O[i] )
10. return O

Figure 2: Rules describing the skin cluster in the HSI color space.
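A minimal Java sketch of the procedure of Figure 2, combining the hue computation of expression 1 with the threshold test, could read as follows. The normalization of r, g and b by the component sum and the guards for achromatic pixels are assumptions of the example, as are all names; the thresholds t1 and t2 must be chosen from the skin locus, as discussed above. Note also that J2ME's CLDC math library lacks an arc cosine, so on the phone an approximation or lookup table would be required.

// Sketch of the hue-based classification of Figure 2. The hue of each
// 0x00RRGGBB pixel is computed from expression 1 using normalized
// r, g, b components; t1 and t2 are the skin-locus thresholds.
public final class HsiSkinDetector {

    /** Computes the hue (in radians) of a packed 0x00RRGGBB pixel. */
    static double hue(int pixel) {
        int ri = (pixel >> 16) & 0xFF;
        int gi = (pixel >> 8) & 0xFF;
        int bi = pixel & 0xFF;
        double sum = ri + gi + bi;
        if (sum == 0.0) return 0.0;          // black pixel: hue is undefined
        double r = ri / sum, g = gi / sum, b = bi / sum;
        double num = (2.0 / 3.0) * (r - 1.0 / 3.0)
                   - (1.0 / 3.0) * (b - 1.0 / 3.0)
                   - (1.0 / 3.0) * (g - 1.0 / 3.0);
        double den = Math.sqrt((2.0 / 3.0)
                   * ((r - 1.0 / 3.0) * (r - 1.0 / 3.0)
                    + (b - 1.0 / 3.0) * (b - 1.0 / 3.0)
                    + (g - 1.0 / 3.0) * (g - 1.0 / 3.0)));
        if (den == 0.0) return 0.0;          // gray pixel: hue is undefined
        return Math.acos(num / den);
    }

    /** Labels each pixel as skin if its hue falls inside [t1, t2]. */
    public static boolean[] detect(int[] pixels, double t1, double t2) {
        boolean[] skin = new boolean[pixels.length];
        for (int i = 0; i < pixels.length; i++) {
            double h = hue(pixels[i]);
            skin[i] = h >= t1 && h <= t2;
        }
        return skin;
    }
}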

2.1.3 RGB- vs. HSI-based rules
RGB is one of the most widely used color spaces for processing and storing digital image data such as those acquired by digital cameras. The absence of direct support for the HSI color space requires a transformation process from the hardware-provided color space to HSI. This requirement can potentially harm the use of HSI.

In order to evaluate the performance of the skin detection process in both the HSI and RGB color spaces, a total of 850 images served as input to both skin detection algorithms. The resulting binary images were then compared against the actual skin positions, allowing for the determination of a precision value (the fraction of pixels labeled as skin that actually correspond to skin). The HSI-based rules obtained an average precision of 38% against a value of 82% for the RGB-based rules.

2.2 Posture Recognition
In many practical problems, such as posture recognition, there is a need to make some decision about the content of an image or about the classification of an object that it contains. The basic approach toward object classification views the object to be classified as a vector of measurements or features, typically a d-dimensional feature vector x. The classification process might actually fail, either because the posture's image contains noise, or because a new hand posture that the system does not know was presented to it.

The remaining sections deal with the approaches developed toward feature extraction, respectively the Convex Hull and Simple Pattern Recognition approaches.


2.2.1 Convex hull method
According to [Cormen et al., 2003], the convex hull of a set Q of points is the smallest convex polygon P for which each point in Q is either on the boundary of P or in its interior. For convenience we denote the convex hull of Q by CH(Q). Intuitively, one can think of each point in Q as being a nail sticking out from a board. The convex hull is the shape formed by a tight rubber band that surrounds all the nails. For each hand posture it is possible to extract a number of useful features, such as:

• The number of vertices identified;

• The x-coordinate and y-coordinate of each vertex;

• Polar angles and Euclidean distance of each vertex with respect to an anchor point.

Feature comparison can be performed by calculating the Euclidean distance between feature vectors. Graham's Scan (see [Graham, 1972] and [Cormen et al., 2003]), presented in Figure 3, is an algorithm that computes the convex hull of a set of n points. It outputs the vertices of the convex hull in counterclockwise order and runs in O(n log n) time.

Algorithm GRAHAM-SCAN (Q)
Input: Q = input set of points
Output: S = a stack containing, from bottom to top, exactly the vertices of ConvexHull(Q) in counterclockwise order
1. let p0 be the point in Q with the minimum y-coordinate, or the leftmost such point in case of a tie
2. let <p1, p2, ..., pm> be the remaining points in Q, sorted by polar angle in counterclockwise order around p0 (if more than one point has the same angle, remove all but the one that is farthest from p0)
3. PUSH( p0, S )
4. PUSH( p1, S )
5. PUSH( p2, S )
6. for i ← 3 to m
7.    do while the angle formed by points NEXT-TO-TOP(S), TOP(S), and pi makes a non-left turn
8.       do POP( S )
9.    PUSH( pi, S )
10. return S

Figure 3: Graham's scan algorithm.

The result of applying Graham's scan to an input image can be seen in Figure 4. Notice that Graham's scan is only applied after a segmentation process has distinguished between foreground and background, thus enabling the creation of the set Q that is received as an argument by the algorithm.

As expected, and illustrated by Figure 4(a), several vertices are present for each finger. Finger borders are not solely constituted by a single point, and thus the convex polygon has to incorporate several pixels surrounding the frontier of each finger. In order to solve this problem a simple clustering algorithm was employed, which agglutinates vertices falling within an error margin. Figure 4(b) illustrates the new vertices obtained after clustering has been performed.

(a) Before clustering. (b) After clustering.

Figure 4: Convex hulls before and after clustering.
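The clustering step is not detailed further in the text; one simple way to agglutinate vertices falling within a Euclidean error margin is sketched below. The greedy first-representative strategy and all names are assumptions of the example, not the authors' exact procedure.

// Sketch of a vertex clustering step: vertices closer to an existing
// representative than a given error margin are absorbed by it; the
// remaining vertices become new cluster representatives.
import java.util.Vector;

public final class VertexClustering {

    /** vertices holds int[2] elements of the form {x, y}. */
    public static Vector cluster(Vector vertices, double margin) {
        Vector clustered = new Vector();
        for (int i = 0; i < vertices.size(); i++) {
            int[] v = (int[]) vertices.elementAt(i);
            boolean merged = false;
            for (int j = 0; j < clustered.size() && !merged; j++) {
                int[] c = (int[]) clustered.elementAt(j);
                double dx = v[0] - c[0];
                double dy = v[1] - c[1];
                if (Math.sqrt(dx * dx + dy * dy) <= margin) {
                    merged = true;          // vertex falls within the margin
                }
            }
            if (!merged) {
                clustered.addElement(v);    // becomes a new representative
            }
        }
        return clustered;
    }
}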

2.2.2 Simple pattern recognition method
The next approach towards posture recognition, although relatively simple, is substantially different from the previous method. The main idea behind the applicability of the convex hull was to obtain a worthy representation or model of a hand posture. Rather than trying to model a posture, one can try to obtain some representation of a scene and match it against a set of scenes belonging to a feature-labeled database.

Let S = {p1, p2, ..., pm} contain all skin-labeled pixels of a certain scene and let each skin-labeled pixel pi = (x-coordinate, y-coordinate), as illustrated by Figure 5(b). Then, if one wishes to compare the set S with any other set S′, one simply has to calculate the number of positions that differ between sets in order to obtain an indication of the proximity between scenes. A brute force approach would simply check whether every element in S and S′ is present in the opposite set; for every element not found, a counter would be incremented to keep track of the respective number of differences. This process would be rather inefficient in terms of speed. It is also necessary to consider that each pixel position has to keep x- and y-coordinates, so there is also a significant impact in terms of memory usage.

A better tactic would be to store the position of skin-pixels in a binary array; this way it becomes possible to slash memory usage. This process has clear advantages over the overkill method of storing coordinates in typical data types such as integers, which usually have 32-bit precision. In fact, the use of integers might suffice if their precision is sufficient to depict a scene. Consider, for example, that an image with resolution 32 × 24 is acquired: rather than having a binary array of dimension 768, one could encode each line using an unsigned 32-bit precision integer in the exact same way as a binary array. By employing binary arrays it becomes possible to use the Hamming distance, which allows for efficient computation of array differences.

The procedure presented in Figure 6 is applied to an image of resolution N × M, where N represents the number of pixels per row and M the number of rows. The encoding algorithm takes as input a binary array I representing the scene, which contains the classification of each pixel, and produces a list containing, for each row, the respective encoding obtained. The algorithm runs in O(kn), where k is the number of rows and n represents the number of binary pixels per row.

(a) Posture Y. (b) 32 × 24 version of Posture Y.

Figure 5: Two images depicting the transformation process from an image containing a posture (Figure 5(a)) to a segmented image with a reduced resolution (Figure 5(b)).

Algorithm ENCODING-ALGORITHM (I)
Input: I = N × M binary array containing the classification of each pixel
Output: S = a list containing, from bottom to top, exactly the encoding for each line
1. let encoding denote an auxiliary variable that represents a row encoding
2. let offset denote an auxiliary variable that keeps track of the offset of a pixel in the binary array
3. for each row r ∈ I
4.    do offset ← 0
5.       encoding ← 0
6.       for each pixel p ∈ r
7.          do if PAINTED( p )
8.             then encoding ← encoding + SHIFT-LEFT( 1, offset )
9.          offset ← offset + 1
10.      ADD( S, encoding )
11. return S

Figure 6: Encoding Algorithm.
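With each row packed into an integer, the Hamming distance between two scenes reduces to counting the set bits of XORed row encodings. A minimal Java sketch, assuming both scenes were encoded at the same resolution, follows; the bit-clearing loop avoids relying on Integer.bitCount, which is unavailable in J2ME. All names are illustrative.

// Sketch of the Hamming distance between two scene encodings produced
// by the algorithm of Figure 6. XORing two row encodings leaves a set
// bit at every differing pixel; those bits are then counted.
public final class HammingDistance {

    public static int distance(int[] rowsA, int[] rowsB) {
        int differences = 0;
        for (int i = 0; i < rowsA.length; i++) {
            int diff = rowsA[i] ^ rowsB[i];   // 1-bits mark differing pixels
            while (diff != 0) {
                diff &= diff - 1;             // clear the lowest set bit
                differences++;
            }
        }
        return differences;
    }
}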

2.2.3 Classification performance
In order to evaluate the performance of the methods developed for hand posture recognition, namely the Convex Hull and Simple Pattern Recognition methods, a total of 2023 images of five postures was evaluated. Of these, a smaller set of 120 images was randomly chosen to represent the training set and the remaining images were used to build the test set.

Regarding the choice of a classifier algorithm, one possible method consists of labeling an unknown feature vector x into the class of the individual sample closest to it. This is the nearest-neighbor rule (see [Shapiro and Stockman, 2001]). A better classification decision can be made by examining the nearest k feature vectors in the database. Our final choice, the k-nearest-neighbor classifier, can be effective even when classes have a complex structure in the d-dimensional feature space and when classes overlap. The algorithm uses only the existing training samples, so no assumptions need to be made about models for the distribution of feature vectors in space.
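A minimal k-nearest-neighbor sketch over the scene encodings of Section 2.2.2, using the Hamming distance as the dissimilarity measure and deciding by simple majority vote, might look as follows; all names and the partial selection-sort strategy are illustrative, and labels are assumed to lie in [0, numClasses).

// Sketch of a k-nearest-neighbor classifier over scene encodings.
// Each training sample carries a posture label; the unknown scene is
// labeled by majority vote among its k closest training samples.
public final class KnnClassifier {

    private static int hamming(int[] a, int[] b) {
        int d = 0;
        for (int i = 0; i < a.length; i++) {
            int x = a[i] ^ b[i];
            while (x != 0) { x &= x - 1; d++; }  // count differing pixels
        }
        return d;
    }

    public static int classify(int[][] trainEncodings, int[] trainLabels,
                               int[] unknown, int k, int numClasses) {
        int n = trainEncodings.length;
        int[] dist = new int[n];
        int[] index = new int[n];
        for (int i = 0; i < n; i++) {
            dist[i] = hamming(trainEncodings[i], unknown);
            index[i] = i;
        }
        // Partial selection sort: bring the k smallest distances forward.
        for (int i = 0; i < k; i++) {
            for (int j = i + 1; j < n; j++) {
                if (dist[j] < dist[i]) {
                    int t = dist[i]; dist[i] = dist[j]; dist[j] = t;
                    t = index[i]; index[i] = index[j]; index[j] = t;
                }
            }
        }
        // Majority vote among the k nearest training samples.
        int[] votes = new int[numClasses];
        for (int i = 0; i < k; i++) {
            votes[trainLabels[index[i]]]++;
        }
        int best = 0;
        for (int c = 1; c < numClasses; c++) {
            if (votes[c] > votes[best]) best = c;
        }
        return best;
    }
}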

In the end, the simple pattern recognition method successfully classified 84% of the test images whilst the convex hull only correctly identified 47% of the same test set.

2.3 Hidden Markov models applied to gesture recognition

At its core a gesture is nothing more than a set of postures, which can be characterized as a signal. A problem of fundamental interest in real-world processes producing observable outputs consists in characterizing signals in terms of signal models [Rabiner, 1989]. The general principles of hidden Markov models were published by Baum and his colleagues (presented in [Baum and Petrie, 1966], [Baum and Eagon, 1967], [Baum and Sell, 1968], [Baum et al., 1970] and also [Baum, 1972]). Rabiner's and Dugad's work on hidden Markov models (HMMs), respectively presented in [Rabiner, 1989] and [Dugad and Desai, 1996], provides a detailed analysis, focusing on a methodical review of the theoretical aspects of this type of statistical modeling.

HMMs represent stochastic processes that are adequate for dealing with the stochastic properties of gesture recognition [Yang and Xu, 1994]. For this gesture recognition system, HMMs representing gestures were employed. The models and respective parameters were built from training data, with each model featuring a number of states equal to the number of postures present in each gesture. Each HMM constructed can be evaluated against a given input set of postures, in order to assess the likelihood of the model generating that sequence. In order to determine which gesture was presented to the system, all HMMs are evaluated, with the model presenting the highest probability being ultimately used for classification.
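Evaluating each model against an observed posture sequence is the classical evaluation problem, solved by the forward algorithm [Rabiner, 1989]. A compact Java sketch is given below; the matrix conventions in the comments are assumptions of the example, and a production version would rescale the alpha values to avoid numerical underflow on long sequences.

// Sketch of HMM evaluation via the forward algorithm: returns
// P(observations | model), the quantity compared across gesture models.
// Conventions assumed here:
//   a[i][j] = P(state j at t+1 | state i at t)
//   b[i][o] = P(observing posture symbol o | state i)
//   pi[i]   = P(starting in state i)
public final class HmmForward {

    public static double evaluate(double[][] a, double[][] b, double[] pi,
                                  int[] observations) {
        int states = pi.length;
        double[] alpha = new double[states];
        for (int i = 0; i < states; i++) {               // initialization
            alpha[i] = pi[i] * b[i][observations[0]];
        }
        for (int t = 1; t < observations.length; t++) {  // induction
            double[] next = new double[states];
            for (int j = 0; j < states; j++) {
                double sum = 0.0;
                for (int i = 0; i < states; i++) {
                    sum += alpha[i] * a[i][j];
                }
                next[j] = sum * b[j][observations[t]];
            }
            alpha = next;
        }
        double probability = 0.0;                        // termination
        for (int i = 0; i < states; i++) {
            probability += alpha[i];
        }
        return probability;
    }
}

Classification then amounts to calling evaluate once per trained gesture model and keeping the model that returns the highest probability.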

3. SYSTEM IMPLEMENTATION
Many embedded systems have substantially different design constraints than desktop computing applications. The combination of factors such as cost sensitivity, long life cycle, real-time and reliability requirements imposes a different mindset that makes it difficult to adapt traditional computer design methodologies and tools to embedded systems. Thus, no single characterization applies to the diverse spectrum of embedded systems [Koopman, 1996].

In the case of this gesture recognition system, the development cycle used is illustrated in Figure 7 and described next. Typical design methodologies start with a requirement capture phase characterizing the functional and non-functional requirements of an application. This phase is usually followed by an application architecture defining the execution flow and the requirements compliance responsibility associated with each functional block. The combination of requirements capture with the definition of an application architecture, expressed as Stage 1 in Figure 7, resulted in the definition of the following functional cores for this gesture recognition system:


1. Skin Detection Phase

In charge of image acquisition and respective data processing in order to distinguish between foreground and background, thus enabling the determination of image pixels that could be skin-labeled. This initial stage also provides the desired input data to the Posture Classification Phase;

2. Posture Classification Phase

In charge of posture database creation, processing and manipulation; its main responsibility is to classify a given input image, provided by the Skin Detection Phase, against a trained database. This intermediate stage also provides the desired input data to the Gesture Classification Phase;

3. Gesture Classification Phase

In charge of gesture database creation, processing and manipulation, namely the acquisition and training process of the hidden Markov models representing gestures. Besides these responsibilities, the Gesture Classification Phase is also responsible for determining to which gesture a given set of postures belongs.

Once the core functional blocks of the application architecture were defined, work proceeded as follows:

• For each functional core a proof of concept was performed using Java Standard Edition (J2SE), represented as Stage 2 of Figure 7. A standard computer application was developed featuring all functional cores. The application served as a simulator allowing for parameter specification and result observation. If the proof of concept did not meet the requirements defined for that functional block, then further refinement would be performed on it.

Due to the different operational audience of J2ME, which targets resource constrained devices, not every class of the plentiful J2SE API exists in J2ME; this is particularly true regarding data structure availability and also features such as built-in iteration mechanisms. With these limitations in mind, and in order to ensure a successful migration between platforms, a set of strategic rules was devised (a compliance sketch is given after this list), namely:

– Rule 1 - Only use API methods available in both J2SE and J2ME;

– Rule 2 - Substitute J2SE native data structures by a combination of purpose-built data structures, Vector and Hashtable classes;

– Rule 3 - Guarantee that all purpose-built data structures override the methods of the Enumeration interface;

– Rule 4 - Substitute all J2SE standard iteration mechanisms with calls to the Enumeration interface;

– Rule 5 - Exercise great care when allocating objects and take all possible steps to avoid holding references to objects longer than necessary, in order to allow the garbage collector to reclaim heap space as quickly as possible.

• Each time a functional core was deemed as having adequate behavior and performance it was migrated to a J2ME Mobile Information Device Profile (MIDP) application, also known as a MIDlet, represented as Stage 3 of Figure 7. The migration process required a careful application adaptation effort as there were portions of code, such as database creation and experimental tests, that were not necessary for MIDlet execution. Once the migration process was concluded the MIDlet was tested on Sun's Java Wireless Toolkit (WTK) 2.5.2 emulator in order to check for proper compliance with emulator specifications. Besides enabling the specification of memory restrictions, Sun's WTK also provides a series of tools to profile memory usage and CPU performance, thus enabling bottleneck detection. If the tests performed on the MIDlet revealed the existence of performance hampering bottlenecks, optimization techniques would be applied to those portions of the code and a further MIDlet reassessment would be made;

• Once the MIDlet achieved the desired behavior, the application would be loaded onto smartphones and tested, expressed as Stage 4 of Figure 7. At this stage of development, the main test concerns regarded performance issues such as algorithm execution, memory issues such as ensuring that the application did not run out of memory, and also the fine tuning of parameters such as camera resolution in order to obtain the best balance between speed and quality;

• If at any given time it was concluded that an implementation strategy was not suitable due to performance issues, further refinement would be made in the previous stages.
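The compliance sketch promised above illustrates Rules 2 to 4: a purpose-built container backed by Vector that exposes traversal exclusively through the Enumeration interface, so the same code compiles under both J2SE and J2ME. All names are illustrative.

import java.util.Enumeration;
import java.util.Vector;

// Sketch of Rules 2-4: a purpose-built container backed by Vector
// (available in both J2SE and J2ME) whose traversal is exposed only
// through the Enumeration interface.
public final class PostureDatabase {

    private final Vector samples = new Vector();

    public void add(Object sample) {
        samples.addElement(sample);
    }

    /** Rule 3: traversal is exposed through Enumeration. */
    public Enumeration elements() {
        return samples.elements();
    }

    /** Rule 4: iterate with Enumeration instead of J2SE-only iterators. */
    public static int count(PostureDatabase db) {
        int n = 0;
        for (Enumeration e = db.elements(); e.hasMoreElements(); ) {
            e.nextElement();
            n++;
        }
        return n;
    }
}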

Figure 7: Development cycle characterization.


4. EXPERIMENTAL RESULTS
In the case of this work, besides the evident performance issues regarding how many correct results were achieved by this gesture recognition system, it is also crucial to consider its core execution platform, i.e. smartphones. Keep in mind that one of the main objectives of this work is to assess current smartphone performance, considering not only computational power but also camera performance.

4.1 Smartphone Performance
J2ME grants access to a phone's camera through the optional Mobile Media API (MMAPI) library (see [Sun, 2008]), which provides audio, video and other multimedia support to resource constrained devices. MMAPI enables Java developers to gain access to the native multimedia services available on a given device, thus enabling the development of multimedia applications across different phone manufacturers.

Applications such as this gesture recognition system, whose main focus is on pattern recognition, rely heavily on image acquisition and processing. If an image processing application is to be successful, the device must be able to execute heavy algorithms as well as acquire images at a significant acquisition rate. Factors like image processing and camera performance in J2ME should be properly analyzed as they carry possible implications for the application [Tierno and Campo, 2005].

Two smartphones were employed for system testing, namely Nokia's N80 and N95 models, which feature 3- and 5-megapixel cameras respectively. Nokia's N95 model boasts a Texas Instruments OMAP2420 processor whilst the N80 model features an OMAP1710. The OMAP2420 features an ARM1136 processor core clocked at 330 MHz whilst the N80's OMAP1710 features an ARM926TEJ clocked at 220 MHz. For further information please refer to [TI, 2008b] and [TI, 2008a].

4.1.1 Camera performance
MMAPI enables video snapshot acquisition, outputting immutable instances of the javax.microedition.lcdui.Image class. MMAPI also allows one to obtain the RGB pixel values from an image in order to proceed with data processing. The combination of these three procedures corresponds to the steps required to execute some useful computation over a given image, and as such the combined total time can be interpreted as representing the acquisition time [Tierno and Campo, 2005]. Figure 8 illustrates the average acquisition times obtained for Nokia's N80 and N95 models using MMAPI. For each resolution mode ten image processing iterations were performed (contemplating the three procedures mentioned above). A total of 120 tests were performed.

As it is possible to see, both models present relatively high acquisition times, even for resolutions as low as 160 × 120, making it impossible to meet real-time requirements. Also noteworthy is the fact that both models present an acquisition time drop for the 640 × 480 resolution, which is consistent with this resolution representing the standard operating mode for both cameras, with the remaining resolutions being obtained as rescales [Tierno and Campo, 2005].

It is also important to draw attention to the fact that, from an operational smoothness perspective, the N95 MMAPI implementation, regarding camera functionality, operated in a consistent manner throughout system development. The same could not be said for the N80 MMAPI implementation, which systematically and randomly raised exceptions, grinding system development to a halt.

Figure 8: Image acquisition times regarding image resolution.
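The three steps whose combined duration is interpreted as the acquisition time translate into the following MMAPI sketch; display-mode setup, exception handling and resolution negotiation are omitted, and on some devices additional VideoControl configuration is required before a snapshot can be taken. All names are illustrative.

// Sketch of the timed acquisition procedure: snapshot, Image creation
// and RGB extraction. width and height must match the snapshot size.
import javax.microedition.lcdui.Image;
import javax.microedition.media.Manager;
import javax.microedition.media.Player;
import javax.microedition.media.control.VideoControl;

public final class CameraAcquisition {

    public static int[] acquireRgb(int width, int height) throws Exception {
        Player player = Manager.createPlayer("capture://video");
        player.realize();
        VideoControl video = (VideoControl) player.getControl("VideoControl");
        // Some devices require video.initDisplayMode(...) at this point.
        player.start();

        byte[] raw = video.getSnapshot(null);                // step 1: snapshot
        Image image = Image.createImage(raw, 0, raw.length); // step 2: Image
        int[] rgb = new int[width * height];
        image.getRGB(rgb, 0, width, 0, 0, width, height);    // step 3: pixels

        player.close();
        return rgb;
    }
}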

4.1.2 Low-level operations performance
Regarding J2ME processing performance in the N80 and N95, we measured the processor's execution time for basic low-level operations to determine the overall speed and to identify the fastest and slowest operations. Each operation was executed in a loop of 10,000,000 iterations and the total execution time of the loop was measured. In order to counter the effects of loop specific instructions, such as increments and comparisons, the total execution time of an empty loop with the same number of iterations was measured. The duration of each operation corresponds to the time of the first loop minus that of the empty loop, averaged over the number of iterations. An average measure was chosen because the application was executed concurrently with other processes running on the smartphone's operating system. This procedure (sketched in code after the list below) was then applied to measure how long it took to:

• access an array;

• increment an integer variable;

• add two integer variables;

• use bit shift operators to multiply and divide integers by a power of 2;

• compare two integer variables (equal, less or equal, less).
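The measurement procedure can be sketched as follows for the increment operation; the empty loop's duration is subtracted to cancel loop overhead and the difference is averaged over the iteration count. Names are illustrative, and on aggressively optimizing VMs the loops would need to be guarded against being optimized away.

// Sketch of the low-level operation timing procedure: the duration of
// an empty loop is subtracted from that of the measured loop, and the
// difference is averaged over the number of iterations.
public final class OperationBenchmark {

    private static final int ITERATIONS = 10000000;

    /** Approximate cost, in nanoseconds, of one integer increment. */
    public static double timeIncrement() {
        int x = 0;

        long start = System.currentTimeMillis();
        for (int i = 0; i < ITERATIONS; i++) { }         // empty loop
        long emptyLoop = System.currentTimeMillis() - start;

        start = System.currentTimeMillis();
        for (int i = 0; i < ITERATIONS; i++) {
            x++;                                         // measured operation
        }
        long fullLoop = System.currentTimeMillis() - start;

        if (x < 0) System.out.println(x);  // keep x live for the optimizer

        // Per-iteration difference, converted from milliseconds to ns.
        return (fullLoop - emptyLoop) * 1.0e6 / ITERATIONS;
    }
}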

The results presented in Figure 9 reveal the times achieved for each operation. In both smartphone models, division is the slowest operation, followed respectively by array extractions, compare operators and finally arithmetic operations. Although the N80 and N95 feature different base ARM processors, and one expected to see better performance from the latter, it was still surprising to find that, regarding the less-or-equal operation, the N95 was more than three times slower than the N80. Regarding the remaining operations, a clear performance superiority of the N95 can be observed.


Figure 9: Operation times regarding low-level arithmetic operations.

4.1.3 System execution time performance
In order to assess the impact of each smartphone's computing power on the final gesture recognition system, a total of 50 hand posture classifications per smartphone was conducted. Hand posture classification was chosen as it conveys information regarding the execution times associated with the posture labeling of each input image in a fashion independent of acquisition time. This situation contrasts with the system's attempts at determining to which gesture a given set of postures belongs: in order to do so it is necessary to take into account the total time between acquiring an initial image and calculating the most probable gesture. By analyzing the time spent processing each image it becomes possible to exclude the acquisition time delay associated with each camera and thus provide a comparable measure.

The final system was deployed onto both smartphone models. Table 1 illustrates the average execution times per hand posture, as well as the respective frequencies allowed by those values, obtained for Nokia's N80 and N95 models.

                             Nokia N80   Nokia N95
Average Execution Time (ms)      4935        1036
Frequency (Hz)                   0.202       0.965

Table 1: Average execution times per hand posture and respective frequencies.

The results obtained demonstrate a clear superiority, regarding execution times, for the N95 smartphone. On average, Nokia's N80 model was over four times slower when compared with the N95. The N95 results allow for an input frequency of approximately one hand posture per second.

If one considers a real-world application of a gesture recognition system for smartphones with interacting users, this value of one hand posture per second seems like a reasonable one to expect. On the other hand, the N80 frequency of 0.202 Hz is considerably lower and would not allow for a convincing user experience. The combination of MMAPI stability alongside better performance results was the deciding factor that contributed to our final choice of Nokia's N95 model as the smartphone component for final system testing.

4.2 Profiles
Sun's WTK, which was used during system development, incorporates a profiler, enabling behavior measurement during program execution, focusing on the frequency and duration of function calls. It outputs a statistical summary of the events observed. It should be noted that different measurements along a timeline will result in different profile observations, as the contributions of different functions outweigh those of others.

Figures 10 and 11 illustrate the results obtained during the execution of this gesture recognition system, after one gesture was introduced and then again after ten gestures were presented to the system. As was to be expected, computationally intensive computer vision algorithms claim the greatest share of the execution time distribution, namely the Skin Detection and filter components. Together, these components start out representing over 75.34% of the execution time and end up with 91.34%. This situation was to be expected as both components represent pixel intensive operations that, besides analyzing every pixel of an input image, also conduct a neighborhood survey over filtered images. Clearly, in order to tackle the system's execution time it is necessary to address the performance issues surrounding these two components and to consider some optimization techniques.

Figure 10: Execution (1 Gesture).

Figure 11: Execution (10 Gestures).

4.3 System performance
4.3.1 Experimental setup
In order to test this gesture recognition system we proceeded with the experimental setup demonstrated by Figure 12, which illustrates the following factors:

• Figure 12(a) - Due to a lack of scaling mechanisms within this gesture recognition system, special care was taken with the distance between the smartphone and the hand gestures. A simple measuring tape was employed in order to ensure that the distance used in the experimental setup complied with that of the images employed for training set creation regarding hand postures;

• Figure 12(b) - The system was deployed onto Nokia's N95 smartphone and gestures were presented to the system at the previously established distance. For the duration of the tests the smartphone remained at a fixed position. Each time a posture was processed, an audio signal notified the user that a new posture could be introduced to the system;

• Figure 12(c) - Illustrates the capture view associated with the N95's built-in camera. The image depicts an office room under natural lighting conditions and with an abundance of background objects. These factors, alongside not being a simple black background, allow the image to be classified as a complex background.

(a) Experimental Setup. (b) Test Example. (c) Capture View.

Figure 12: Experimental Setup.

4.3.2 Results
Three gesture classes were used for system testing, labeled respectively as Gesture 1, Gesture 2 and Gesture 3. For each gesture class defined, 30 gesture instances were presented to the system. Given the three core gesture classes defined and the 30 instances per class, it becomes possible to deduce that a total of 90 gestures was employed during system testing. Figure 13 illustrates the precisions obtained for the three gestures defined; the following items should be pointed out:

• Gesture class three has a slight advantage over the remaining classes. This was to be expected as the HMM depicting this gesture presents a higher observation probability, respectively 25%, over the remaining HMMs, with respectively 6.25% and 12.5% for Gesture 1 and Gesture 2. In this case, a higher probability translates directly into a more likely class three classification, despite possible posture misclassifications. Accordingly, most of the gesture misclassifications observed were due to this higher probability, with the system erroneously classifying gestures as belonging to class three.

• The misclassified gestures were due to errors introduced by significant posture variations that had a direct impact on hand posture classification. Every time the system correctly classified the individual hand postures, the correct gesture would be returned. Otherwise the system would attempt to retrieve the most likely gesture.

• Correct gesture classifications strongly depended on correct individual posture classification. In this case, the number of correct posture hits is in accordance with the results presented regarding posture classification.

Figure 13: Final system precision results.

Considering the precision obtained for the three gestures, it becomes possible to obtain an average precision value of 83.3%.

5. RELATED WORK
This average precision value compares moderately well with other gesture recognition systems, especially if one considers the far simpler techniques employed in this work, namely:

• [Jota et al., 2006] - Obtains average recognition rates around 93% using two separate techniques. The first technique recognizes bare hands using their outer contours. The second approach is based on color marks present on each fingertip that allow for hand tracking and posture recognition;

• [Liang and Ouhyoung, 1998] - A large vocabulary sign language interpreter using HMMs was developed using a DataGlove to extract a set of features such as posture, position, orientation and motion. The average recognition rate sustained by the experimental results is 80.4%;

• [Triesch and von der Malsburg, 2002] - A system for classification of hand postures against complex backgrounds was developed employing elastic graph matching. The system obtains an average of 86.2% correct classifications.

6. CONCLUSIONS
The tests performed to assess smartphone performance focused on three dimensions, namely: camera, low-level arithmetic operations and system execution performance. Of the tests performed, the most penalizing factor was camera performance under J2ME, which exhibited extremely low capture rates. The remaining tests, which basically attempted to measure each smartphone's processing power, demonstrated that the latest smartphone generation is able to execute computationally expensive applications. Keep in mind that this gesture recognition system makes intensive use of image processing, posture and gesture databases, pattern recognition, graph algorithms and also probabilistic models.


7. REFERENCES
Baum, L. (1972). An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities, 3:1–8.

Baum, L. and Eagon, J. (1967). An inequality with applications to statistical estimation for probabilistic functions of a Markov process and to a model for ecology. Bull. Amer. Math. Soc., 73:360–363.

Baum, L., Petrie, T., Soules, G., and Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Stat., 41:164–171.

Baum, L. and Sell, G. (1968). Growth functions for transformations on manifolds. Pac. J. Math., 27:211–227.

Baum, L. and Petrie, T. (1966). Statistical inference for probabilistic functions of finite state Markov chains. Ann. Math. Stat., 37:1554–1563.

Bonato, V., Sanches, A. K., Fernandes, M. M., Cardoso, J. M., Simoes, E., and Marques, E. (2005). A real time gesture recognition system for mobile robots. In review.

Brand, J. and Mason, J. (2000). A comparative assessment of three approaches to pixel-level human skin-detection. Pattern Recognition, 2000. Proceedings. 15th International Conference on, 1:1056–1059.

Cormen, T., Leiserson, C., Rivest, R., and Stein, C. (2003). Introduction to Algorithms - Second Edition. The MIT Press.

Davis, J. and Shah, M. (Apr 1994). Visual gesture recognition. IEE Proceedings - Vision, Image and Signal Processing, 141(2):101–106.

Dugad, R. and Desai, U. (1996). A tutorial on hidden Markov models. Technical report, Indian Institute of Technology.

Gonzalez, R. C. and Woods, R. (1992). Digital Image Processing. Addison-Wesley.

Graham, R. (1972). An efficient algorithm for determining the convex hull of a finite planar set. Information Processing Letters, 1:132–133.

Jones, M. J. and Rehg, J. M. (2002). Statistical color models with application to skin detection. International Journal of Computer Vision, 46:81–96.

Jota, R., Ferreira, A., Cerejo, M., Santos, J., Fonseca, M. J., and Jorge, J. A. (2006). Recognizing hand gestures with CALI. EUROGRAPHICS, 1981:1–7.

Koopman, P. (1996). Embedded system design issues (the rest of the story). Computer Design: VLSI in Computers and Processors, 1996. ICCD '96. Proceedings., 1996 IEEE International Conference on, pages 310–317.

Kovac, J., Peer, P., and Solina, F. (2003). Human skin color clustering for face detection. EUROCON 2003. Computer as a Tool. The IEEE Region 8, 2:144–148.

Liang, R.-H. and Ouhyoung, M. (1998). A real-time continuous gesture recognition system for sign language. Automatic Face and Gesture Recognition, 1998. Proceedings. Third IEEE International Conference on, 1:558–567.

Lin, X. and Chen, S. (9-11 Apr 1991). Color image segmentation using modified HSI system for road following. Robotics and Automation, 1991. Proceedings., 1991 IEEE International Conference on, pages 1998–2003.

Mckenna, S., Gong, S., and Raja, Y. (1998). Modelling facial colour and identity with Gaussian mixtures. Pattern Recognition, 31(12):1883–1892.

Poynton, C. A. (1995). Frequently asked questions about colour.

Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77:257–286.

Shapiro, L. G. and Stockman, G. C. (2001). Computer Vision. Prentice Hall.

Sun (2008). Mobile Media API (MMAPI); JSR 135. http://java.sun.com/products/mmapi/. Last visited in July 2008.

TI (2008a). High-performance: OMAP1710. http://focus.ti.com/general/docs/wtbu/wtbuproductcontent.tsp?templateId=6123&navigationId=11991&contentId=4670. Last visited in July 2008.

TI (2008b). High-performance: OMAP2420. http://focus.ti.com/general/docs/wtbu/wtbuproductcontent.tsp?templateId=6123&navigationId=11990&contentId=4671. Last visited in July 2008.

Tierno, J. and Campo, C. (2005). Smart camera phones: Limits and applications. Pervasive Computing, IEEE, 4:84–87.

Triesch, J. and von der Malsburg, C. (2002). Robust classification of hand postures against complex backgrounds. Automatic Face and Gesture Recognition, 1996. Proceedings of the Second International Conference on, pages 170–175.

Vezhnevets, V., Sazonov, V., and Andreeva, A. (2003). A survey on pixel-based skin color detection techniques.

Yamaguchi, H. (Nov 1984). Efficient encoding of colored pictures in r, g, b components. IEEE Transactions on Communications, 32(11):1201–1209.

Yang, J. and Xu, Y. (1994). Hidden Markov model for gesture recognition. Technical report, Carnegie Mellon University, Robotics Institute.

Zarit, B., Super, B., and Quek, F. (1999). Comparison of five color models in skin pixel classification. Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems, 1999. Proceedings. International Workshop on, pages 58–63.

Zhang, C. and Wang, P. (2000). A new method of color image segmentation based on intensity and hue clustering. Pattern Recognition, 2000. Proceedings. 15th International Conference on, 3:613–616.