Human Face and Gesture Recognition - University of Ottawa
(transcript of petriu/unimi2008-part5-Human-FaceGest.pdf)

Page 1:

Interactive Virtual Environments

Human Face and Gesture Recognition

Emil M. Petriu, Dr. Eng., FIEEE
Professor, School of Information Technology and Engineering

University of Ottawa, Ottawa, ON, Canada
http://www.site.uottawa.ca/~petriu

May 2008

Page 2:

The paper discusses two body-language human-computer interaction modalities, hand gesture and facial expression, for intelligent space applications such as elderly care and smart homes.

Page 3:

• Qing Chen, “Real-Time Vision-Based Hand Tracking and Gesture Recognition,” Ph.D. Thesis, 2008.

• Marius D. Cordea, “A 3D Anthropometric Muscle-Based Active Appearance Model for Model-Based Video Coding," Ph.D. Thesis, 2007.

Page 4:

Hand gestures represent a powerful non-verbal, context-dependent human communication modality. The expressiveness of hand gestures can be exploited to achieve natural human-computer interaction in a smart habitat environment. We discuss vision-based hand tracking and gesture classification, focusing on tracking the bare hand and recognizing hand gestures without the help of markers or gloves.

HAND GESTURE RECOGNITION

Page 5:

Posture level: Viola & Jones algorithm

Gesture level: grammar-based analysis

Page 6:

• To detect the hand, the image is scanned by a sub-window containing Haar-like features.

• Based on each Haar-like feature f_j, a weak classifier h_j(x) is defined as:

h_j(x) = 1 if p_j f_j(x) < p_j θ_j, and h_j(x) = 0 otherwise,

where x is a sub-window, θ_j is a threshold, and p_j is a parity term indicating the direction of the inequality sign.
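The detection step above can be sketched with an integral image and a single two-rectangle feature; the helper names and the toy feature layout are illustrative, not the thesis implementation:

```python
import numpy as np

def integral_image(img):
    """Cumulative sums over rows and columns, so any rectangle
    sum costs at most four lookups."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of pixels in the rectangle [r0:r1, c0:c1) via the integral image."""
    total = ii[r1 - 1, c1 - 1]
    if r0 > 0:
        total -= ii[r0 - 1, c1 - 1]
    if c0 > 0:
        total -= ii[r1 - 1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]
    return total

def two_rect_feature(ii, r, c, h, w):
    """A horizontal two-rectangle Haar-like feature:
    (sum of left half) - (sum of right half)."""
    left = rect_sum(ii, r, c, r + h, c + w // 2)
    right = rect_sum(ii, r, c + w // 2, r + h, c + w)
    return left - right

def weak_classifier(f_val, theta, p):
    """h_j(x) = 1 if p * f_j(x) < p * theta, else 0."""
    return 1 if p * f_val < p * theta else 0
```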

Posture Recognition

Page 7:

• Four hand postures have been tested with the Viola & Jones algorithm.

• Input device: a low-cost Logitech QuickCam web camera with a resolution of 320 × 240 at 15 frames per second.

Page 8:

• Training sample collection:

– Negative samples: images that must not contain the object. We collected 500 random images as negative samples.

– Positive samples: hand posture images collected from human hands, or generated with a 3D hand model. For each posture, we collected around 450 positive samples. As an initial test, we used a white wall as the background.

Page 9:

Adaboost starts with a uniform distribution of “weights” over training examples. The weights tell the learning algorithm the importance of the example.

Obtain a weak classifier from the weak learning algorithm, hj(x).

Increase the weights on the training examples that were misclassified.

(Repeat)

At the end, carefully make a linear combination of the weak classifiers obtained at all iterations.

f_final(x) = α_final,1 h_1(x) + … + α_final,n h_n(x)
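The weight-update loop and the final linear combination can be sketched as a toy AdaBoost over decision stumps, a simplified stand-in for the full Haar-feature learner:

```python
import numpy as np

def adaboost_train(feature_vals, labels, n_rounds):
    """Toy AdaBoost over decision stumps on 1-D feature values.
    feature_vals: (n_features, n_samples); labels in {0, 1}."""
    n_feat, n = feature_vals.shape
    w = np.full(n, 1.0 / n)               # uniform initial weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        best = None
        # pick the (feature, threshold, parity) stump with lowest weighted error
        for j in range(n_feat):
            for theta in np.unique(feature_vals[j]):
                for p in (1, -1):
                    pred = (p * feature_vals[j] < p * theta).astype(int)
                    err = np.sum(w * (pred != labels))
                    if best is None or err < best[0]:
                        best = (err, j, theta, p, pred)
        err, j, theta, p, pred = best
        err = max(err, 1e-10)
        beta = err / (1.0 - err)
        alpha = np.log(1.0 / beta)        # weight of this weak classifier
        # shrink the weights of correct examples, i.e. raise the relative
        # weight of the misclassified ones, then renormalize
        w *= np.where(pred == labels, beta, 1.0)
        w /= w.sum()
        stumps.append((j, theta, p))
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, feature_vals):
    """f_final(x) = sum_k alpha_k h_k(x), thresholded at half the alpha mass."""
    score = sum(a * (p * feature_vals[j] < p * theta).astype(int)
                for (j, theta, p), a in zip(stumps, alphas))
    return (score >= 0.5 * sum(alphas)).astype(int)
```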

Page 10:

• After training with the AdaBoost learning algorithm, we obtain a cascade classifier for each hand posture once the required accuracy is achieved:

– “Two-finger” posture: 15-stage cascade classifier;
– “Palm” posture: 10-stage cascade classifier;
– “Fist” posture: 15-stage cascade classifier;
– “Little finger” posture: 14-stage cascade classifier.

• The performance of trained classifiers for 100 testing images:

Page 11:

• To recognize these different hand postures, a parallel structure that includes all of the cascade classifiers is implemented:
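A minimal sketch of such a parallel structure, assuming each cascade is a callable returning detection boxes (a hypothetical interface, not the actual detector API):

```python
def classify_posture(frame, cascades):
    """Run all posture cascades on the frame in a parallel structure and
    return the posture whose cascade fires with the largest detection.
    `cascades` maps posture name -> detector callable returning a list
    of (x, y, w, h) boxes; returns None when no cascade fires."""
    best_name, best_area = None, 0
    for name, detect in cascades.items():
        for (x, y, w, h) in detect(frame):
            if w * h > best_area:
                best_name, best_area = name, w * h
    return best_name
```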

Page 12:

Hand gestures constructed by different hand postures.

Page 13:

Stochastic context-free grammars (SCFG) are used to describe the structural information of the hand gestures. Each SCFG is a four-tuple:

G_s = (V_N, V_T, P_S, S)

where V_N and V_T are finite sets of non-terminals and terminals, S ∈ V_N is the start symbol, and P_S is a finite set of stochastic production rules of the form:

X →(P) λ

where X ∈ V_N, λ ∈ V⁺ (i.e., a non-empty string of non-terminals, terminals, or a combination of them), and P is the probability associated with the production rule. The SCFG that generates these gestures is defined by:

G_G = (V_NG, V_TG, P_G, S)

where V_NG = {S}, V_TG = {p, f, t, l}, and P_G consists of:

r1: S → pf, P = 40%
r2: S → tf, P = 35%
r3: S → lf, P = 25%

where the terminals p, f, t, l stand for the four postures: “palm”, “fist”, “two fingers” and “little finger”.
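Under the production rules above, scoring an observed posture string reduces to a rule lookup; a minimal sketch (the rule probabilities are those reconstructed from the slide):

```python
# S -> pf | tf | lf, with the probabilities of rules r1..r3
RULES = {"pf": 0.40, "tf": 0.35, "lf": 0.25}

def gesture_probability(posture_string):
    """Probability that the SCFG generates the observed posture string,
    e.g. 'pf' = palm followed by fist; 0.0 if no rule derives it."""
    return RULES.get(posture_string, 0.0)
```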

Page 14:

Interest in facial expression dates back to the mid-19th century, when Charles Darwin, the most influential theorist on the subject, wrote The Expression of the Emotions in Man and Animals. Later, two sign-communication psychologists, Ekman and Friesen, developed the anatomically oriented Facial Action Coding System (FACS) based on numerous experiments with facial muscles. They defined the Action Unit (AU) as a basic visual facial movement that cannot be decomposed into smaller units. The distinguishable expression space is reduced to a comprehensive system that can distinguish all visually discernible facial expressions using only 46 AUs. Complex facial expressions can be obtained by combining different AUs.

FACIAL EXPRESSION RECOGNITION

Page 15:

Model-Based Face Tracking and Expression Recognition

3D Face Modeling

• Modeling and animating realistic faces requires knowledge of anatomy
  – Anthropometric (external) representation
    • Measurements of living subjects
    • Statistics based on age, health, etc.
  – Muscle/Skin (internal) representation
    • Over 200 facial muscles
    • Over 14,000 possible expressions

Page 16:

3D generic face deformed using muscle-based control

Page 17:

Facial expressions are described using the Facial Action Coding System, which allows the movements of specific facial muscles to be controlled.

Page 18:

The skin color distribution of people with different skin colors forms a compact cluster with a regular shape in the rg- (or HS-) chromatic color space. This suggests modeling human faces as a Mixture of Gaussians (MoG) distribution in the 2D normalized color space.
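A sketch of the MoG skin-likelihood evaluation in normalized rg space; the mixture parameters below are illustrative placeholders, not the trained values:

```python
import numpy as np

def to_rg(pixel):
    """Normalized rg-chromaticity: r = R/(R+G+B), g = G/(R+G+B)."""
    r, g, b = (float(v) for v in pixel)
    s = r + g + b
    if s == 0:
        return 0.0, 0.0
    return r / s, g / s

def gaussian2(x, mean, cov):
    """2-D Gaussian density."""
    d = np.asarray(x) - np.asarray(mean)
    inv = np.linalg.inv(cov)
    norm = 2 * np.pi * np.sqrt(np.linalg.det(cov))
    return float(np.exp(-0.5 * d @ inv @ d) / norm)

def skin_likelihood(pixel, components):
    """Mixture of Gaussians in rg space: sum_k w_k N(x; mu_k, C_k).
    `components` is a list of (weight, mean, cov) triples."""
    x = to_rg(pixel)
    return sum(w * gaussian2(x, mu, c) for w, mu, c in components)

# Illustrative 2-component mixture (assumed parameters, not from the paper)
MOG = [(0.6, (0.45, 0.32), np.diag([0.01, 0.01])),
       (0.4, (0.50, 0.30), np.diag([0.02, 0.01]))]
```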

Real-time tracking of 2½D head parameters: position and orientation

Page 19:

Once the head is detected, an elliptical outline is fitted to the head contour. Every time a new image becomes available, the tracker tries to fit the ellipse model from the previous image in such a way as to best approximate the position of the head in the new image. Essentially, tracking consists of updating the ellipse state to provide the best model match for the head in the new image. The state is updated by a hypothesize-and-test procedure in which the goodness of the match depends on the intensity gradients around the object’s boundary and the color of the object’s interior.

Page 20:

Linear Kalman Filter (LKF) for Head Tracking

The measurement values obtained by tracking are naturally corrupted by noise, resulting in unstable tracking behaviour: the face ellipse is jumpy and easily loses the locked target. Localization errors in the face tracking propagate to the recovered pose parameters. When used for synthesis, applying these pose computations to a 3D head model results in jerky movements of the animated head. To overcome this inconvenience, we use an optimal discrete LKF to process the measurements of the tracking parameters for each frame.

The continuous linear imaging process is sampled at discrete time intervals by grabbing images at a constant time interval. These images are then sequentially analyzed using a LKF to determine the motion trajectory of the face within a determined error range.

The LKF is a recursive procedure that consists of two stages: time update (prediction) and measurement update (correction). At each iteration, the filter provides an optimal estimate of the current state using the current input measurement, and produces an estimate of the future state using the underlying state model. The values we want to smooth and predict independently are the tracker state parameters. The tracker employs the LKF as a recursive motion-prediction tool for the recovery of the 2½D head pose parameters.
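The two-stage predict/correct cycle can be sketched for a single tracker parameter with a constant-velocity state; the noise levels are illustrative, not the tuned values:

```python
import numpy as np

def lkf_step(x, P, z, dt=1.0, q=1e-3, r=1e-2):
    """One predict/correct cycle of a linear Kalman filter with a
    constant-velocity state [position, velocity] and a position-only
    measurement z."""
    F = np.array([[1.0, dt], [0.0, 1.0]])   # state transition
    H = np.array([[1.0, 0.0]])              # we only measure position
    Q = q * np.eye(2)
    R = np.array([[r]])
    # time update (prediction)
    x = F @ x
    P = F @ P @ F.T + Q
    # measurement update (correction)
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)          # Kalman gain
    x = x + K @ (np.array([z]) - H @ x)
    P = (np.eye(2) - K @ H) @ P
    return x, P

# Smooth a noisy ellipse-centre coordinate over a few frames
x, P = np.zeros(2), np.eye(2)
for z in [10.0, 10.2, 9.9, 10.1]:
    x, P = lkf_step(x, P, z)
```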

Page 21:

Tracking 3D Head Motions

The general problem of recovering 3D position parameters from 2D images can be solved using different 2D views of the 3D objects. If these 2D images are taken at the same time, the problem is solved by stereovision. Another approach, using monocular 2D images of moving objects, is known as Structure-From-Motion (SFM).

Given 2D object images, the SFM problem aims to recover:
• the 3D object coordinates
• the relative 3D camera-object motion
• the camera geometry (camera calibration)

The SFM problem assumes no prior knowledge of the 3D model, the motion, or the camera calibration; it aims to recover these 3D parameters from 2D observations over a sequence of images. The SFM framework consists of two main modules:

Page 22:

Due to the perspective camera model, SFM is a nonlinear problem.

The SFM system recursively recovers the 3D structure, 3D motion, and perspective camera geometry from feature correspondences over a sequence of 2D images. To speed up the calculations, we use a motion model that simplifies the Jacobian.

An Extended Kalman Filter (EKF) is used to solve the SFM problem, resulting in an accurate, stable, real-time solution. The EKF takes into consideration the non-linear aspect of the mapping. We use a perspective camera model to reflect the mapping between the 3D world and its projection.

The SFM tracking loop, per grabbed frame n:

1. Initialization process.
2. Grab frame n.
3. 2D-track each feature: (u_i, v_i)_est ⇒ (u_i, v_i)_real.
4. 2D tracker estimate for the next frame: (u_i, v_i)_real ⇒ (u′_i, v′_i)_est+.
5. EKF estimate for the next frame, and SFM output: (u_i, v_i)_real ⇒ (u″_i, v″_i)_est+, with state x = (t_x, t_y, t_z, ṫ_x, ṫ_y, ṫ_z, f, ω_x, ω_y, ω_z, ω̇_x, ω̇_y, ω̇_z, Z_i).
6. Select the best estimate for the next frame n+: (u_i, v_i)_est+.
7. Calibrate the camera; translate and rotate the model (modify the pose); adjust the structure.
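The EKF measurement update at the heart of this loop can be sketched with a numerical Jacobian and a drastically simplified projection model (translation-only state, one feature point; the full 14-parameter state above is omitted, so this is a sketch, not the thesis implementation):

```python
import numpy as np

def project(state, X, Y):
    """Perspective projection of a model point (X, Y) at depth Z after
    translation (tx, ty, tz), with focal length f: a simplified
    stand-in for the full SFM measurement model."""
    tx, ty, tz, f, Z = state
    return np.array([f * (X + tx) / (Z + tz), f * (Y + ty) / (Z + tz)])

def num_jacobian(h, x, eps=1e-6):
    """Numerical Jacobian of measurement function h at state x."""
    y0 = h(x)
    J = np.zeros((y0.size, x.size))
    for i in range(x.size):
        xp = x.copy()
        xp[i] += eps
        J[:, i] = (h(xp) - y0) / eps
    return J

def ekf_update(x, P, z, h, R):
    """EKF measurement update: linearize h around the current state."""
    H = num_jacobian(h, x)
    y = z - h(x)                      # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    return x + K @ y, (np.eye(x.size) - K @ H) @ P
```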

Page 23:

Real-time EKF tracking

Page 24:

Face and lip animation using model-based audio and video coding

Page 25:

The parameters of the lip contour model are: x_o, y_o = origin of the outside parabolas; x_i, y_i = origin of the inside parabolas; B_o = outer height; B_i = inner height; A_o = outer width; A_i = inner width; D = depth of the ‘dip’; C = width of the ‘dip’; E = offset height of the cosine function; torder_o = top outside parabola order; border_o = bottom outside parabola order; order_i = inside parabola order (same on both top and bottom).

The lip contour model used in the mapping: the only parameters of the lip model associated with the cepstral coefficients are the outer width A_o and the outer height B_o. Relations can be found linking the parameter values of the inner contour of the lip model to those of the outer contour; therefore, estimating the inner contour values from the audio signal would be redundant.
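One plausible reading of the parabola-based outer contour, with the width A_o and height B_o as the driving parameters; the exact curve equations are not reproduced in this transcript, so the form below is an assumption, not the thesis model:

```python
def outer_lip_contour(Ao, Bo, xo=0.0, yo=0.0, tordero=2, bordero=2, n=21):
    """Sample the outer lip contour as a top and a bottom curve of the
    assumed form y = yo +/- Bo * (1 - |x - xo| / Ao) ** order, so the
    curves meet at the mouth corners x = xo +/- Ao."""
    pts_top, pts_bot = [], []
    for k in range(n):
        x = xo - Ao + 2.0 * Ao * k / (n - 1)
        u = 1.0 - abs(x - xo) / Ao          # 1 at the centre, 0 at the corners
        pts_top.append((x, yo + Bo * u ** tordero))
        pts_bot.append((x, yo - Bo * u ** bordero))
    return pts_top, pts_bot
```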

Page 26:

Examples of the lip model being molded to the shape of the speaker’s lips.

Comparing the speech-driven and the real lip shape for a female speaker saying the ten digits in French: zéro, un, deux, …, neuf.

Page 27:

• Simulate the dynamics of real muscles
• Facial Action Coding System (FACS)
  – Facial articulation as Expression Action Units (EAUs)
  – 7 pairs of muscles + “Jaw Drop” = Expression Space

Model-Based Facial Expression Recognition

Page 28:

3D Anthropometric Muscle-Based Active Appearance Model (AMB AAM)

• Use a 3D generic face model
• Muscle “contractions” control mesh deformation in the “Anthropometric-Expression (AE)” space
• Texture intensities are warped into the geometry of the shape
  – Shape: apply PCA in the AE space
  – Appearance: apply PCA in the texture space
• The model is defined by rigid motion (rotation, translation) and non-rigid motion (AE)
• Model instances are synthesized from the AE space
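The PCA step applied in both the AE (shape) and texture spaces can be sketched via SVD; the function names are illustrative:

```python
import numpy as np

def pca(data, n_components):
    """PCA via SVD: rows are samples (shape or texture vectors).
    Returns the mean, the top principal axes, and per-sample scores."""
    mean = data.mean(axis=0)
    centered = data - mean
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    axes = Vt[:n_components]            # principal directions
    scores = centered @ axes.T          # low-dimensional coordinates
    return mean, axes, scores

def reconstruct(mean, axes, scores):
    """Synthesize a model instance from its low-dimensional parameters."""
    return mean + scores @ axes
```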

Page 29:

Facial Image Segmentation

• Search examples:
  – 2D-AAM (second row)
  – 3D-AMB-AAM (third row)

Page 30:

Facial Expression Recognition

• Person Dependent

• Person Independent

Page 31:

Tracker Implementation

Page 32:

• Combined Tracking System
  – Feature-Motion-Based: recovers rigid motion using an Extended Kalman Filter (EKF)
  – Statistical-Model-Based: recovers non-rigid motion using the 3D-AMB-AAM, an Analysis-by-Synthesis (ABS) technique, and the EKF

• Advantages of using the 3D-AMB-AAM:
  – Self-occlusion prediction
  – Feature position control
  – Texture update
  – 3D structure constraint
  – Expression-structure independence

• Advantages of using the EKF:
  – Recursive algorithm for recovering 3D structure and/or motion from a monocular sequence (SFM)
  – Models the non-linear mapping between the 3D world and its projection
  – Accepts measurements recursively and models observation uncertainties
  – Helps fuse the information from the two trackers
  – Delivers an accurate, stable, real-time solution
  – Predicts global and non-rigid motion → reduces the active model search space

Page 33:

FURTHER DEVELOPMENTS

Hand gestures and facial expressions are powerful non-verbal context-dependent human-to-human communication modalities.

While understanding them may come naturally to humans, describing them in an unambiguous algorithmic way is not an easy task. We will use Fuzzy Neural Networks and Fuzzy Cognitive Maps to develop an expert system that captures the collective wisdom of human experts on the best procedures for extracting semantic information from multimodal gesture data streams. To reduce the ambiguity of hand gesture and facial expression meanings, they will be considered in conjunction with the other contextual and communication modalities, using a Hidden Markov Model and a Bayesian network of naïve classifiers for feature-based analysis to discard false gesture primitives.

Page 34:

Ottawa “U” Research Group – Relevant Graduate Theses

Q. Chen, “Real-Time Vision-Based Hand Tracking and Gesture Recognition,” Ph.D. Thesis, 2008.

A. El-Sawah, “Towards Context-Aware Gesture Enabled User Interfaces,” Ph.D. Thesis, 2008.

M.D. Cordea, “A 3D Anthropometric Muscle-Based Active Appearance Model for Model-Based Video Coding,” Ph.D. Thesis, 2007.

M. Bondy, “Voice Stream Based Lip Animation for Audio-Video Communication,” M.A.Sc. Thesis, 2001.

M.D. Cordea, “Real Time 3D Head Pose Recovery for Model Based Video Coding,” M.A.Sc. Thesis, 1999.

Page 35:

Ottawa “U” Research Group – Publications in Human Face and Hand Gesture Recognition

M.D. Cordea, E.M. Petriu, “A 3-D Anthropometric-Muscle-Based Active Appearance Model,” IEEE Trans. Instrum. Meas., Vol. 55, No. 1, pp. 91-98, 2006.

M.D. Cordea, D.C. Petriu, E.M. Petriu, N.D. Georganas, T.E. Whalen, “3-D Head Pose Recovery for Interactive Virtual Reality Avatars,” IEEE Trans. Instrum. Meas., Vol. 51, No. 4, pp. 640-644, 2002.

M.D. Cordea, E.M. Petriu, N.D. Georganas, D.C. Petriu, T.E. Whalen, “Real-Time 2½D Head Pose Recovery for Model-Based Video-Coding,” IEEE Trans. Instrum. Meas., Vol. 50, No. 4, pp. 1007-1013, 2001.

Q. Chen, M.D. Cordea, E.M. Petriu, I. Rudas, A. Varkonyi-Koczy, T.E. Whalen, “Hand-Gesture and Facial-Expression Human-Computer Interfaces for Intelligent Space Applications,” Proc. MeMeA 2008 – IEEE Int. Workshop on Medical Measurements and Applications, Ottawa, ON, Canada, May 2008.

Q. Chen, E.M. Petriu, N.D. Georganas, “3D Hand Tracking and Motion Analysis with a Combination Approach of Statistical and Syntactic Analysis,” Proc. HAVE 2006 – IEEE Int. Workshop on Haptic, Audio and Visual Environments and their Applications, pp. 56-61, Ottawa, ON, Canada, Oct. 2007.

A. El-Sawah, C. Joslin, N.D. Georganas, E.M. Petriu, “A Framework for 3D Hand Tracking and Gesture Recognition using Elements of Genetic Programming,” Proc. VideoRec'07: Int. Workshop on Video Processing and Recognition, pp. 495-502, Montreal, QC, Canada, May 2007.

Page 36:


A. El-Sawah, N.D. Georganas, E.M. Petriu, “Calibration and Error Model Analysis of 3D Monocular Vision Model Based Hand Posture Estimation,” (6 pages), Proc. IMTC/2007, IEEE Instrum. Meas. Technol. Conf., Warsaw, Poland, May 2007.

Q. Chen, N.D. Georganas, E.M. Petriu, “Real-Time Vision-Based Hand Gesture Recognition with Haar-like Features and Grammars,” (6 pages), Proc. IMTC/2007, IEEE Instrum. Meas. Technol. Conf., Warsaw, Poland, May 2007.

M. D. Bondy, E. M. Petriu, M. D. Cordea, N. D. Georganas, D. C. Petriu, T. E. Whalen, “Model-based Face and Lip Animation for Interactive Virtual Reality Applications”, Proc. ACM Multimedia 2001, pp. 559-563, Ottawa, ON, Sept. 2001.