
A Unified Probabilistic Framework For Measuring The Intensity of Spontaneous Facial Action Units

Yongqiang Li 1, S. Mohammad Mavadati 2, Mohammad H. Mahoor, and Qiang Ji

Abstract— Automatic facial expression analysis has received great attention in both academia and industry in the past two decades. The Facial Action Coding System, which describes all possible facial expressions in terms of a set of anatomical facial muscle movements called Action Units (AUs), is the most widely used descriptive approach for analyzing facial expressions. The majority of existing studies on facial expression recognition focus on facial action unit detection or basic facial expression recognition, and very few works investigate measuring the intensity of spontaneous facial actions. In addition, these works measure the intensity of facial actions statically and individually, ignoring the dependencies among AUs as well as the temporal information, which is crucial for analyzing spontaneous expressions. To overcome this problem, this paper proposes a framework based on a Dynamic Bayesian Network (DBN) to systematically model such relationships among spontaneous AUs for measuring their intensities. Our experimental results show improvement over image-driven methods alone in AU intensity measurement.

I. INTRODUCTION

Facial expression is one of the common nonverbal communication skills that humans use in their daily social communications and interactions. In the past two decades, many researchers in the field of computer vision and pattern recognition have been drawn to developing computer-based techniques that automatically recognize facial expressions in visual data. This is motivated by a range of applications, such as developmental psychology, emotive social robots, and intelligent Human-Computer Interaction (HCI) design [1], [2].

In order to describe and analyze facial expressions, several coding systems have been developed by psychologists. The Facial Action Coding System (FACS), originally developed by Ekman in the 1970s, is one of the most comprehensive coding systems in the behavioral sciences [3]. FACS describes all possible facial expressions in terms of a set of anatomical facial muscle movements, called Action Units (AUs). For instance, AU12 (lip corner puller) codes contractions of the Zygomaticus major muscle and AU6 (cheek raiser) codes contractions of the Orbicularis oculi muscle [3].

Traditionally, for facial expression analysis and AU intensity measurement, expert FACS coders manually codify images or video frames. However, this is a very labor-intensive

1,2 These authors contributed equally to this work.

Yongqiang Li and Qiang Ji are with Rensselaer Polytechnic Institute, Department of Electrical, Computer, and Systems Engineering {liy23,jiq}@rpi.edu

S. Mohammad Mavadati and Mohammad H. Mahoor are with University of Denver, Department of Electrical and Computer Engineering {smavadat,mmahoor}@du.edu

Fig. 1. Relation between the scale of evidence and intensity scores for facial action units [3].

and time-consuming task. The literature shows that computer algorithms can help scientists recognize and measure facial expressions automatically [4], [5]. Although automatic facial expression measurement has been utilized to distinguish between posed and spontaneously occurring smiles [6] and to categorize pain-related facial expressions [7], many areas still lack comprehensive studies.

In real face-to-face communication, we deal with spontaneous facial expressions. Posed facial expressions and action units are those created by asking subjects to deliberately make specific facial actions or expressions. Spontaneous facial expressions and action units, on the other hand, are representative of facial expressions in daily life. They typically occur in uncontrolled conditions and are combined with head pose variation, head movement and often more complex facial action units. Most of the developed systems for facial expression and action unit classification are evaluated on posed expression data [23]. One reason is that the majority of available databases focus on posed facial expressions, and very few databases are available for studying spontaneous facial expressions [7], [8], [9], [30].

For automatic facial expression recognition, there are some valuable databases that contain either the six basic facial expressions (i.e., anger, surprise, fear, sadness, joy and disgust) or combinations of AUs, among which the Cohn-Kanade database [10], the MMI database [11] and the Bosphorus database [12] are AU-coded face databases that are publicly available for research. Recently a new database, called Denver Intensity of Spontaneous Facial Action (DISFA), has been published [9], [30], which contains the intensity of 12 action units.

To measure the intensity of action units, the FACS manual [3] defines five ordinal scales (i.e., scales A through E, which respectively indicate barely visible to maximum intensity of each AU). The general relationship between the scale of evidence and the A-B-C-D-E intensity scoring is illustrated in Fig. 1. Generally, the A level refers to a trace of the action; B, slight evidence; C, marked or pronounced; D, severe or extreme; and E, maximum evidence. For example, we use “AU12B” to indicate AU12 with a B


intensity level. In this study we utilized the DISFA database for measuring the intensity of spontaneous facial action units.
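For reference, here is a minimal sketch of the mapping between DISFA's 0-5 numeric intensity codes (used later in the paper) and these FACS letter scores; the helper name is ours, not part of the original work.

```python
# Minimal sketch: mapping DISFA's 0-5 ordinal intensity codes to FACS letter scores.
# The label strings (e.g. "AU12B") follow the convention described above.

FACS_LETTERS = {0: "", 1: "A", 2: "B", 3: "C", 4: "D", 5: "E"}

def au_label(au_number: int, intensity: int) -> str:
    """Return e.g. 'AU12B' for AU12 at intensity level 2; 'AU12 absent' for level 0."""
    if intensity == 0:
        return f"AU{au_number} absent"
    return f"AU{au_number}{FACS_LETTERS[intensity]}"

print(au_label(12, 2))  # -> AU12B
```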

Analyzing spontaneous facial expressions is a challenging task and currently there are very few studies available in this area. Bartlett et al. [13] attempted to measure the intensity of action units in posed and spontaneous facial expressions using Gabor wavelets and support vector machines. They reported average correlation values of 0.3 and 0.63 between a human coder and the predicted intensity of AUs for spontaneous and posed expressions, respectively. These results demonstrate that measuring the intensity of spontaneous expressions is more challenging than measuring the intensity of posed expressions. In another study on spontaneous facial expression measurement [5], the authors used AAM features in conjunction with SVM classifiers to automatically measure the intensity of AU6 and AU12 in videos captured from infant-mother interactions. In [14], histogram of oriented gradient and Gabor features were utilized for detecting spontaneous action units using K-nearest neighbor and SVM classifiers.

In the majority of studies in the area of facial expression analysis, the focus is mostly on facial action unit detection or basic facial expression recognition, and there are very few works that measure the intensity of facial actions [5], [15]. To the best of the authors' knowledge, most of the current studies, including [5], [15], measure the intensity of facial actions statically and individually, ignoring the dependencies among multilevel AU intensities as well as the temporal information, which are crucial for analyzing spontaneous expressions. Tong et al. [23] employed a Dynamic Bayesian Network (DBN) to model the dependencies among AUs, and achieved improvement over image-driven methods alone, especially for recognizing AUs that are difficult to detect but have strong relationships with other AUs. However, [23] focuses on AU detection in posed expressions.

Following the idea in [23], in this paper we introduce a framework based on a DBN to systematically model the relationships among different intensity levels of AUs, in order to measure the intensity of spontaneous facial actions. The proposed probabilistic framework is capable of recognizing multilevel AU intensities in spontaneous facial expressions.

II. OVERVIEW OF OUR APPROACH

The focus of this paper is to develop a framework to measure the intensity of spontaneous facial action units, from the absence of an AU to its maximum intensity level. Fig. 2 gives the flowchart of our proposed system, which consists of an offline training phase (Fig. 2(a)) and an online testing phase (Fig. 2(b)). The training phase includes training multi-class Support Vector Machine (SVM) classifiers and learning the DBN to capture the semantic and dynamic relationships among AUs. Advanced learning techniques are applied to learn both the structure and parameters of the DBN based on both training data and domain knowledge. The online AU intensity recognition phase consists of two independent


Fig. 2. The flowchart of the proposed system. (a) Training process. (b) Testing process.

but collaborative components: AU observation extraction by SVM classification, and DBN inference. For observation extraction, we employ HOG and Gabor features describing local appearance changes of the face, followed by SVM classifiers. Given the AU observations, we estimate the intensity of facial action units through probabilistic inference with the DBN model. In this way, we can further incorporate the dependencies among multilevel AU intensities, as well as the temporal information. The remainder of the paper is organized as follows. Sec. III describes the AU observation extraction method. In Sec. IV, we build the DBN model for AU intensity recognition, including BN model structure learning (Sec. IV-A), DBN model parameter learning (Sec. IV-C) and DBN inference (Sec. IV-D). Sec. V presents our experimental results and discussion and Sec. VI concludes the paper.
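As a summary of the testing pipeline above, here is a minimal sketch in which feature extraction, the per-AU SVMs and the DBN inference step are passed in as callables; all names here are illustrative glue, not the authors' code.

```python
# A sketch of the two-stage testing pipeline: per-frame SVM observations, then
# joint DBN inference over the sequence (all callables are assumed interfaces).
from typing import Callable, Dict, List, Sequence

def measure_au_intensities(
    frames: Sequence,
    extract_features: Callable,        # e.g. HOG/Gabor + manifold projection
    svm_models: Dict[str, Callable],   # per-AU classifier: features -> intensity observation
    dbn_infer: Callable,               # fuses AU dependencies and temporal dynamics
) -> List[Dict[str, int]]:
    """Per frame: features -> per-AU SVM observations; then DBN inference over the sequence."""
    observations = []
    for frame in frames:
        feats = extract_features(frame)
        observations.append({au: clf(feats) for au, clf in svm_models.items()})
    return dbn_infer(observations)
```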

III. AU INTENSITY OBSERVATION EXTRACTION

This section describes our proposed AU intensity observation extraction method, which consists of several components: face registration (Sec. III-A), facial image representation (Sec. III-B), dimensionality reduction (Sec. III-C) and classification (Sec. III-D). The flowchart of our AU intensity observation extraction method is shown in Fig. 3.

A. Face Registration

Image registration is a systematic way of aligning two images of the same object (i.e., a reference and a sensed image) that are taken at different times, from different viewpoints, or with different sensors. In order to register two images efficiently, a set of points, called control points or landmark points, is often used to represent the object in both images. In our study, we used the 66 landmark points of the DISFA database (i.e., points representing the mouth boundary, corners of the eyes, tip of the nose, face boundary, etc.) to represent the locations of important facial components [9]. The reference landmark points were obtained by averaging the 66 landmark points over the whole training set. A 2D similarity transformation was calculated between the reference points and the target points. Afterwards, we utilized the corresponding points, the


Fig. 3. The flowchart of AU intensity observation extraction method.

calculated transformation function and the bilinear interpolation technique, to transform the new image into the reference coordinate system.
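A minimal registration sketch under these assumptions (scikit-image, landmarks in (x, y) order, reference shape equal to the mean landmark configuration); this is not the authors' exact implementation.

```python
# A minimal registration sketch (not the authors' exact code), using scikit-image.
# Assumptions: `landmarks` and `ref_landmarks` are (66, 2) arrays in (x, y) order.
import numpy as np
from skimage.transform import SimilarityTransform, warp

def register_face(image, landmarks, ref_landmarks, out_shape=(108, 128)):
    """Warp `image` so that its landmarks align with the reference landmarks,
    using a 2D similarity transform and bilinear interpolation (order=1)."""
    tform = SimilarityTransform()
    tform.estimate(src=landmarks, dst=ref_landmarks)   # maps image coords -> reference coords
    # warp() needs the map from output (reference) coords back to input (image) coords
    return warp(image, inverse_map=tform.inverse,
                output_shape=out_shape,                # (rows, cols); orientation assumed
                order=1)
```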

B. Facial Image Representation

After registering the facial images, we utilized two well-known feature extraction techniques that are capable of representing the appearance information. These features are the Histogram of Oriented Gradient (HOG) and localized Gabor features, which are described below.

1) Histogram of Oriented Gradient: The histogram of oriented gradient was first introduced by Dalal and Triggs for the application of human detection [16]. HOG is a descriptor that counts the occurrences of gradient orientations in localized portions of an image and can efficiently describe the local shape and appearance of an object. To represent the spatial information of an object, the image is divided into small cells and the histogram of gradients is calculated for each cell (for more information about gradient filters and the number of histogram bins, interested readers are referred to [16]).

In our experiment, for every image (the image size is 108×128 pixels), we used cells of 18×16 pixels, so that in total 48 cells are constructed from each image. We applied the horizontal gradient filter [−1 0 1] with 59 orientation bins. To construct the HOG feature vector, the HOG representations of all the cells were stacked together, yielding a HOG feature vector of size 2832 (48× 59).
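The configuration above can be approximated with scikit-image's HOG implementation as follows; this is a sketch, and the authors' own gradient and binning code may differ in detail.

```python
# A sketch of the HOG configuration described above, using scikit-image.
import numpy as np
from skimage.feature import hog

def hog_features(registered_face):
    """registered_face: 108x128 grayscale array -> 2832-dim HOG vector
    (48 cells of 18x16 pixels, 59 orientation bins, [-1 0 1] gradients)."""
    return hog(
        registered_face,
        orientations=59,
        pixels_per_cell=(18, 16),   # rows x cols -> 6 x 8 = 48 cells
        cells_per_block=(1, 1),     # per-cell histograms, stacked
        feature_vector=True,
    )

# Example: hog_features(np.zeros((108, 128))).shape == (2832,)
```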

2) Localized Gabor Features: The Gabor wavelet is another well-known technique for representing the texture information of an object. A Gabor filter is defined as a Gaussian kernel modulated by a sinusoidal plane wave. Gabor features have a powerful capability for representing facial textures and have been used for different applications including facial expression recognition [17]. In our experiment, to efficiently extract both texture and shape information of facial images, 40 Gabor filters (i.e., 5 scales and 8 orientations) were applied to regions defined around each of the 66 landmark points and, as a result, 2640 Gabor features were extracted.
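A sketch of such localized Gabor features follows; the filter frequencies and the choice to sample the response magnitude exactly at each landmark point are our assumptions, not specified in the paper.

```python
# A sketch of localized Gabor features (assumed filter bank; not the authors' exact one).
import numpy as np
from scipy.ndimage import convolve
from skimage.filters import gabor_kernel

def localized_gabor_features(face, landmarks, frequencies=(0.05, 0.1, 0.2, 0.3, 0.4)):
    """Apply a 5-scale x 8-orientation Gabor bank and sample the response magnitude
    at each of the 66 landmarks -> 66 * 40 = 2640 features."""
    feats = []
    for f in frequencies:                        # 5 scales (frequencies assumed)
        for theta in np.arange(8) * np.pi / 8:   # 8 orientations
            kernel = gabor_kernel(f, theta=theta)
            real = convolve(face, np.real(kernel), mode='nearest')
            imag = convolve(face, np.imag(kernel), mode='nearest')
            mag = np.hypot(real, imag)
            rows = np.clip(landmarks[:, 1].astype(int), 0, face.shape[0] - 1)
            cols = np.clip(landmarks[:, 0].astype(int), 0, face.shape[1] - 1)
            feats.append(mag[rows, cols])        # one value per landmark
    return np.concatenate(feats)                 # shape (2640,)
```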

C. Dimensionality Reduction: Manifold Learning

In many real-world applications in machine learning and pattern classification, high-dimensional features make analyzing the samples more complicated. In this regard, a number of algorithms, such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and manifold learning, have been proposed to reduce the dimensionality of features [18]. Among these, manifold learning is a nonlinear technique which assumes that the data points are sampled from a low dimensional manifold embedded in a high dimensional space. Mathematically speaking, given a set of points $x_1, \ldots, x_n \in \mathbb{R}^D$, find a set of points $y_1, \ldots, y_n \in \mathbb{R}^d$ ($d \ll D$) such that $y_i$ represents $x_i$ efficiently.

Currently, there are several different manifold learning techniques (e.g., ISOMAP, Local Linear Embedding (LLE), Laplacian Eigenmap, etc.), which have the following three steps in common: 1) build a nearest neighbor graph over the entire set of sample points; 2) linearly approximate the local manifold geometry within each neighborhood; 3) define and minimize a cost function so as to obtain the best low dimensional representation. The key assumption in every manifold learning algorithm is that the manifold is smooth in the neighborhood of each sample point [19]. Several studies [20], [5] show that manifold learning techniques outperform linear techniques (e.g., PCA) in reducing the dimensionality of data such as facial expressions and human actions. In this paper we utilized the Laplacian Eigenmap technique to extract low dimensional features of facial images.

The Laplacian Eigenmap algorithm [21] was originally introduced by M. Belkin and P. Niyogi in 2001. In this algorithm, after finding the K-nearest neighbors of each sample point, nodes $i$ and $j$ are connected if $x_i$ is among the k-nearest neighbors of $x_j$; otherwise they are disconnected. One approach to weighting the connected neighbor samples is the heat kernel $W_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{t}\right)$. The cost function for the Laplacian Eigenmap is $\sum_{ij} W_{ij}\|y_i - y_j\|^2 = \mathrm{tr}(Y^T L Y)$, which aims to map points that are close in the high dimensional space to points that are close in the low dimensional one. The generalized eigenvector problem $Lf_l = \lambda_l D f_l$ is then solved, where $D$ is the diagonal weight matrix with $D_{ii} = \sum_j W_{ij}$ and $L = D - W$ is the symmetric, positive semidefinite Laplacian matrix. Assuming $f_1, \ldots, f_d$ are the eigenvectors corresponding to the $d$ smallest nonzero eigenvalues ($0 < \lambda_1 \leq \lambda_2 \leq \ldots \leq \lambda_d \leq \ldots$), the embedding into $d$-dimensional Euclidean space is given by $x_i \rightarrow (f_1(i), \ldots, f_d(i))$. Readers can find more details



Fig. 4. Nonadditive effect in an AU combination. (a) AU12 occurs alone. (b) AU15 occurs alone. (c) AU12 and AU15 appear together. (Adapted from [3])

in [21]. Similar to [5], we utilized the Spectral Regression (SR) algorithm to find a projection function that maps high dimensional data, such as HOG and Gabor features, into the low dimensional space.
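A compact sketch of the Laplacian Eigenmap steps above is given below; the neighborhood size k and kernel width t are assumed, and the Spectral Regression projection of unseen frames [5] is not reproduced here.

```python
# A compact Laplacian Eigenmap sketch: kNN graph, heat-kernel weights, L = D - W,
# generalized eigenproblem L f = lambda D f (parameters assumed for illustration).
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist
from sklearn.neighbors import kneighbors_graph

def laplacian_eigenmap(X, d=30, k=10, t=1.0):
    """X: (n, D) feature matrix -> (n, d) low dimensional embedding."""
    A = kneighbors_graph(X, n_neighbors=k, mode='connectivity').toarray()
    A = np.maximum(A, A.T)                              # symmetric kNN adjacency
    W = A * np.exp(-cdist(X, X, 'sqeuclidean') / t)     # heat-kernel weights on edges
    D = np.diag(W.sum(axis=1))
    L = D - W                                           # graph Laplacian
    vals, vecs = eigh(L, D)                             # eigenvalues in ascending order
    return vecs[:, 1:d + 1]                             # skip the trivial constant eigenvector
```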

D. Classification

Given the reduced feature vectors, we extract the AU intensity observations through SVM classification. The SVM is one of the classifiers that has gained popularity for pattern recognition in the last decade. An SVM classifier aims to find a hyperplane with the maximum margin. Several parameters, such as the kernel type (e.g., linear, polynomial, or Radial Basis Function (RBF) kernels), can affect the efficiency of the SVM classifier. For more detailed information on SVMs we refer readers to [22]. For AU intensity observation extraction, in our experiment we utilized multiple SVM classifiers with a one-against-one strategy, and we examined three different kernels (i.e., linear, polynomial and Gaussian RBF), where the Gaussian RBF outperformed the other two. Although we can extract AU intensities with some accuracy, this image-appearance-based approach treats each AU and each frame individually and relies heavily on the accuracy of face region alignment. In order to model the dynamics of AUs, as well as their semantic relationships, and to deal with the image uncertainty, we utilize a DBN for AU inference. Consequently, the output of the SVM classifier is used as the evidence for the subsequent AU inference via the DBN.
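A minimal per-AU observation classifier along these lines is sketched below, using scikit-learn's SVC, which implements one-against-one multi-class handling internally; the hyperparameters are assumptions, not the authors' values.

```python
# A sketch of the per-AU observation classifier: a 6-class SVM with a Gaussian RBF kernel.
from sklearn.svm import SVC

def train_au_observation_svm(features, intensity_labels):
    """features: (n_frames, d) reduced vectors; intensity_labels: values in {0,...,5}."""
    clf = SVC(kernel='rbf', C=1.0, gamma='scale', decision_function_shape='ovo')
    clf.fit(features, intensity_labels)
    return clf

# Per frame, clf.predict(x) yields the observed intensity level used as DBN evidence (MAU node).
```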

IV. DBN MODEL FOR FACIAL ACTION UNIT INTENSITY RECOGNITION

A. AU Dependencies Learning

Measuring the intensity of each AU statically and individually is difficult due to the variety, ambiguity, and dynamic nature of facial actions. This is especially the case for spontaneous facial expressions. Moreover, when AUs occur in combination, they may be non-additive: the appearance of an AU in a combination is different from its stand-alone appearance. Fig. 4 demonstrates an example of the non-additive effect: when AU12 (lip corner puller) appears alone, the lip corners are pulled up toward the cheekbone; however, if AU15 (lip corner depressor) also becomes active, the lip corners are somewhat angled down due to the presence of AU15. The non-additive effect increases the difficulty of recognizing AUs individually.

Fortunately, there are some inherent relationships among AUs, as described in the FACS manual [3], i.e., co-occurrence relationships and mutual exclusion relationships. The co-occurrence relationships characterize some groups of AUs which usually appear together to show meaningful facial emotions, for example, AU1+AU2+AU5+AU26+AU27 to show surprise, AU6+AU12+AU25 to represent happiness, etc.

On the other hand, based on the alternative rules provided in the FACS manual, some AUs are mutually exclusive, since “it may not be possible anatomically [to] demonstrate AU combinations simultaneously” or “the logic of FACS precludes the scoring of both AUs” [3]. For instance, one cannot perform AU25 (lip part) together with AU23 (lip tightener) or AU24 (lip pressor). Furthermore, there are also some restrictions on the AU intensities besides the co-occurrence and mutual exclusion relationships. For instance, when AU6 (cheek raiser) and AU12 (lip corner puller) are present together, a high/low intensity of one AU indicates a high probability of a high/low intensity of the other. At the same time, for the combination of AU10 (upper lip raiser) + AU12 (lip corner puller), “one cannot score AU10 as D or E if the AU12 is a AU12D or E, since such strong actions of AU12 counteract the influence of AU10 on the shape of the upper lip. With such strong AU12, one [is] only able to score AU10 as AU10A, B, or C. With a 12C or less, one may be able to score AU10E.” [3]

Tong et al. [23] employed a Bayesian network to model the co-occurrence and mutual exclusion relationships among AUs. However, [23] focuses on AU detection, which only recognizes an AU's absence or presence. In addition, [23] detects AUs in posed expressions, which are created by asking subjects to deliberately make specific facial actions or expressions. Spontaneous expressions, on the other hand, typically occur in uncontrolled conditions and are more challenging to measure [13]. In this work, following the idea in [23], we adopt a Bayesian Network (BN) to capture the semantic relationships among AUs, as well as the correlations of the AU intensities, for measuring the intensity of spontaneous facial actions. A BN is a Directed Acyclic Graph (DAG) that represents a joint probability distribution over a set of variables. In this work, we employ 12 hidden nodes representing the 12 AUs of the DISFA database (i.e., AU1, AU2, AU4, AU5, AU6, AU9, AU12, AU15, AU17, AU20, AU25, AU26), each of which has six discrete states indicating the intensity of the AU. In a BN, the structure captures the dependence among variables and is crucial for accurately modeling their joint probabilities. In this work, we learn the BN structure directly from the training data. The learning algorithm finds a structure G that maximizes a score function. We employ the Bayesian Information Criterion (BIC) score function [24], which is defined as follows:

$$s_D(G) = \max_{\theta} \log P(D \mid G, \theta) - \frac{\log M}{2}\,\mathrm{Dim}_G \qquad (1)$$

where the first term evaluates how well the network fits the



Fig. 5. The learned BN structure from training data.

data D; the second term is a penalty related to the complexity of the network; $\log P(D \mid G, \theta)$ is the log-likelihood function of the parameters $\theta$ with respect to the data D and structure G; M is the number of training samples; and $\mathrm{Dim}_G$ is the number of parameters.

De Campos and Ji [25] developed a Bayesian network structure learning algorithm that does not depend on the initial structure and guarantees global optimality with respect to the BIC score. In this work, we employ the structure learning method of [25] to learn the dependencies among AUs, as well as the correlations of AU intensities. To simplify the model, we use the constraint that each node has at most two parents. The learned structure is shown in Fig. 5.
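As a concrete illustration of Eq. (1), the sketch below scores a candidate discrete structure from counts; the node indexing, state cardinalities and `structure` encoding are our own assumptions, and the globally optimal search itself is the method of [25], not shown here.

```python
# A sketch of the BIC score in Eq. (1) for a candidate discrete Bayesian network,
# computed from sufficient statistics (counts over the training frames).
import numpy as np
from itertools import product

def bic_score(data, structure, cards):
    """data: (M, n) integer state matrix (one column per AU, states 0..5);
    structure: dict node -> list of parent node indices; cards: dict node -> #states."""
    M = data.shape[0]
    score = 0.0
    for i, parents in structure.items():
        r_i = cards[i]
        q_i = int(np.prod([cards[p] for p in parents])) if parents else 1
        # count n_ijk for every parent configuration j and child state k
        for j_conf in product(*[range(cards[p]) for p in parents]):
            mask = np.ones(M, dtype=bool)
            for p, v in zip(parents, j_conf):
                mask &= data[:, p] == v
            n_ij = int(mask.sum())
            if n_ij == 0:
                continue
            for k in range(r_i):
                n_ijk = int(np.sum(data[mask, i] == k))
                if n_ijk > 0:
                    score += n_ijk * np.log(n_ijk / n_ij)   # ML log-likelihood term
        score -= (np.log(M) / 2.0) * (r_i - 1) * q_i         # Dim_G penalty per family
    return score

# Usage sketch: bic_score(data, structure={0: [], 1: [0], 2: [1]}, cards={0: 6, 1: 6, 2: 6})
```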

B. Dynamic Dependencies Analysis

The above BN structure can only capture static dependencies. In this section, we extend it to a dynamic Bayesian network by adding dynamic links. In general, a DBN is made up of interconnected time slices of static BNs, and the relationships between two neighboring time slices follow a first-order Markov assumption, such that variables at time t are influenced by other variables at time t, as well as by the corresponding variables at time t − 1 only. In the proposed framework, we consider two types of conditional dependencies between variables at two adjacent time slices. The first type, i.e., an arc from the AUi node at time t − 1 to that node at time t, depicts how a single variable develops over time. For instance, since spontaneous facial expressions change smoothly, the intensity of an AU has a high probability of changing gradually, either in ascending or descending order. Such dynamic restrictions are modeled by the first type of dynamic links, and we include such a link for every AU.

The second type, i.e., an arc from AUi at time t − 1 to AUj (j ≠ i) at time t, depicts how AUi at the previous time step affects AUj at the current time step. This dynamic dependence is also important for understanding spontaneous expressions. For example, K. Schmidt et al. [26] found that certain action units usually closely follow the appearance of AU12 in smile expressions. For 88% of the smile data they collected, the appearance of AU12 was either simultaneous with or closely followed by one or more associated action units, and for these smiles with multiple action units, AU6 was the first action unit to follow AU12 in 47% of cases. Messinger et al. [27] also show that AU6 may follow AU12 (smile) or AU20 (cry) to act as an enhancer to


Fig. 6. The complete DBN model for AU intensity recognition. The shaded nodes indicate the observations for the connected hidden nodes. The self-arrow at a hidden node represents its temporal evolution from the previous time slice to the current time slice. The link from AUi at time t − 1 to AUj (j ≠ i) at time t indicates the dynamic dependence between different AUs.

enhance the emotion. This means that a certain AU in the next time step may be affected by other AUs in the current time step. Analysis of other expressions leads to a similar conclusion. Based on this understanding and the analysis of the database, as well as the temporal characteristics of the AUs we intend to recognize, in this work we link the AU2 node and the AU12 node at time t − 1 to the AU5 node and the AU6 node at time t, respectively, to capture the second type of dynamics. Fig. 6 gives the whole picture of the dynamic BN, including the shaded visual observation nodes. For presentation clarity, we use self-arrows to indicate the first type of temporal links described above.
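For concreteness, a small sketch of how the edge set of this two-slice DBN could be assembled; only the dynamic and measurement links are enumerated in the text, so the intra-slice edges from the learned structure in Fig. 5 are passed in as a parameter.

```python
# A sketch assembling the two-slice DBN edge set described above.
AUS = [1, 2, 4, 5, 6, 9, 12, 15, 17, 20, 25, 26]

def build_dbn_edges(intra_edges):
    """intra_edges: list of (AUi, AUj) static dependencies learned with BIC (Fig. 5).
    Returns (intra-slice edges, inter-slice edges, measurement edges)."""
    inter = [((au, 't-1'), (au, 't')) for au in AUS]             # type 1: self transitions
    inter += [((2, 't-1'), (5, 't')), ((12, 't-1'), (6, 't'))]   # type 2: cross-AU dynamics
    measurement = [((au, 't'), ('MAU', au, 't')) for au in AUS]  # hidden AU -> SVM observation
    return intra_edges, inter, measurement
```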

C. DBN Parameter Learning

Given the DBN structure, we now focus on learning the parameters from the training data in order to infer the hidden nodes. A DBN can be viewed as a pair of BNs (a prior network and a transition network), so parameter learning is implemented in the same way as for a static BN. Learning the parameters of a BN is to find the most probable values $\hat{\theta}$ for $\theta$ that best explain the training data. Let $\theta_{ijk}$ denote a probability parameter,

$$\theta_{ijk} = p(x_i^k \mid pa_j(X_i)) \qquad (2)$$

where i ranges over all the variables (nodes in the BN), j ranges over all the possible parent instantiations for variable $X_i$, and k ranges over all the instantiations of $X_i$ itself (the intensity levels of the AU). Therefore, $x_i^k$ represents the kth state of variable $X_i$. In this work, the “fitness” of the parameters $\theta$ to the training data D is quantified by the log-likelihood function $\log p(D \mid \theta)$, denoted as $L_D(\theta)$. Assuming the training samples are independent, based on the conditional independence assumption in the BN, we have the log-likelihood function in Eq. 3, where $n_{ijk}$ is the count of the cases in which node $X_i$ has state k with parent configuration j.

$$L_D(\theta) = \log \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \theta_{ijk}^{n_{ijk}} \qquad (3)$$

Since we have complete training data, i.e., for each frame we have the intensity labels for all 12 AUs, Maximum Likelihood (ML) estimation can be formulated as a constrained optimization problem, i.e., maximize Eq. 4 subject to the equality constraints

$$\max_{\theta} L_D(\theta) \quad \text{s.t.} \quad g_{ij}(\theta) = \sum_{k=1}^{r_i} \theta_{ijk} - 1 = 0 \qquad (4)$$

where $g_{ij}$ imposes the constraint that the parameters of each node sum to one over all the states of that node, $1 \leq i \leq n$ and $1 \leq j \leq q_i$. Solving the above equations, we obtain $\theta_{ijk} = \frac{n_{ijk}}{\sum_k n_{ijk}}$.
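A minimal sketch of this closed-form estimate follows, assuming the joint parent instantiation has already been encoded as a single index; the encoding and the optional smoothing term are our own additions.

```python
# A minimal sketch of the closed-form ML estimate above: each CPT entry is the
# normalized count of (child state k, parent configuration j).
import numpy as np

def estimate_cpt(child_states, parent_configs, r_i, q_i, alpha=0.0):
    """child_states: (M,) ints in [0, r_i); parent_configs: (M,) ints in [0, q_i)
    encoding the joint parent instantiation. alpha > 0 adds optional smoothing."""
    counts = np.zeros((q_i, r_i)) + alpha
    for j, k in zip(parent_configs, child_states):
        counts[j, k] += 1
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0                    # avoid division by zero for unseen configs
    return counts / totals                       # theta[j, k] = n_ijk / sum_k n_ijk
```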

D. DBN Inference

Given the complete DBN model and the AU observations, we can estimate the true states of the hidden nodes by maximizing their posterior probability. Let $AU_{1:N}^{t}$ represent the nodes for the N target AUs at time t. Given the available evidence up to time t, $MAU_{1:N}^{1:t}$, the probability $p(AU_{1:N}^{t} \mid MAU_{1:N}^{1:t})$ can be factorized and computed via the facial activity model by performing the DBN updating process described in [28]. Because of the recursive nature of the inference process, as well as the simple network topology, the inference can be implemented rather efficiently.
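For illustration, here is a simplified forward-filtering sketch of this recursive update for a single chain; the actual model couples all 12 AUs and uses a general DBN inference engine as in [28].

```python
# Simplified forward filtering: p(AU_t | MAU_{1:t}) updated recursively from
# the previous belief, the transition model and the current observation likelihood.
import numpy as np

def forward_filter(obs_likelihoods, transition, prior):
    """obs_likelihoods: (T, S) p(MAU_t | AU_t = s); transition: (S, S) p(AU_t | AU_{t-1});
    prior: (S,) p(AU_1). Returns (T, S) filtered posteriors."""
    T, S = obs_likelihoods.shape
    belief = np.zeros((T, S))
    b = prior * obs_likelihoods[0]
    belief[0] = b / b.sum()
    for t in range(1, T):
        predicted = belief[t - 1] @ transition   # sum over previous states
        b = predicted * obs_likelihoods[t]       # incorporate current evidence
        belief[t] = b / b.sum()
    return belief
```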

V. EXPERIMENTAL RESULTS

In this section, we utilize the DISFA database to evaluate the performance of automatic measurement of the intensity of spontaneous action units. First we introduce the contents of DISFA, and then we report the results of the proposed system for measuring the intensity of the 12 AUs of this database.

A. DISFA Database Description

The Denver Intensity of Spontaneous Facial Action (DISFA) database [9], [30] contains videos of spontaneous facial expressions of 27 adult subjects with different ethnicities (i.e., Asian, Caucasian, Hispanic, and African American). The facial images were video recorded by a high resolution camera (1024× 768 pixels at 20 fps) while every subject watched a 4-minute emotive audio-video stimulus clip. The intensity of 12 AUs (i.e., AU1, AU2, AU4, AU5, AU6, AU9, AU12, AU15, AU17, AU20, AU25, AU26) has been coded by a FACS coder, and the six levels of AU intensity were reported on an ordinal scale (0-5, where 0 represents the absence of an AU and 1-5 represent intensities from trace through maximum, respectively) [3]. The database also contains a set of 66 landmark points that represent the coordinates of important components of the human face, such as the corners of the eyes and the boundary of the lips [9]. In this study, we utilized all the video frames of DISFA (≈ 125,000 frames) for measuring the intensity of the 12 AUs.

B. Results Analysis

We evaluate our system using leave-one-subject-out cross validation and report the average recognition results over all 27 subjects. In order to compare the predicted and

TABLE I
AU INTENSITY RECOGNITION RESULTS (ICC) USING DIFFERENT FEATURES ON THE DISFA DATABASE.

AU No.   HOG (SVM)   HOG (DBN)   Gabor (SVM)   Gabor (DBN)
AU1      0.69        0.72        0.79          0.80
AU2      0.56        0.58        0.82          0.84
AU4      0.82        0.86        0.87          0.87
AU5      0.51        0.56        0.58          0.56
AU6      0.70        0.72        0.80          0.81
AU9      0.73        0.74        0.79          0.80
AU12     0.83        0.85        0.84          0.84
AU15     0.58        0.61        0.71          0.72
AU17     0.53        0.58        0.68          0.71
AU20     0.49        0.54        0.53          0.55
AU25     0.85        0.86        0.93          0.94
AU26     0.65        0.70        0.79          0.80
Avg.     0.67        0.70        0.76          0.77

manually coded intensities of action units, we calculate the Intra-Class Correlation (ICC). The ICC ranges from 0 to 1 and is a measure of correlation or conformity for a data set with multiple targets [29]. In other words, the ICC measures reliability in studies in which n targets are rated by k judges (i.e., in this paper k = 2 and n = 6). The ICC is similar to the Pearson correlation and is preferred when computing consistency between judges or measurement devices. The ICC is defined as:

$$ICC = \frac{BMS - EMS}{BMS + (k - 1) \times EMS} \qquad (5)$$

where BMS is the between-targets mean squares and EMS is the residual mean squares defined by Analysis Of Variance (ANOVA). That is, the ICC indicates the proportion of the total variance that is due to differences between targets. See [29] for additional details.
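A small sketch of Eq. (5) follows, with BMS and EMS obtained from a two-way ANOVA decomposition (consistency form, as in Shrout and Fleiss [29]); the helper itself is ours, since the paper does not specify an implementation.

```python
# ICC (consistency) from Eq. (5): (BMS - EMS) / (BMS + (k - 1) * EMS),
# with mean squares taken from a two-way ANOVA decomposition of the rating matrix.
import numpy as np

def icc_consistency(ratings):
    """ratings: (n_targets, k_judges) matrix, e.g. column 0 = manual codes,
    column 1 = predicted intensities."""
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    ss_targets = k * np.sum((ratings.mean(axis=1) - grand) ** 2)
    ss_judges = n * np.sum((ratings.mean(axis=0) - grand) ** 2)
    ss_total = np.sum((ratings - grand) ** 2)
    bms = ss_targets / (n - 1)                                   # between-targets mean squares
    ems = (ss_total - ss_targets - ss_judges) / ((n - 1) * (k - 1))  # residual mean squares
    return (bms - ems) / (bms + (k - 1) * ems)
```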

For AU observation extraction, we employed two types of features, i.e., HOG features and Gabor features, followed by SVM classification. Given the image observations, we estimate the intensity of each AU through the same DBN model; the results are given in Table I.

From Table I we can see that, for both types of features, employing the DBN model yields an improvement over using the image-driven method alone, and when the image observations are less accurate, i.e., for the HOG feature observations, the improvement is larger. This is because the enhancement of the framework mainly comes from combining the DBN model with the image-driven methods, so erroneous image observations can be compensated through the dynamic and semantic relationships encoded in the DBN. For instance, AU20 (lip stretcher) is not well recognized with either HOG or Gabor features, because the activation of AU20 produces subtle facial appearance changes. However, AU20 is strongly exclusive with AU25 (lips apart), which is recognized with high accuracy with both kinds of features. By encoding such relationships in the DBN model, the ICC of AU20 is increased from 0.49 to 0.54 for the HOG feature observations, and from 0.53 to 0.55 for


the Gabor feature observations. Similarly, by modeling the co-occurrence relationship between AU15 and AU17, for the HOG features, the ICC of AU15 is increased from 0.58 to 0.61, and that of AU17 is increased from 0.53 to 0.58. Hence, we can conclude that considering the semantic relationships among AUs, as well as the temporal information, does help in analyzing spontaneous facial actions.

VI. CONCLUSIONS AND FUTURE WORKS

In this paper, we presented a unified probabilistic framework for measuring the intensity of spontaneous facial action units from image sequences. Our framework consists of two independent but collaborative components, i.e., observation extraction and DBN inference. The enhancement of our framework mainly comes from combining the DBN model with image-driven methods. For instance, the overall ICC value increased from 0.67 to 0.70 for the HOG features and from 0.76 to 0.77 for the Gabor features, which demonstrates that the unified probabilistic framework can improve the performance of an AU intensity measurement system. In this study, we focused on facial images from the frontal view. In order to address the more comprehensive problem of measuring the intensity of spontaneous action units, as future work we will expand our framework by introducing another layer of hidden nodes to model head movements.

REFERENCES

[1] C. Breazeal, “Sociable Machines: Expressive Social Exchange Between Humans and Robots”, Sc.D. dissertation, Department of Electrical Engineering and Computer Science, MIT, 2000.

[2] F. Dornaika and B. Raducanu, “Facial Expression Recognition for HCI Applications”, Prentice Hall computer applications in electrical engineering series, pp. 625-631, 2009.

[3] P. Ekman, W. V. Friesen, and J. C. Hager, “Facial Action Coding System”, Salt Lake City, UT: A Human Face, 2002.

[4] M. S. Bartlett, G. Littlewort, M. Frank, C. Lainscsek, I. Fasel, and J. Movellan, “Recognizing Facial Expression: Machine Learning and Application to Spontaneous Behavior”, Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition (CVPR '05), pp. 568-573, 2005.

[5] M. H. Mahoor, S. Cadavid, D. S. Messinger, and J. F. Cohn, “A Framework for Automated Measurement of the Intensity of Non-Posed Facial Action Units”, 2nd IEEE Workshop on CVPR for Human Communicative Behavior Analysis (CVPR4HB), Miami Beach, June 25, 2009.

[6] K. L. Schmidt, Z. Ambadar, J. F. Cohn, and L. I. Reed, “Movement Differences Between Deliberate and Spontaneous Facial Expressions: Zygomaticus Major Action In Smiling”, Journal of Nonverbal Behavior, vol. 30(1), pp. 37-52, 2006.

[7] P. Lucey, J. F. Cohn, K. M. Prkachin, P. Solomon, and I. Matthews, “Painful data: The UNBC-McMaster Shoulder Pain Expression Archive Database”, IEEE International Conference on Automatic Face and Gesture Recognition (FG2011), Santa Barbara, CA, March 2011.

[8] W. Shangfei, L. Zhilei, L. Siliang, L. Yanpeng, W. Guobing, P. Peng, C. Fei, and W. Xufa, “A Natural Visible and Infrared Facial Expression Database for Expression Recognition and Emotion Inference”, IEEE Transactions on Multimedia, vol. 12, no. 7, pp. 682-691, Nov. 2010.

[9] S. M. Mavadati, M. H. Mahoor, K. Bartlett, P. Trinh, and J. F. Cohn, “DISFA: A Spontaneous Facial Action Intensity Database”, IEEE Transactions on Affective Computing, revised and resubmitted.

[10] T. Kanade, J. Cohn, and Y. Tian, “Comprehensive database for facial expression analysis”, in Proceedings of the International Conference on Automatic Face and Gesture Recognition, pp. 46-53, 2000.

[11] M. Pantic, M. Valstar, R. Rademaker, and L. Maat, “Web-based database for facial expression analysis”, IEEE International Conference on Multimedia and Expo (ICME 2005), 6-8 July 2005.

[12] N. Alyüz, B. Gökberk, H. Dibeklioglu, A. Savran, A. A. Salah, L. Akarun, and B. Sankur, “3D Face Recognition Benchmarks on the Bosphorus Database with Focus on Facial Expressions”, The First COST 2101 Workshop on Biometrics and Identity Management (BIOID 2008), Roskilde University, Denmark, May 2008.

[13] M. S. Bartlett, G. C. Littlewort, C. Lainscsek, I. Fasel, M. G. Frank, and J. R. Movellan, “Fully automatic facial action recognition in spontaneous behavior”, 7th International Conference on Automatic Face and Gesture Recognition, pp. 223-228, 2006.

[14] S. M. Mavadati, M. H. Mahoor, K. Bartlett, and P. Trinh, “Automatic Detection of Non-posed Facial Action Units”, in Proceedings of the IEEE International Conference on Image Processing (ICIP), Sep.-Oct. 2012.

[15] R. Sprengelmeyer and I. Jentzsch, “Event related potentials and the perception of intensity in facial expressions”, Neuropsychologia, vol. 44, pp. 2899-2906, 2006.

[16] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection”, Computer Vision and Pattern Recognition 2005, vol. 1, pp. 886-893, June 2005.

[17] Y. Tian, T. Kanade, and J. F. Cohn, “Evaluation of Gabor-Wavelet-Based Facial Action Unit Recognition in Image Sequences of Increasing Complexity”, International Conference on Automatic Face and Gesture Recognition, pp. 229, 2002.

[18] I. K. Fodor, “A survey of dimension reduction techniques”, 2002.

[19] L. Cayton, “Algorithms for manifold learning”, University of California, San Diego, Tech. Rep. CS2008-0923, 2005.

[20] C. Lee and A. Elgammal, “Nonlinear shape and appearance models for facial expression analysis and synthesis”, IEEE Conference on Computer Vision and Pattern Recognition, I:313-320, 2003.

[21] M. Belkin and P. Niyogi, “Laplacian Eigenmaps for Dimensionality Reduction and Data Representation”, Neural Computation, vol. 15, pp. 1373-1396, 2003.

[22] V. N. Vapnik, “An overview of statistical learning theory”, IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 988-999, Sep. 1999.

[23] Y. Tong, W. Liao, and Q. Ji, “Facial Action Unit Recognition by Exploiting Their Dynamic and Semantic Relationships”, IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 10, 2007.

[24] G. Schwarz, “Estimating the dimension of a model”, The Annals of Statistics, vol. 6, pp. 461-464, 1978.

[25] C. P. de Campos and Q. Ji, “Efficient structure learning of Bayesian networks using constraints”, Journal of Machine Learning Research, pp. 663-689, 2011.

[26] K. Schmidt and J. Cohn, “Dynamics of Facial Expression: Normative Characteristics and Individual Differences”, in Proc. IEEE Int'l Conf. Multimedia and Expo, pp. 728-731, 2001.

[27] D. S. Messinger, W. I. Mattson, M. H. Mahoor, and J. F. Cohn, “The eyes have it: Making positive expressions more positive and negative expressions more negative”, Emotion, vol. 12, pp. 430-436, 2012.

[28] K. B. Korb and A. E. Nicholson, “Bayesian Artificial Intelligence”, Chapman and Hall/CRC, 2004.

[29] P. E. Shrout and J. L. Fleiss, “Intraclass correlations: Uses in assessing rater reliability”, Psychological Bulletin, vol. 86(2), pp. 420-428, Mar. 1979.

[30] http://www.engr.du.edu/mmahoor/disfa.htm