
Dynamic Cascades with Bidirectional Bootstrapping for Action Unit Detection in Spontaneous Facial Behavior

Yunfeng Zhu, Fernando De la Torre, Jeffrey F. Cohn, Associate Member, IEEE,

and Yu-Jin Zhang, Senior Member, IEEE

Abstract—Automatic facial action unit detection from video is a long standing problem in facial expression analysis. Research has focused on registration, choice of features, and classifiers. A relatively neglected problem is the choice of training images. Nearly all previous work uses one or the other of two standard approaches. One approach assigns peak frames to the positive class and frames associated with other actions to the negative class. This approach maximizes differences between positive and negative classes, but results in a large imbalance between them, especially for infrequent AUs. The other approach reduces imbalance in class membership by including all target frames from onsets to offsets in the positive class. However, because frames near onsets and offsets often differ little from those that precede them, this approach can dramatically increase false positives. We propose a novel alternative, dynamic cascades with bidirectional bootstrapping (DCBB), to select training samples. Using an iterative approach, DCBB optimally selects positive and negative samples in the training data. Using Cascade AdaBoost as its basic classifier, DCBB exploits the feature selection, efficiency, and robustness of Cascade AdaBoost. To provide a real-world test, we used the RU-FACS (a.k.a. M3) database of non-posed behavior recorded during interviews. For most tested action units, DCBB improved AU detection relative to alternative approaches.

Index Terms—Facial expression analysis, action unit detection, FACS, dynamic cascade boosting, bidirectional bootstrapping.

1 INTRODUCTION

The face is one of the most powerful channels of nonverbal communication. Facial expression provides cues about emotional response, regulates interpersonal behavior, and communicates aspects of psychopathology. To make use of the information afforded by facial expression, Ekman and Friesen [1] proposed the Facial Action Coding System (FACS). FACS segments the visible effects of facial muscle activation into "action units (AUs)". Each action unit is related to one or more facial muscles. These anatomic units may be combined to represent more molar facial expressions. Emotion-specified joy, for instance, is represented by the combination of AU6 (cheek raiser, which results from contraction of the orbicularis oculi muscle) and AU12 (lip-corner puller, which results from contraction of the zygomatic major muscle). The FACS taxonomy was developed by manually observing live and recorded facial behavior and by recording the electrical activity of underlying facial muscles [2]. Because of its descriptive power, FACS has become the state of the art in manual measurement of facial expression [3] and is widely used in studies of spontaneous facial behavior [4].

This work was supported in part by National Institutes of Health Grant 51435. The first author was also partially supported by a scholarship from the China Scholarship Council.
Yunfeng Zhu is in the Department of Electronic Engineering at Tsinghua University, Beijing 100084, China. E-mail: [email protected]
Fernando De la Torre is in the Robotics Institute at Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA. E-mail: [email protected]
Jeffrey F. Cohn is in the Department of Psychology at the University of Pittsburgh, Pittsburgh, Pennsylvania 15260, USA. E-mail: [email protected]
Yu-Jin Zhang is in the Department of Electronic Engineering at Tsinghua University, Beijing 100084, China. E-mail: [email protected]


Fig. 1. Example of strong, subtle, and ambiguous samples of FACS action unit 12. Strong samples typically correspond to the peak, or maximum intensity, and ambiguous frames correspond to AU onset and offset and frames proximal to them. Subtle samples are located between strong and ambiguous ones. Using an iterative algorithm, DCBB selects positive samples such that the detection accuracy is optimized during training.

For these and related reasons, much effort in automatic facial image analysis seeks to automatically recognize FACS action units [5–9].

Automatic detection of AUs from video is a challenging problem for several reasons.


Non-frontal pose and moderate to large head motion make facial image registration difficult; large variability occurs in the temporal scale of facial gestures; individual differences occur in the shape and appearance of facial features; many facial actions are inherently subtle; and the number of possible combinations of 40+ individual action units numbers in the thousands. More than 7000 action unit combinations have been observed [4]. Previous efforts at AU detection have emphasized types of features and classifiers. Features have included shape and various appearance representations, such as grayscale pixels, edges, canonical appearance, Gabor, and SIFT descriptors. Classifiers have included support vector machines (SVM) [10], boosting [11], hidden Markov models (HMM) [12], and dynamic Bayesian networks (DBN) [13]; Section 2 reviews much of this literature.

By contrast, little attention has been paid to the assignment of video frames to positive and negative classes. Typically, assignment has been done in one of two ways. One assigns to the positive class those frames at the peak of each AU or proximal to it. Peak refers to the frame of maximum intensity of an action unit between when it begins ("onset") and when it ends ("offset"). The negative class then is chosen by randomly sampling other AUs, including AU0, or neutral. This approach suffers at least two drawbacks: (1) the number of training examples will often be small, which results in a large imbalance between positive and negative frames; and (2) peak frames may provide too little variability to achieve good generalization. These problems may be circumvented by following an alternative approach, that is, to include all frames from onset to offset in the positive class. This approach improves the ratio of positive to negative frames and increases the representativeness of positive examples. The downside is confusability of positive and negative classes. Onset and offset frames, and many of those proximal or even further from them, may be indistinguishable from the negative class. As a consequence, the number of false positives can dramatically increase.
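To make the contrast concrete, here is a minimal sketch (not from the paper) of the two standard assignment strategies, assuming hypothetical FACS annotations given as (onset, peak, offset) frame indices per AU event; the function names are illustrative.

```python
import numpy as np

def peak_only_labels(events, n_frames, radius=1):
    """Strategy 1: only peak frames (and a small neighborhood) are positive."""
    labels = np.zeros(n_frames, dtype=int)
    for onset, peak, offset in events:
        labels[max(peak - radius, 0):peak + radius + 1] = 1
    return labels

def onset_to_offset_labels(events, n_frames):
    """Strategy 2: every frame from onset to offset is positive."""
    labels = np.zeros(n_frames, dtype=int)
    for onset, peak, offset in events:
        labels[onset:offset + 1] = 1
    return labels

# Toy example: three AU events in a 1000-frame video.
events = [(100, 120, 150), (400, 430, 470), (800, 805, 812)]
print(peak_only_labels(events, 1000).sum())        # few positives, large class imbalance
print(onset_to_offset_labels(events, 1000).sum())  # many positives, ambiguous boundary frames included
```

DCBB sits between these two extremes, growing the positive set outward from the peaks while excluding ambiguous boundary frames.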

To address these issues, we propose an extension of AdaBoost [14–16] called Dynamic Cascades with Bidirectional Bootstrapping (DCBB). Fig. 1 illustrates the main idea. Having manually annotated FACS data with onset, peak, and offset, the question we address is how best to select the AU frames for the positive and negative classes. Preliminary results for this work have been presented in [17].

In contrast to previous approaches to class assignment, DCBB automatically distinguishes between strong, subtle, and ambiguous frames for AU events of different intensity. Strong frames correspond to the peaks and the ones proximal to them; ambiguous frames are proximal to onsets and offsets; subtle frames occur between strong and ambiguous ones. Strong and subtle frames are assigned to the positive class. By distinguishing between these three types, DCBB maximizes the number of positive frames while reducing confusability between positive and negative classes.

For high-intensity AUs, the algorithm selects more frames for the positive class than it does for low-intensity AUs. Some of these frames may be similar in intensity to low-intensity AUs. Similarly, if multiple peaks occur between an onset and offset, DCBB assigns multiple segments to the positive class. See Fig. 1 for an example. Strong and subtle, but not ambiguous, AU frames are assigned to the positive class. For the negative class, DCBB uses a mechanism similar to that of Cascade AdaBoost: the weights of misclassified negative samples are increased when training each weak classifier, and no single cascade stage is allowed to learn too much. Moreover, the positive class is changed at each iteration, and the corresponding negative class is reselected accordingly.

In experiments, we evaluated the validity of our approach to class assignment and selection of features. In the first experiment, we illustrate the importance of selecting the right positive samples for action unit detection. In the second, we compare DCBB with standard approaches based on SVM and AdaBoost.

The rest of the paper is organized as follows. Section 2 reviews previous work on automatic methods for action unit detection. Section 3 describes pre-processing steps for alignment and feature extraction. Section 4 gives details of our proposed DCBB method. Section 5 provides experimental results in non-posed, naturalistic video. For experimental evaluation, we used FACS-coded interviews from the RU-FACS (a.k.a. M3) database [18, 19]. For most action units tested, DCBB outperformed alternative approaches.

2 PREVIOUS WORK

This section describes previous work on FACS and on automatic detection of AUs from video.

2.1 FACS

The Facial Action Coding System (FACS) [1] is a comprehensive, anatomically based system for measuring nearly all visually discernible facial movement. FACS describes facial activity on the basis of 44 unique action units (AUs), as well as several categories of head and eye positions and movements. Facial movement is thus described in terms of constituent components, or AUs. Any facial expression may be represented as a single AU or a combination of AUs. For example, the felt, or Duchenne, smile is indicated by movement of the zygomatic major (AU12) and orbicularis oculi, pars lateralis (AU6). FACS is recognized as the most comprehensive and objective means for measuring facial movement currently available, and it has become the standard for facial measurement in behavioral research in psychology and related fields. FACS coding procedures allow for coding of the intensity of each facial action on a 5-point intensity scale, which provides a metric for the degree of muscular contraction and for measurement of the timing of facial actions. FACS scoring produces a list of AU-based descriptions of each facial event in a video record. Fig. 2 shows an example for AU12.

2.2 Automatic FACS detection from video

Two main streams in the current research on automatic analysis of facial expressions consider emotion-specified expressions (e.g., happy or sad) and anatomically based facial actions (e.g., FACS).


Fig. 2. FACS coding typically involves frame-by-frame inspection of the video, paying close attention to transient cues such as wrinkles, bulges, and furrows to determine which facial action units have occurred and their intensity. Full labeling requires marking onset, peak, and offset, and may include annotating changes in intensity as well. Left to right: evolution of an AU 12 (involved in smiling), from onset, through peak, to offset.

The pioneering work of Black and Yacoob [20] recognizes facial expressions by fitting local parametric motion models to regions of the face and then feeding the resulting parameters to a nearest neighbor classifier for expression recognition. De la Torre et al. [21] used condensation and appearance models to simultaneously track and recognize facial expression. Chang et al. [22] learned a low dimensional Lipschitz embedding to build a manifold of shape variation across several people and then used I-condensation to simultaneously track and recognize expressions. Lee and Elgammal [23] employed multi-linear models to construct a non-linear manifold that factorizes identity from expression.

Several promising prototype systems have been reported that can recognize deliberately produced AUs in either near frontal view face images (Bartlett et al. [24]; Tian et al. [8]; Pantic & Rothkrantz [25]) or profile view face images (Pantic & Patras [26]). Although high scores have been achieved on posed facial action behavior [13, 27, 28], accuracy tends to be lower in the few studies that have tested classifiers on non-posed facial behavior [11, 29, 30]. In non-posed facial behavior, non-frontal views and rigid head motion are common, and action units are often less intense, have different timing, and occur in complex combinations [31]. These factors have been found to reduce AU detection accuracy [32]. Non-posed facial behavior is more representative of facial actions that occur in real life, which is our focus in the current paper.

Most work in automatic analysis of facial expressions differs in the choice of facial features, representations, and classifiers. Bartlett et al. [11, 19, 24] used SVM and AdaBoost on texture-based image representations to recognize 20 action units in near-frontal posed and non-posed facial behavior. Valstar and Pantic [26, 33, 34] proposed a system that enables fully automated robust facial expression recognition and temporal segmentation of onset, peak, and offset from video of mostly frontal faces. The system included particle filtering to track facial features, Gabor-based representations, and a combination of SVM and HMM to temporally segment and recognize action units.


Fig. 3. AAM tracking across several frames

Lucey et al. [30, 35] compared the use of different shape and appearance representations and different registration mechanisms for AU detection.

Tong et al. [13] used Dynamic Bayesian Networks with appearance features to detect facial action units in posed facial behavior. The correlation among action units served as priors in action unit detection. Comprehensive reviews of automatic facial coding may be found in [5–8, 36].

To the best of our knowledge, no previous work has considered strategies for selecting training samples or evaluated their importance in AU detection. This is the first paper to propose an approach to optimize the selection of positive and negative training samples. Our findings suggest that a principled approach to optimizing the selection of training samples increases the accuracy of AU detection relative to the current state of the art.

3 FACIAL FEATURE TRACKING AND IMAGE FEATURES

This section describes the system for facial feature tracking using active appearance models (AAMs), and the extraction and representation of shape and appearance features for input to the classifiers.

3.1 Facial tracking and alignment

Over the last decade, appearance models have become increasingly important in computer vision and graphics. Parameterized Appearance Models (PAMs) (e.g., Active Appearance Models [37–39] and Morphable Models [40]) have proven useful for detection, facial feature alignment, and face synthesis. In particular, Active Appearance Models (AAMs) have proven an excellent tool for aligning facial features with respect to a shape and appearance model. In our case, the AAM is composed of 66 landmarks that deform to fit perturbations in facial features. Person-specific models were trained on approximately 5% of the video [39]. Fig. 3 shows an example of AAM tracking of facial features in a single subject from the RU-FACS [18, 19] video data set.

After tracking facial features using the AAM, a similarity transform registers the facial features with respect to an average face (see middle column in Fig. 4). In the experiments reported here, the face was normalized to 212×212 pixels.



Fig. 4. Two-step alignment

To extract appearance representations in areas that have not been explicitly tracked (e.g., the nasolabial furrow), we use a backward piecewise affine warp with Delaunay triangulation to set up the correspondence. Fig. 4 shows the two-step process for registering the face to a canonical pose for AU detection. Purple squares represent tracked points and blue dots represent meaningful non-tracked points. The dashed blue line shows the mapping between a point in the mean shape and its corresponding point in the original image. This two-step registration proves important for detecting low intensity action units.
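A rough sketch of this two-step registration, under stated assumptions (it is not the authors' implementation): landmarks are (x, y) rows of an array, the mean shape is defined in the 212×212 canonical frame, and scikit-image supplies the similarity and Delaunay-based piecewise affine transforms.

```python
import numpy as np
from skimage import transform

def register_face(image, landmarks, mean_shape, out_size=212):
    """Sketch of the two-step registration (illustrative, not the authors' code).
    landmarks, mean_shape: (66, 2) arrays of (x, y) points; mean_shape lives in
    the out_size x out_size canonical frame."""
    # Step 1: similarity transform (rotation, scale, translation) that registers
    # the tracked landmarks to the average face; used for the shape representation.
    sim = transform.SimilarityTransform()
    sim.estimate(landmarks, mean_shape)
    registered_shape = sim(landmarks)

    # Step 2: backward piecewise affine warp over a Delaunay triangulation.
    # warp() needs the output-to-input mapping, i.e. canonical (mean-shape)
    # coordinates mapped back to the original image, so untracked regions such
    # as the nasolabial furrow are also brought into correspondence.
    pw = transform.PiecewiseAffineTransform()
    pw.estimate(mean_shape, landmarks)
    canonical_face = transform.warp(image, pw, output_shape=(out_size, out_size))
    return registered_shape, canonical_face
```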

3.2 Appearance features

Appearance features for AU detection [11, 41] have outperformed shape-only features for some action units; see Lucey et al. [35, 42, 43] for a comparison. In this section, we explore the use of SIFT [44] descriptors as appearance features.

Given feature points tracked with AAMs, SIFT descriptors are first computed around the points of interest. SIFT descriptors are computed from the gradient vector for each pixel in the neighborhood to build a normalized histogram of gradient directions. For each pixel within a subregion, SIFT descriptors add the pixel's gradient vector to a histogram of gradient directions by quantizing each orientation to one of 8 directions and weighting the contribution of each vector by its magnitude.
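As an illustration (assuming OpenCV 4.4 or later, where SIFT lives in the main module), here is a sketch of computing SIFT descriptors at the tracked AAM points rather than at detected keypoints; the 48-pixel support follows the experimental setup described later, and the function name is ours.

```python
import cv2
import numpy as np

def sift_at_landmarks(gray_face, points, patch_size=48):
    """Compute SIFT descriptors at fixed locations (the tracked AAM points)
    rather than at detected interest points. Each descriptor is a 128-D
    histogram of gradient orientations (8 orientation bins per spatial cell)."""
    sift = cv2.SIFT_create()
    # Fixed size so the support region is roughly patch_size pixels, and fixed
    # angle (0) so descriptors stay aligned with the registered face.
    keypoints = [cv2.KeyPoint(float(x), float(y), patch_size, 0) for (x, y) in points]
    _, descriptors = sift.compute(gray_face, keypoints)
    return descriptors   # shape: (num_points, 128)
```

Concatenating the descriptors of the twenty lower-face (or sixteen upper-face) points then gives the appearance vector for one frame.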

4 DYNAMIC CASCADES WITH BIDIRECTIONAL BOOTSTRAPPING (DCBB)

This section explores the use of dynamic boosting techniques to select the positive and negative samples that improve detection performance in AU detection.

Bootstrapping [45] is a resampling method that is compatible with many learning algorithms. During the bootstrapping process, the active sets of negative examples are extended by examples that were misclassified by the current classifier.


Fig. 5. Bidirectional Bootstrapping.

In this section, we propose Bidirectional Bootstrapping, a method to bootstrap both positive and negative samples.

Bidirectional Bootstrapping begins by selecting as positive samples only the peak frames and uses a Classification and Regression Tree (CART) [46] as a weak classifier (Initial learning in Section 4.1). After this initial training step, Bidirectional Bootstrapping extends the positive samples from the peak frames to proximal frames and redefines new provisional positive and negative training sets (Dynamic learning in Section 4.2). The positive set is extended by including samples that are classified correctly by the previous strong classifier (Cascade AdaBoost in our algorithm); the negative set is extended by examples misclassified by the same strong classifier, thus emphasizing negative samples close to the decision boundary. With the bootstrapping of positive samples, the generalization ability of the classifier is gradually enhanced. The active positive and negative sets are then used as input to the CART, which returns a hypothesis that updates the weights in the manner of Gentle AdaBoost [47]; training continues until the variation between the previous and current Cascade AdaBoost becomes smaller than a defined threshold. Fig. 5 illustrates the process. In Fig. 5, P is the potential positive data set, Q is the negative data set (negative pool), P0 is the positive set in the initial learning step, and Pw is the active positive set in each iteration; the size of the solid circles illustrates the intensity of the AU samples, and the ellipses on the right illustrate the spreading of the dynamic positive set. See details in Algorithms 1 and 2.

4.1 Initial training step

This section explains the initial training step for DCBB. In the initial training step we select the peaks and the two neighboring samples as positive samples, and a randomly selected sample of other AUs and non-AUs as negative samples. As in standard AdaBoost [14], we define the target false positive ratio (Fr), the maximum acceptable false positive ratio per cascade stage (fr), and the minimum acceptable true positive ratio per cascade stage (dr). The initial training step applies standard AdaBoost using CART [46] as a weak classifier, as summarized in Algorithm 1.


Input:
• Positive data set P0 (contains AU peak frames p and p ± 1);
• Negative data set Q (contains other AUs and non-AUs);
• Target false positive ratio Fr;
• Maximum acceptable false positive ratio per cascade stage fr;
• Minimum acceptable true positive ratio per cascade stage dr.

Initialize:
• Current cascade stage number t = 0;
• Current overall cascade classifier's true positive ratio Dt = 1.0;
• Current overall cascade classifier's false positive ratio Ft = 1.0;
• S0 = {P0, Q0} is the initial working set, Q0 ⊂ Q. The number of positive samples is Np; the number of negative samples is Nq = Np × R0, with R0 = 8.

While Ft > Fr
1) t = t + 1; ft = 1.0; normalize the weights ωt,i for each sample xi to guarantee that ωt = {ωt,i} is a distribution.
2) While ft > fr
   a) For each feature φm, train a weak classifier on S0 and find the best feature φi (the one with the minimum classification error).
   b) Add the feature φi into the strong classifier Ht and update the weights in the Gentle AdaBoost manner.
   c) Evaluate the current strong classifier Ht on S0 and adjust the rejection threshold under the constraint that the true positive ratio does not drop below dr.
   d) Decrease the threshold until dr is reached.
   e) Compute ft under this threshold.
   END While
3) Ft+1 = Ft × ft; Dt+1 = Dt × dr; keep in Q0 the negative samples that the current strong classifier Ht misclassified (current false positive samples) and record their number as Kfq.
4) Use the detector Ht to bootstrap false positive samples from the negative set Q randomly, repeating until the negative working set has Nq samples.
END While

Output:
A t-level cascade where each level has a strong boosted classifier with a set of rejection thresholds for each weak classifier.

Algorithm 1: Initial learning
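The inner loop of Algorithm 1 (step 2) can be sketched as Gentle AdaBoost rounds with CART regression trees as weak learners. The sketch below assumes feature vectors in X and labels y in {+1, -1} as NumPy arrays, and it omits the cascade bookkeeping (rejection thresholds, Ft, Dt, and negative bootstrapping), so it is an illustration rather than the authors' implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gentle_adaboost_stage(X, y, n_weak=50, max_depth=2):
    """One boosted strong classifier (a single cascade stage) trained with
    Gentle AdaBoost, using CART regression trees as weak learners.
    X: (n_samples, n_features); y: labels in {+1, -1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                  # sample weights, kept as a distribution
    weak_learners = []
    for _ in range(n_weak):
        # Gentle AdaBoost weak fit: weighted least-squares regression of y on X.
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, y, sample_weight=w)
        f = tree.predict(X)
        # Re-weight: samples the weak learner gets wrong (y * f small or negative) gain weight.
        w *= np.exp(-y * f)
        w /= w.sum()
        weak_learners.append(tree)

    def strong_classifier(X_new, threshold=0.0):
        """Sum of weak responses; the rejection threshold would be tuned per stage."""
        score = np.sum([t.predict(X_new) for t in weak_learners], axis=0)
        return np.where(score > threshold, 1, -1), score

    return strong_classifier
```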

4.2 Dynamic learning

Once a cascade of peak frame detectors has been learned in the initial learning stage (Section 4.1), we are able to enlarge the positive set to increase the discriminative performance of the whole classifier. The AU frame detector becomes stronger as new AU positive samples are added for training. We add an additional constraint to avoid adding ambiguous AU frames to the dynamic positive set. The algorithm is summarized in Algorithm 2.

Input:
• Cascade detector H0 from the initial learning step;
• Dynamic working set SD = {PD, QD};
• All the frames in this action unit are represented as potential positive samples in the set P = {Ps, Pv}. Ps contains the strong positive samples; P0 contains the peak-related samples described above, P0 ⊂ Ps. Pv contains ambiguous positive samples;
• A large negative data set Q contains all other AUs and non-AUs; its size is Na.

Update the positive working set by spreading within P and update the negative working set by bootstrapping in Q with the dynamic cascade learning process:

Initialize: Set Np to the size of P0. The size of the old positive data set is Np_old = 0 and the diffusion stage is t = 1.

While (Np − Np_old)/Np > 0.1
1) AU positive spreading: Np_old = Np. Use the current detector on the data set P to potentially add more positive samples; Psp is the set of all positive samples accepted by the cascade classifier Ht−1.
2) Hard constraint in spreading: k indexes the current AU event and i indexes the current frame in this AU event. Calculate the similarity values (Eq. 1) between the peak frame in event k and all peak frames with the lowest intensity value 'A', and denote the average similarity value by Sk. Calculate the similarity value Sik between frame i and the peak frame in event k; if Sik < 0.5 × Sk, frame i is excluded from Psp.
3) After the above step, the remaining positive working set is Pw = Psp and Np = size of Psp. Use the Ht−1 detector to bootstrap false positive samples from the negative set Q until the negative working set Qw has Nq = β × R0 × Na samples; Na differs across AUs.
4) Train the cascade classifier Ht with the dynamic working set {Pw, Qw}. As Rt varies, the maximum acceptable false positive ratio per cascade stage fmr also becomes smaller (Eq. 2).
5) t = t + 1; empty Pw and Qw.
END While

Algorithm 2: Dynamic learning
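Putting Algorithm 2 together, here is a compressed sketch of the dynamic-learning loop; the data layout (per-event dictionaries), the train_cascade stand-in for Algorithm 1, the use of H.predict returning 1 for accepted frames, and the choice of Rt as the negative-to-positive ratio are all assumptions made for illustration.

```python
import numpy as np

def dcbb_dynamic_learning(events, Q_pool, H0, train_cascade, Np0,
                          fr=0.5, alpha=0.2, beta=0.04, R0=8):
    """Sketch of the dynamic-learning loop (Algorithm 2); names and data layout
    are illustrative, not the authors' code.

    events: list of AU events, each a dict with
        'X'           - (n_frames, d) features for the frames from onset to offset,
        'sim_to_peak' - Eq. 1 similarity S_ik of each frame to the event peak,
        'S_k'         - average similarity of this peak to the intensity-'A' peaks.
    Q_pool: (m, d) features of the negative pool (other AUs and non-AUs).
    H0: detector from Algorithm 1; train_cascade(P, Q, f_mr) retrains it and is
        assumed to return an object whose predict() gives 1 for accepted frames."""
    H, Np, Np_old = H0, Np0, 0
    Na = len(Q_pool)
    while Np > 0 and (Np - Np_old) / Np > 0.1:   # stop when the positive set stabilizes
        Np_old = Np
        # 1) Positive spreading, with 2) the hard constraint on similarity to the peak.
        P_w = np.vstack([ev['X'][(H.predict(ev['X']) == 1) &
                                 (ev['sim_to_peak'] >= 0.5 * ev['S_k'])]
                         for ev in events])
        Np = len(P_w)
        # 3) Bootstrap hard negatives: false positives of the current detector.
        Nq = int(beta * R0 * Na)
        hard = Q_pool[H.predict(Q_pool) == 1]
        pick = np.random.choice(len(hard), size=min(Nq, len(hard)), replace=False)
        Q_w = hard[pick]
        # 4) Retrain with a tightened per-stage false-positive target (Eq. 2),
        #    taking R_t as the negative-to-positive ratio (an assumption).
        R_t = len(Q_w) / max(Np, 1)
        f_mr = fr * (1.0 - np.exp(-alpha * R_t))
        H = train_cascade(P_w, Q_w, f_mr)
    return H
```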

The function used to measure the similarity between the peak frame and other frames is the radial basis function between the appearance representations of the two frames:


S_k = \frac{1}{n} \sum_{j=1}^{n} Sim(f_k, f_j), \quad j \in [1:n]

S_{ik} = Sim(f_i, f_k) = e^{-\left( Dist(i,k) / \max(Dist(:,k)) \right)^2}

Dist(i,k) = \left[ \sum_{j=1}^{m} (f_{kj} - f_{ij})^2 \right]^{1/2}, \quad j \in [1:m]    (1)

where n refers to the total number of AU instances with intensity 'A', and m is the length of the AU feature vector.
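A direct transcription of Eq. 1 as a sketch (variable names are ours): F stacks the appearance features of the frames over which the normalization max(Dist(:, k)) is taken, and the averaging over intensity-'A' peaks follows one reading of the S_k definition.

```python
import numpy as np

def rbf_similarity(F, i, k):
    """Eq. 1: similarity S_ik between frame i and peak frame k.
    F is an (n_frames, m) matrix of appearance features; the normalization
    max(Dist(:, k)) is taken over all frames stacked in F."""
    dist_to_k = np.linalg.norm(F - F[k], axis=1)            # Dist(:, k)
    return np.exp(-(dist_to_k[i] / dist_to_k.max()) ** 2)   # S_ik

def average_peak_similarity(F, k, a_peak_indices):
    """S_k: mean similarity between peak frame k and the n peaks of intensity-'A' events."""
    return np.mean([rbf_similarity(F, j, k) for j in a_peak_indices])
```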

The dynamic positive working set becomes larger but the negative sample pool is finite, so fmr needs to be changed dynamically. Moreover, Nq is a function of Na because different AUs have negative sample pools of different sizes; some AUs (e.g., AU12) are likely to occur more often than others. Rather than tuning these thresholds one by one, we assume that the false positive ratio fmr changes exponentially with each stage t, that is:

f_{mr} = f_r \times (1 - e^{-\alpha R_t})    (2)

In our experimental setup, we set α to 0.2 and β to 0.04, respectively. We found empirically that these values avoid running out of useful negative samples for all AUs in the RU-FACS database. After the spreading stage, the ratio between positive and negative samples becomes balanced, except for some rare AUs (e.g., AU4, AU10), where the imbalance is due to the scarcity of positive frames in the database.

5 EXPERIMENTS

DCBB iteratively samples training frames and then uses Cascade AdaBoost for classification. To evaluate the efficacy of iteratively sampling training frames, Experiment 1 compared DCBB with the two standard approaches: selecting only peak frames and, alternatively, selecting all frames between onsets and offsets. Experiment 1 thus evaluated whether iteratively sampling training images increased AU detection. Experiment 2 evaluated the efficacy of Cascade AdaBoost relative to SVM when iteratively sampling training samples. In our implementation, DCBB uses Cascade AdaBoost, but other classifiers might be used instead; Experiment 2 informs whether better results might be achieved using SVM. Experiment 3 explored the use of three appearance descriptors (Gabor, SIFT, and DAISY) in conjunction with DCBB (Cascade AdaBoost as classifier).

5.1 Database

The three experiments all used the RU-FACS (a.k.a. M3) database [18, 19]. RU-FACS consists of video-recorded interviews of 34 men and women of varying ethnicity. Interviews were approximately two minutes in duration. Video from five subjects could not be processed for technical reasons (e.g., noisy video), which resulted in usable data from 29 participants. Meta-data included manual FACS codes for AU onsets, peaks, and offsets. Because some AUs occurred too infrequently, we selected the 13 AUs that occur more than 20 times in the database (i.e., 20 or more peaks). These AUs are: AU1, AU2, AU4, AU5, AU6, AU7, AU10, AU12, AU14, AU15, AU17, AU18, and AU23. Fig. 6 shows the number of frames in which each AU occurred and their average duration. Fig. 7 illustrates representative time series for several AUs from subject S010. Blue asterisks represent onsets, red circles peak frames, and green plus signs the offset frames. In each experiment, we randomly selected 19 subjects for training and the other 10 subjects for testing.

Fig. 6. Top) Number of frames from onset to offset for the 13 most frequent AUs in the M3 dataset. Bottom) The average duration of each AU in frames.

5.2 Experiment 1: Iteratively sampling training images with DCBB

This experiment illustrates the effect of iteratively selecting positive and negative samples (i.e., frames) while training the Cascade AdaBoost. As a sample selection mechanism on top of the Cascade AdaBoost, DCBB could be applied to other classifiers as well. The experiment investigates whether iteratively selecting training samples is better than using only peak AU frames or using all AU frames as positive samples. For each assignment of positive samples, the negative samples are defined by the method used in Cascade AdaBoost.

We apply the DCBB method described in Section 4 and use appearance features based on SIFT descriptors (Section 3). For all AUs, the SIFT descriptors are built using a square of 48 × 48 pixels at twenty feature points for the lower-face AUs or sixteen feature points for the upper face (see Fig. 4). We trained 13 dynamic cascade classifiers, one for each AU, as described in Section 4.2, using a one-versus-all scheme for each AU.

The top of Fig. 8 shows the manual labeling for AU12 of subject S015.



Fig. 7. AU characteristics in subject S010: duration, intensity, onset, offset, and peak (x axis in frames; y axis is AU intensity).


Fig. 8. The spreading of positive samples during each dynamic training step for AU12. See text for the explanation of the graphics.

We can see eight instances of AU12 with intensities ranging from A (weak) to E (strong). The black curves in the bottom plots represent the similarity (Eq. 1) between the peak and the neighboring frames; the peak is the maximum of each curve. The positive samples in the first step are represented by green asterisks, in the second iteration by red crosses, in the third iteration by blue crosses, and in the final iteration by black circles. Observe that in the cases of high peak intensity, subfigures 3 and 8 (top right number in the similarity plots), the final selected positive samples contain areas of low similarity values. When AU intensity is low, as in subfigure 7, positive samples are selected only if they have a high similarity with the peak, which reduces the number of false positives. The ellipses and rectangles in the figures contain frames that are selected as positive samples and correspond to the strong and subtle AUs defined above. The triangles correspond to frames between the onset and offset that are not selected as positive samples and represent ambiguous AUs, as in Fig. 1.

Table 1 shows the number of frames at each level of intensity and the percentage of each intensity that were selected as positive by DCBB in the training set. The letters 'A' to 'E' in the left column refer to the levels of AU intensity.


TABLE 1
Number of frames at each level of intensity and the percentage of each intensity that were selected as positive by DCBB

     AU1           AU2           AU4          AU5          AU6           AU7
     Num    Pct    Num    Pct    Num   Pct    Num   Pct    Num    Pct    Num    Pct
A    986    46.3%  845    43.6%  325   37.1%  293   86.4%  119    40.2%  98     47.9%
B    3513   83.3%  3130   79.4%  912   70.4%  371   97.7%  1511   71.3%  705    93.4%
C    3645   90.9%  3241   91.3%  439   83.8%  110   99.7%  1851   87.8%  1388   97.9%
D    2312   90.3%  1753   95.6%  73    79.6%  0     NaN    1128   84.2%  537    98.7%
E    151    99.3%  601    96.5%  0     NaN    23    96.6%  118    91.4%  153    98.5%

     AU10          AU12          AU14          AU15          AU17          AU18         AU23
     Num    Pct    Num    Pct    Num    Pct    Num    Pct    Num    Pct    Num   Pct    Num   Pct
A    367    61.3%  1561   38.1%  309    49.7%  402    72.6%  445    34.1%  134   87.4%  25    77.0%
B    2131   90.8%  4293   73.3%  2240   93.9%  1053   89.7%  2553   70.1%  665   99.2%  190   87.8%
C    1066   92.8%  5432   82.0%  1040   95.6%  611    97.4%  1644   76.4%  240   98.7%  198   97.3%
D    572    88.6%  2234   83.5%  268    96.8%  68     98.6%  966    85.3%  83    98.6%  166   98.9%
E    0      NaN    1323   92.4%  232    94.3%  0      NaN    226    80.2%  0     NaN    361   99.1%

'Num' and 'Pct' in the top rows refer to the number of AU frames at each level of intensity and to the percentage of frames that were selected as positive examples at the last iteration of the dynamic learning step, respectively. If there were no AU frames at a level of intensity, the 'Pct' value is 'NaN'. From this table, we can see that DCBB emphasizes intensity levels 'C', 'D', and 'E'.

Fig. 9 shows the Receiver Operating Characteristic (ROC) curves on the testing data (subjects not in the training set) using DCBB. The ROC curves were obtained by plotting the true positive ratio against the false positive ratio for different decision threshold values of the classifier. Results are shown for each AU. In each figure, five or six ROC curves are shown: Initial Learning corresponds to training on only the peaks (which is the same as Cascade AdaBoost without the DCBB strategy); Spread x corresponds to running DCBB x times; All denotes using all frames between onset and offset. In each legend of Fig. 9, the first number denotes the area under the ROC; the second entry gives the number of positive samples followed, after the slash, by the number of negative samples in the testing dataset; the third entry gives the number of positive samples in the training working set followed, after the slash, by the total number of frames of the target AU in the training dataset. AU5 has the minimum number of training examples and AU12 the largest. We can observe that the area under the ROC for frame-by-frame detection improved gradually during each learning stage; performance improved faster for AU4, AU5, AU10, AU14, AU15, AU17, AU18, and AU23 than for AU1, AU2, AU6, and AU12 during dynamic learning. Note that using all the frames between the onset and offset ('All') typically degraded detection performance. The table below Fig. 9 shows the areas under the ROC when only the peak is selected for training, when all frames are selected, and when DCBB is used. DCBB outperformed both alternatives.
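For reference, the frame-level ROC evaluation described here corresponds to the following minimal sketch with scikit-learn (our code, assuming per-frame detector scores and binary labels for the test subjects):

```python
from sklearn.metrics import roc_curve, auc

def evaluate_detector(y_true, scores):
    """y_true: 1 for frames labeled as the target AU, 0 otherwise.
    scores: real-valued detector outputs for the same frames.
    Sweeping the decision threshold over the scores traces the ROC curve."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    return fpr, tpr, auc(fpr, tpr)   # false/true positive ratios and area under the ROC
```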

Two parameters in the DCBB algorithm were manually tuned and remained the same in all experiments. One parameter specifies the minimum similarity value below which no positive sample will be selected. This threshold is based on the similarity equation (Eq. 1). It was set at 0.5 after preliminary testing found results stable within a range of 0.2 to 0.6. The second parameter is the stopping criterion. We consider that the algorithm has converged when the number of new positive samples between iterations is less than 10%. In preliminary experiments, values less than 15% failed to change detection performance as indicated by the ROC curves. Using this stopping criterion, the algorithm typically converged within three or four iterations.

5.3 Experiment 2: Comparing Cascade AdaBoost and Support Vector Machine (SVM) when used with iteratively sampled training frames

SVM and AdaBoost are two commonly used classifiers for AU detection. In this experiment, we compared AU detection by DCBB (Cascade AdaBoost as classifier) and SVM using shape and appearance features. We found that for both types of features (i.e., shape and appearance), DCBB achieved more accurate AU detection.

For compatibility with the previous experiment, data from the same 19 subjects as above were used for training and the other 10 for testing. Results are reported for the 13 most frequently observed AUs. Other AUs occurred too infrequently (i.e., fewer than 20 occurrences) to obtain reliable results and thus were omitted. Classifiers for each AU were trained using a one-versus-all strategy. The ROC curves for the 13 AUs are shown in Fig. 10. For each AU, six curves are shown, one for each combination of training features and classifiers. 'App+DCBB' refers to DCBB using appearance features; 'Peak+Shp+SVM' refers to an SVM using shape features trained on the peak frames (and two adjacent frames) [10]; 'Peak+App+SVM' [10] refers to an SVM using appearance features trained on the peak frame (and two adjacent frames); 'All+Shp+SVM' refers to an SVM using shape features trained on all frames between onset and offset; 'All+PCA+App+SVM' refers to an SVM using appearance features trained on all frames between onset and offset, where, in order to fit the problem in memory, we reduced the dimensionality of the appearance features using principal component analysis (PCA), preserving 98% of the energy.


Fig. 9. The ROCs improve with the spreading of positive samples. See text for the explanation of Peak, Spread x, and All.

Areas under the ROC for each AU when training on peak frames only (Peak), on all frames from onset to offset (All), and with DCBB:

      AU1   AU2   AU4   AU5   AU6   AU7   AU10  AU12  AU14  AU15  AU17  AU18  AU23
Peak  0.75  0.71  0.53  0.49  0.56  0.52  0.83  0.93  0.50  0.52  0.59  0.57  0.34
All   0.69  0.71  0.58  0.75  0.66  0.64  0.90  0.95  0.69  0.83  0.80  0.77  0.40
DCBB  0.76  0.75  0.76  0.77  0.69  0.72  0.92  0.97  0.72  0.86  0.81  0.86  0.75


Fig. 10. ROC curves for 13 AUs using six different methods: AU peak frames with shape features and SVM (Peak+Shp+SVM); all frames between onset and offset with shape features and SVM (All+Shp+SVM); AU peak frames with appearance features and SVM (Peak+App+SVM); sampling 1 frame in every 4 frames between onset and offset, with PCA to reduce appearance dimensionality, and SVM (All+PCA+App+SVM); AU peak frames with appearance features and Cascade AdaBoost (Peak+App+Cascade Boost); DCBB with appearance features (App+DCBB).

Areas under the ROC for each AU and method:

                        AU1   AU2   AU4   AU5   AU6   AU7   AU10  AU12  AU14  AU15  AU17  AU18  AU23
Peak+Shp+SVM            0.71  0.62  0.76  0.58  0.64  0.61  0.89  0.93  0.57  0.73  0.66  0.86  0.74
All+Shp+SVM             0.68  0.65  0.78  0.55  0.64  0.67  0.77  0.88  0.72  0.61  0.72  0.88  0.67
Peak+App+SVM            0.43  0.45  0.61  0.86  0.67  0.63  0.90  0.77  0.50  0.69  0.53  0.95  0.62
All+PCA+App+SVM         0.74  0.74  0.85  0.89  0.63  0.54  0.91  0.96  0.82  0.86  0.78  0.94  0.54
Peak+App+Cascade Boost  0.75  0.71  0.53  0.49  0.56  0.52  0.83  0.93  0.50  0.52  0.59  0.57  0.34
App+DCBB                0.76  0.75  0.76  0.77  0.69  0.72  0.92  0.97  0.72  0.86  0.81  0.86  0.75


'Peak+App+Cascade Boost' refers to using the peak frame with appearance features and the Cascade AdaBoost [14] classifier (equivalent to the first step in DCBB). As can be observed in the figure, DCBB outperformed SVM for all AUs (except AU18) using either shape or appearance when training on the peak (and two adjacent frames). When training the SVM with shape features and using all samples, DCBB performed better for eleven out of thirteen AUs. In the SVM training, the negative samples were selected randomly (but the same negative samples were used with both shape and appearance features). The ratio between positive and negative samples was fixed at 30. Compared with the Cascade AdaBoost (the first step in DCBB, which uses only the peak and two neighboring samples), DCBB improved the performance for all AUs.

Interestingly, the performance for AU4, AU5, AU14, and AU18 using the method 'All+PCA+App+SVM' was better than 'DCBB'. 'All+PCA+App+SVM' uses appearance features and all samples between onset and offset. All parameters in the SVM were selected using cross-validation. It is interesting to observe that AU4, AU5, AU14, AU15, and AU18 are the AUs that have very few total training samples (only 1749 total frames for AU4, 797 frames for AU5, 4089 frames for AU14, 2134 frames for AU15, and 1122 frames for AU18), and when there is very little training data the classifier can benefit from using all samples. Moreover, the PCA step can help to remove noise. It is worth pointing out that the best results were achieved by the classifiers using appearance (SIFT) instead of shape features, which suggests SIFT may be more robust to residual head pose variation in the normalized images. Using unoptimized MATLAB code, training DCBB typically required one to three hours depending on the AU and number of training samples.

5.4 Experiment 3: Appearance descriptors for DCBB

Our findings and those of others suggest that appearance features are more robust than shape features in unposed video with small to moderate head motion. In the work described above, we used SIFT to represent appearance. In this section, we compare SIFT to two alternative appearance representations, DAISY and Gabor [24, 41].

Similar in spirit to SIFT descriptors, DAISY descriptors are efficient feature descriptors based on histograms. They have frequently been used to match stereo images [48]. DAISY descriptors use circular grids instead of the regular grids in SIFT; the former have been found to have better localization properties [49] and to outperform many state-of-the-art feature descriptors for sparse point matching [50]. At each pixel, DAISY builds a vector made of values from the convolved orientation maps located on concentric circles centered on the location. The amount of Gaussian smoothing is proportional to the radius of the circles.

In the following experiment, we compare the performance on AU detection for three appearance representations, Gabor, SIFT and DAISY, using DCBB with Cascade AdaBoost. Each was computed at the same locations (twenty feature points in the lower face, see Fig. 12). The same 19 subjects as before were used for training and 10 for testing. Fig. 11 shows the ROC detection curves for the lower AUs using Gabor filters at eight different orientations and five different scales, DAISY [50] and SIFT [44]. For most AUs, the ROC curves for SIFT and DAISY were comparable. With respect to processing speed, DAISY was faster.
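For illustration, the sketch below samples the magnitude responses of a Gabor bank with eight orientations and five scales at a set of landmark locations; the filter frequencies and the landmark coordinates are placeholder assumptions, since the exact parameters used in our experiments are not specified here.

```python
import numpy as np
from scipy.signal import fftconvolve
from skimage.filters import gabor_kernel

def gabor_features_at_points(gray, points, n_orient=8,
                             frequencies=(0.05, 0.08, 0.12, 0.18, 0.27)):
    """Sample an 8-orientation x 5-scale Gabor bank at (row, col) landmarks.

    Illustrative sketch: frequencies above are placeholder choices, not the
    parameters used in the paper.
    """
    gray = gray.astype(float)
    feats = []
    for freq in frequencies:
        for k in range(n_orient):
            kernel = gabor_kernel(freq, theta=np.pi * k / n_orient)
            # Magnitude of the complex filter response over the whole image.
            real = fftconvolve(gray, kernel.real, mode="same")
            imag = fftconvolve(gray, kernel.imag, mode="same")
            mag = np.hypot(real, imag)
            feats.append([mag[int(r), int(c)] for r, c in points])
    # Flatten to one vector: 40 filter responses at each landmark point.
    return np.asarray(feats).T.reshape(-1)
```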

Fig. 12. Using different numbers of feature points: ROC curves (true positive rate vs. false positive rate). Legend: 20 Points | 0.92 | 12696/44625; 18 Points | 0.91 | 12696/44625; 12 Points | 0.86 | 12696/44625; 2 Points | 0.83 | 12696/44625.

An important parameter largely conditioning the performance of appearance-based descriptors is the number and location of the features selected to build the representation. For computational reasons it is impractical to build appearance representations in all possible regions of interest. To evaluate the influence of the number of regions selected, we varied the number of regions, or points, in the mouth area from 2 to 20. As shown in Fig. 12, large red circles represent the location of two feature regions; adding blue squares and red circles increases the number to 12; adding purple triangles and small black squares increases the number to 18 and 20, respectively. Comparing results for 2, 12, 18, and 20 regions, or points, performance was consistent with intuition. While there were some exceptions, in general performance improved monotonically with the number of regions (Fig. 12).

6 CONCLUSIONS

A relatively unexplored problem that is critical to the success of automatic action unit detection is the selection of positive and negative training samples. This paper proposes dynamic cascades with bidirectional bootstrapping (DCBB) for this purpose. With few exceptions, DCBB achieved better detection performance than the standard approaches of selecting either peak frames or all frames between onsets and offsets. We also compared three commonly used appearance features and the number of regions of interest, or points, to which they were applied. We found that SIFT and DAISY improved accuracy relative to Gabor for the same number of sampling points. For all three, increasing the number of regions or points monotonically improved performance. These findings suggest that DCBB, when used with appearance features, especially SIFT or DAISY, can improve AU detection relative to standard methods for selecting training samples.



Fig. 11. Comparison of ROC curves (true positive rate vs. false positive rate) for 13 AUs using different appearance representations based on SIFT, DAISY, and Gabor.

Several issues remain unsolved. The final Cascade AdaBoost classifier has automatically selected the features and training samples that improve classification performance on the training data. During the iterative training, we obtain an intermediate set of classifiers (usually from 3 to 5) that could potentially be used to model the dynamic pattern of AU events by measuring the amount of overlap in the resulting labels. Additionally, our sample selection strategy could easily be applied to other classifiers such as SVMs or Gaussian processes. Moreover, we plan to explore the use of these techniques in other computer vision problems, such as activity recognition, where the selection of positive and negative samples might play an important role in the results.

ACKNOWLEDGMENTS

The work was performed when the first author was at Carnegie Mellon University with support from NIH grant 51435. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Institutes of Health. Thanks to Tomas Simon, Feng Zhou, and Zengyin Zhang for helpful comments.

REFERENCES

[1] P. Ekman and W. Friesen, "Facial action coding system: A technique for the measurement of facial movement." Consulting Psychologists Press, 1978.
[2] J. F. Cohn, Z. Ambadar, and P. Ekman, Observer-based measurement of facial expression with the Facial Action Coding System, in The handbook of emotion elicitation and assessment. New York: Oxford University Press Series in Affective Science, 2007.
[3] J. F. Cohn and P. Ekman, "Measuring facial action by manual coding, facial EMG, and automatic facial image analysis," in Handbook of nonverbal behavior research methods in the affective sciences, 2005.
[4] P. Ekman and E. L. Rosenberg, What the face reveals: Basic and applied studies of spontaneous expression using the Facial Action Coding System (FACS). Oxford University Press, USA, 2005.
[5] M. Pantic, N. Sebe, J. F. Cohn, and T. Huang, "Affective multimodal human-computer interaction," in ACM International Conference on Multimedia, 2005, pp. 669–676.
[6] W. Zhao and R. Chellappa, Eds., Face Processing: Advanced Modeling and Methods. Elsevier, 2006.
[7] S. Li and A. Jain, Handbook of face recognition. New York: Springer, 2005.
[8] Y. Tian, J. F. Cohn, and T. Kanade, Facial expression analysis, in Handbook of face recognition, S. Z. Li and A. K. Jain, Eds. New York: Springer, 2005.
[9] Y. Tian, T. Kanade, and J. F. Cohn, "Recognizing action units for facial expression analysis," vol. 23, no. 1, pp. 97–115, 2001.
[10] S. Lucey, A. B. Ashraf, and J. F. Cohn, "Investigating spontaneous facial action recognition through AAM representations of the face," in Face Recognition, K. Delac and M. Grgic, Eds. Vienna: I-TECH Education and Publishing, 2007, pp. 275–286.
[11] M. S. Bartlett, G. Littlewort, M. Frank, C. Lainscsek, I. R. Fasel, and J. R. Movellan, "Recognizing facial expression: Machine learning and application to spontaneous behavior," in CVPR05, 2005, pp. 568–573.
[12] M. F. Valstar and M. Pantic, "Combined support vector machines and hidden Markov models for modeling facial action temporal dynamics," in ICCV07, 2007, pp. 118–127.
[13] Y. Tong, W. H. Liao, and Q. Ji, "Facial action unit recognition by exploiting their dynamic and semantic relationships," IEEE TPAMI, pp. 1683–1699, 2007.
[14] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in CVPR01, 2001, pp. 511–518.
[15] R. Xiao, H. Zhu, H. Sun, and X. Tang, "Dynamic cascades for face detection," in ICCV07, 2007, pp. 1–8.
[16] S. Y. Yan, S. G. Shan, X. L. Chen, W. Gao, and J. Chen, "Matrix-structural learning (MSL) of cascaded classifier from enormous training set," in CVPR07, 2007, pp. 1–7.
[17] Y. F. Zhu, F. De la Torre, and J. F. Cohn, "Dynamic cascades with bidirectional bootstrapping for spontaneous facial action unit detection," in Affective Computing and Intelligent Interaction (ACII), September 2009.
[18] M. G. Frank, M. S. Bartlett, and J. R. Movellan, "The M3 database of spontaneous emotion expression," 2011.
[19] M. Bartlett, G. Littlewort, M. Frank, C. Lainscsek, I. Fasel, and J. Movellan, "Automatic recognition of facial actions in spontaneous expressions," Journal of Multimedia, 2006.
[20] M. J. Black and Y. Yacoob, "Recognizing facial expressions in image sequences using local parameterized models of image motion," IJCV97, vol. 25, no. 1, pp. 23–48, 1997.
[21] F. De la Torre, Y. Yacoob, and L. Davis, "A probabilistic framework for rigid and non-rigid appearance based tracking and recognition," in Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition (FG'00), 2000, pp. 491–498.
[22] Y. Chang, C. Hu, R. Feris, and M. Turk, "Manifold based analysis of facial expression," in CVPRW04, 2004, pp. 81–89.
[23] C. Lee and A. Elgammal, "Facial expression analysis using nonlinear decomposable generative models," in IEEE International Workshop on Analysis and Modeling of Faces and Gestures, 2005, pp. 17–31.
[24] M. Bartlett, G. Littlewort, I. Fasel, J. Chenu, and J. Movellan, "Fully automatic facial action recognition in spontaneous behavior," in Proceedings of the Seventh IEEE International Conference on Automatic Face and Gesture Recognition (FG'06), 2006, pp. 223–228.
[25] M. Pantic and L. Rothkrantz, "Facial action recognition for facial expression analysis from static face images," IEEE Transactions on Systems, Man, and Cybernetics, pp. 1449–1461, 2004.
[26] M. Pantic and I. Patras, "Dynamics of facial expression: Recognition of facial actions and their temporal segments from face profile image sequences," IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics, vol. 36, pp. 433–449, 2006.
[27] P. Yang, Q. Liu, X. Cui, and D. Metaxas, "Facial expression recognition using encoded dynamic features," in CVPR08, 2008.
[28] Y. Sun and L. J. Yin, "Facial expression recognition based on 3D dynamic range model sequences," in ECCV08, 2008, pp. 58–71.
[29] B. Braathen, M. Bartlett, G. Littlewort, and J. Movellan, "First steps towards automatic recognition of spontaneous facial action units," in Proceedings of the ACM Conference on Perceptual User Interfaces, 2001.
[30] P. Lucey, J. F. Cohn, S. Lucey, S. Sridharan, and K. M. Prkachin, "Automatically detecting action units from faces of pain: Comparing shape and appearance features," CVPRW09, pp. 12–18, 2009.
[31] J. F. Cohn and T. Kanade, "Automated facial image analysis for measurement of emotion expression," in The handbook of emotion elicitation and assessment. Oxford University Press Series in Affective Science, J. A. Coan and J. B. Allen, Eds., pp. 222–238.
[32] J. F. Cohn and M. A. Sayette, "Spontaneous facial expression in a small group can be automatically measured: An initial demonstration," Behavior Research Methods, vol. 42, no. 4, pp. 1079–1086, 2010.
[33] M. F. Valstar, I. Patras, and M. Pantic, "Facial action unit detection using probabilistic actively learned support vector machines on tracked facial point data," in CVPRW05, 2005, pp. 76–83.
[34] M. Valstar and M. Pantic, "Fully automatic facial action unit detection and temporal analysis," in CVPR06, 2006, pp. 149–157.
[35] S. Lucey, I. Matthews, C. Hu, Z. Ambadar, F. De la Torre, and J. F. Cohn, "AAM derived face representations for robust facial action recognition," in Proceedings of the Seventh IEEE International Conference on Automatic Face and Gesture Recognition (FG'06), 2006, pp. 155–160.
[36] Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang, "A survey of affect recognition methods: Audio, visual, and spontaneous expressions," IEEE TPAMI, pp. 39–58, 2009.
[37] T. F. Cootes, G. Edwards, and C. Taylor, "Active appearance models," in ECCV98, 1998, pp. 484–498.
[38] F. De la Torre and M. Nguyen, "Parameterized kernel principal component analysis: Theory and applications to supervised and unsupervised image alignment," in CVPR08, 2008.
[39] I. Matthews and S. Baker, "Active appearance models revisited," IJCV04, pp. 135–164, 2004.
[40] V. Blanz and T. Vetter, "A morphable model for the synthesis of 3D faces," in SIGGRAPH99, 1999.
[41] Y. L. Tian, T. Kanade, and J. F. Cohn, "Evaluation of Gabor-wavelet-based facial action unit recognition in image sequences of increasing complexity," in Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition (FG'02). Springer, 2002, pp. 229–234.
[42] A. B. Ashraf, S. Lucey, T. Chen, K. Prkachin, P. Solomon, Z. Ambadar, and J. F. Cohn, "The painful face: Pain expression recognition using active appearance models," in Proceedings of the ACM International Conference on Multimodal Interfaces (ICMI'07), 2007, pp. 9–14.
[43] P. Lucey, J. F. Cohn, S. Lucey, S. Sridharan, and K. M. Prkachin, "Automatically detecting pain using facial actions," in International Conference on Affective Computing and Intelligent Interaction (ACII2009), 2009.
[44] D. Lowe, "Object recognition from local scale-invariant features," in ICCV99, 1999, pp. 1150–1157.
[45] K. Sung and T. Poggio, "Example-based learning for view-based human face detection," IEEE TPAMI, pp. 39–51, 1998.
[46] L. Breiman, Classification and regression trees. Chapman & Hall/CRC, 1998.
[47] R. Schapire and Y. Freund, "Experiments with a new boosting algorithm," in Machine Learning: Proceedings of the Thirteenth International Conference (ICML'96). Morgan Kaufmann, 1996, p. 148.
[48] E. Tola, V. Lepetit, and P. Fua, "A fast local descriptor for dense matching," in CVPR08, 2008.
[49] K. Mikolajczyk and C. Schmid, "A performance evaluation of local descriptors," IEEE TPAMI, vol. 27, no. 10, pp. 1615–1630, 2005.
[50] E. Tola, V. Lepetit, and P. Fua, "Daisy: An efficient dense descriptor applied to wide baseline stereo," IEEE TPAMI, vol. 99, no. 1, pp. 3451–3467, 2009.

Yunfeng Zhu received the B.Sc. degree in Information Engineering and the M.Sc. degree in Pattern Recognition and Intelligent Systems from Xi'an Jiaotong University in 2003 and 2006, respectively. He is currently a Ph.D. candidate at the Image and Graphics Institute, in the Department of Electronic Engineering at Tsinghua University, Beijing, China. In 2008 and 2009, he was a visiting student in the Robotics Institute at Carnegie Mellon University (CMU), Pittsburgh, PA. His research interests include computer vision, machine learning, facial expression recognition and 3D facial model reconstruction.

Fernando De la Torre received the B.Sc. degree in telecommunications and the M.Sc. and Ph.D. degrees in electronic engineering from Enginyeria La Salle at Universitat Ramon Llull, Barcelona, Spain, in 1994, 1996, and 2002, respectively. In 1997 and 2000, he became an Assistant and an Associate Professor, respectively, in the Department of Communications and Signal Theory at Enginyeria La Salle. Since 2005, he has been a Research Faculty member at the Robotics Institute, Carnegie Mellon University (CMU), Pittsburgh, PA. His research interests include machine learning, signal processing, and computer vision, with a focus on understanding human behavior from multimodal sensors. He directs the Human Sensing Lab (http://humansensing.cs.cmu.edu) and the Component Analysis Lab at CMU (http://ca.cs.cmu.edu). He co-organized the first workshop on component analysis methods for modeling, classification, and clustering problems in computer vision in conjunction with CVPR-07, and the workshop on human sensing from video in conjunction with CVPR-06. He has also given several tutorials at international conferences on the use and extensions of component analysis methods.

Jeffrey F. Cohn earned the PhD degree in clinical psychology from the University of Massachusetts in Amherst. He is a professor of psychology and psychiatry at the University of Pittsburgh and an Adjunct Faculty member at the Robotics Institute, Carnegie Mellon University. For the past 20 years, he has conducted investigations in the theory and science of emotion, depression, and nonverbal communication. He has co-led interdisciplinary and interinstitutional efforts to develop advanced methods of automated analysis of facial expression and prosody and applied these tools to research in human emotion and emotion disorders, communication, biomedicine, biometrics, and human-computer interaction. He has published more than 120 papers on these topics. His research has been supported by grants from the US National Institutes of Mental Health, the US National Institute of Child Health and Human Development, the US National Science Foundation, the US Naval Research Laboratory, and the US Defense Advanced Research Projects Agency. He is a member of the IEEE and the IEEE Computer Society.

Yu-Jin Zhang received the Ph.D. degree in Applied Science from the State University of Liège, Liège, Belgium, in 1989. From 1989 to 1993, he was a post-doctoral fellow and research fellow with the Department of Applied Physics and the Department of Electrical Engineering at Delft University of Technology, Delft, the Netherlands. In 1993, he joined the Department of Electronic Engineering at Tsinghua University, Beijing, China, where he has been a professor of Image Engineering since 1997. In 2003, he spent his sabbatical year as a visiting professor in the School of Electrical and Electronic Engineering at Nanyang Technological University (NTU), Singapore. His research interests are mainly in the area of Image Engineering, which includes image processing, image analysis and image understanding, as well as their applications. His current interests are in object segmentation from images and video, segmentation evaluation and comparison, moving object detection and tracking, face recognition, facial expression detection/classification, content-based image and video retrieval, information fusion for high-level image understanding, etc. He is vice president of the China Society of Image and Graphics and director of the academic committee of the Society. He is also a Fellow of SPIE.