JOURNAL OF AFFECTIVE COMPUTING, VOL. XX, NO. XX, APRIL 2015

Modeling Dynamics of Expressive Body Gestures In Dyadic Interactions

    Zhaojun Yang, Student Member, IEEE, and Shrikanth Narayanan, Fellow, IEEE

Abstract—Body gestures are an important non-verbal expression channel during affective communication. They convey human attitudes and emotions as they dynamically unfold during an interpersonal interaction. Hence, it is highly desirable to understand the dynamics of body gestures associated with emotion expression in human interactions. We present a statistical framework for robustly modeling the dynamics of body gestures in dyadic interactions. Our framework is based on high-level semantic gesture patterns and consists of three components. First, we construct a universal background model (UBM) using Gaussian mixture modeling (GMM) to represent subject-independent gesture variability. Next, we describe each gesture sequence as a concatenation of semantic gesture patterns which are derived from a parallel HMM structure. Then, we probabilistically compare the segments of each gesture sequence extracted from the second step with the UBM obtained from the first step, in order to select highly probabilistic gesture patterns for the sequence. The dynamics of each gesture sequence are represented by a statistical variation profile computed from the selected patterns, and are further described in a well-defined kernel space. This framework is compared with three baseline models and is evaluated in emotion recognition experiments, i.e., recognizing the overall emotional state of a participant in a dyadic interaction from the gesture dynamics. The recognition performance demonstrates the superiority of the proposed framework over the baseline models. The analysis of the relationship between the emotion recognition performance and the number of selected segments also indicates that a few local salient events, rather than the whole gesture sequence, are sufficiently informative to trigger the human summarization of their overall global emotion perception.

    Index Terms—Body gesture, gesture patterns, gesture dynamics, emotion recognition, motion capture.


    1 INTRODUCTION

In human communication, body gestures are an essential element of non-verbal behavior to express interpersonal attitudes, feelings and affect [1] [2] [3]. Research on recognizing emotions using body gesture expressions has hence received much interest [4] [5]. Existing research has mostly focused on specific types of short-term body gestures, e.g., knocking, walking or waving, suggesting that body gestures are emotion-specific to some extent [6] [7]. A quantitative knowledge of how body gestures are involved in emotion communication as they dynamically unfold during an interpersonal interaction is however still largely understudied. Understanding the dynamics of body gestures in the context of emotion expression can have a significant impact on automatic emotion recognition as well as the design of advanced human-machine interfaces. For example, the pedagogical agent developed in [8] incorporates a probabilistic model of a user's affect. By monitoring the user's affect in the interactions during educational games, the agent adjusts the decisions about generating appropriate interventions to improve the user's learning effectiveness.

The major challenge in modeling dynamics of body gestures is the high degree of gesture variability. Human body gestures are complex in nature in terms of both temporal patterning and spatial details, varying both within and across individuals as well as over different time scales. The gestures produced during interpersonal interactions are even more intricate due to interaction-context factors

• Zhaojun Yang and Shrikanth Narayanan are with the Department of Electrical Engineering, University of Southern California, Los Angeles, CA, 90089 USA (e-mail: [email protected], [email protected]).

    • This work was supported in part by NSF.

such as the stances assumed, the variety and variation in communication intentions and the behavior of conversational partners. However, analogous to the compositional view of visemes in lip motion, there are elementary patterns that have been defined for body gestures, i.e., gesture phrases/units. In the gesture model proposed by Kendon [9], a gesture phrase defines the basic gesture element and a natural continuous gesture can be decomposed into multiple gesture phrases. There have been a few approaches for representing gesture dynamics [5] [10] [11], including the semantic pattern based description of body gestures, which also provides the basis for modeling gesture dynamics in this paper. These existing works however are mainly designed for short-term body gestures with rhythmically repeating patterns. Moreover, they only consider variations locally within each individual gesture sequence, which may induce subject-dependent characteristics and undermine the essential dynamic cues. Hence, robustly representing gesture dynamics, especially for long-term gesture sequences with high dynamical complexity, becomes another pressing need and is studied in this work.

The main objective of this work is to robustly model the dynamics of expressive body gestures occurring in long-term dyadic interactions. We propose a statistical framework based on high-level semantic gesture patterns. In this framework, we first construct a universal background model (UBM) using a Gaussian mixture model (GMM) to describe the global subject-independent variability of gesture features. Next, we employ a parallel HMM structure to segment gesture sequences and extract semantic gesture patterns. Further, we probabilistically fit the segments of each gesture sequence with the UBM, in order to select


highly probabilistic patterns for the sequence. The gesture dynamics of a sequence are represented as a statistical variation profile computed from the selected salient segments, and are described in a well-defined kernel space.

The advantages of our framework for robust gesture dynamics modeling are summarized as follows: 1) To address the high variability and complex structure of body gestures, each gesture sequence is described based on semantic gesture patterns. 2) To unify the information of all the gesture sequences, the gesture patterns of each sequence are fitted to a statistical UBM. This procedure helps to remove individual idiosyncrasies within a gesture sequence. 3) To capture the relationship between the modeled dynamics of two gesture sequences, we propose a kernel function based approach.

Our work is evaluated on the freely-available multimodal USC CreativeIT database that consists of goal-driven improvised interactions [12] [13]. It contains detailed full body Motion Capture (MoCap) data, providing a rich resource for studying body gestures during expressive interactions. We focus on hand gesture and head motion, which are expressive and informative in communication. Our model is evaluated on emotion recognition tasks. The experimental results show that the proposed UBM-based model outperforms other examined approaches in terms of recognition accuracy. We also observe that hand gesture carries more affect-specific information than rigid head motion.

The rest of the paper is organized as follows. We discuss related work on the gesture-affect relationship, especially on body gesture modeling for emotion recognition, in Section 2. The proposed framework for representing gesture dynamics is presented in Section 3. We introduce the multimodal CreativeIT database, the gesture feature extraction and emotion annotation in Section 4. We present the analysis of the extracted gesture patterns in Section 5. The baseline models are described in Section 6, followed by the experimental results of emotion recognition tasks in Section 7. This paper's conclusions and future work are given in Section 8.

2 RELATED WORK

Body gestures are used as an integral means of expression in human communication. There is an extensive literature studying the gesture-emotion relationship and showing that body gestures are as important as facial expressions in emotion conveyance. Wallbott analyzed the emotional content of acted body movements and postures using coding schemata, showing that body movements and body posture are indicative of specific emotions [6]. De Meijer also corroborated that different emotion categories can be inferred from the intensity and the types of body movements [7]. Mehrabian and Friar found that body orientation is an important indicator of the communication attitude of a participant towards one's interlocutor [14]. Researchers have also demonstrated that gestures are communicatively intended by speakers and express the underlying cognitive architecture in a conversation [15] [16].

The above-mentioned psychological studies on the gesture-emotion relation have inspired work on automatic emotion recognition from body expressions. Early work in this direction has focused on detecting emotions from

acted and stylized body movements and postures. For example, gait patterns have been analyzed with respect to the affective state of an individual in both categorical and dimensional emotion spaces [17]. Kapur et al. investigated low-level physical features of stylized body movements, e.g., velocities and accelerations of marker positions, for emotion discrimination. The low-level dynamics have been shown to be effective in distinguishing the four basic emotions of sadness, joy, anger and fear [18].

Since body gestures commonly occur in daily human communication, research efforts have also been devoted to exploring the potential of using interaction gestures for emotion recognition. Metallinou et al. have used body language information for automatically tracking the continuous emotional attributes of activation, valence and dominance of a participant over affective communication [19]. They describe the body language of a participant in terms of body movements and postures, such as hand velocities, head angles and body positions. Nicolaou et al. tracked the continuous human affect during a human-agent conversation by fusing shoulder movements, facial expressions and speech cues [20]. To consider the temporal dynamics of non-acted body gestures, a Recurrent Neural Network algorithm was employed for emotion recognition in the context of a video game [21].

In spite of the success of the low-level gesture dynamics for emotion detection, such descriptions are insufficient to capture the structure and dynamical cues of natural body gestures produced in long-term interactions, due to the sophisticated multi-scale nature of human gestures. To model the complex structure of body gestures, researchers have hence attempted to decompose the complex motion into simple isolated elements. Bernhardt and Robinson represented the dynamical cues of the knocking motion using motion primitives that are derived from k-means clustering [5]. For robust emotion recognition, the primitive-based motion dynamics are modeled in an individual-unbiased way by removing subject-specific characteristics. They further extended this framework for detecting emotions from natural action sequences [22]. Camurri et al. analyzed the affective states of dance sequences by developing mid-level features from the segmentation of the dance trajectories [23].

In order to obtain meaningful motion patterns, several approaches have been proposed for learning the primitives of human actions from motion sequences. For example, Levine et al. derived gesture subunits from motion data by detecting zero points of the angular velocity [24]. Based on these motion segments, they further applied Ward hierarchical clustering to identify the recurring motion subunits. A probabilistic PCA based algorithm has been proposed in [25] to segment motion data into distinct actions, under the assumption that a motion transition occurs when the distribution of the motion data changes. Zhou et al. proposed an unsupervised hierarchical framework that combines kernel k-means and a generalized dynamic time alignment kernel, for temporally segmenting and clustering multi-dimensional time series [26]. This framework has shown promising results in clustering a small number of human actions. However, like most kernel methods, this approach also suffers from high computation and storage complexity on large-scale data, due to the computation of an


n × n kernel matrix, where n is the length of a sequence. The high computation demand limits the applicability of this framework to relatively short time series, and hence it is computationally impractical for the model to operate on a large amount of long-term sequences, e.g., interaction-context sequences (often over 10,000 frames) [26]. To effectively process the gesture sequences from long-term interactions, a parallel HMM structure has recently been applied to extract gesture primitives for gesture dynamics modeling [10] [11]. The parallel HMM structure processes sequences sequentially and effectively. Its efficiency, effectiveness and flexibility make this model a suitable technique for gesture pattern identification. It has been successfully used for prosody-driven head motion animation [27] and music-to-dance synthesis [28].

In this work, our framework for body gesture modeling is based on the elementary gesture patterns identified from the parallel HMM model. In order to minimize the individual-dependent variations and to unify the information from all the gesture sequences, we construct a GMM-UBM model to describe the global gesture variability. Instead of representing dynamics locally within each sequence as most of the above-mentioned studies did, we align the patterns of each sequence with the subject-independent GMM-UBM model for robustly representing the dynamics. The GMM-UBM model has been widely applied for acoustic feature modeling in text-independent speaker verification [29] [30]. It captures inter-speaker variability and is viewed as a viable speaker model. Li et al. developed a GMM-UBM based face matching system for pose variant face verification [31]. Liu et al. proposed a framework also based on GMM-UBM for dynamical face recognition [32], which is similar in spirit to the approach adopted in the present paper. In their work, each video clip is described by uniformly sampled cuboids or segments. In our work, the gesture patterns are automatically identified, which provides semantic meanings as well as robust gesture descriptions. In addition, we consider more comprehensive gesture statistics for representing the dynamics. The relationship between the representations of dynamics is further described in a well-defined kernel function.

    3 FRAMEWORK OVERVIEW

Fig. 1 illustrates an overview of the framework that we propose for body gesture modeling. In the framework, a gesture sequence S is represented by the gesture features {f_1, f_2, ..., f_T}, where T is the length of S and f_t is a gesture feature vector at time t. First, we build a UBM to statistically describe the global variability profile of gesture features from different individuals. A GMM is employed to learn this background model. Each component of the learned GMM represents one universal variation mode of the gesture features at the frame level. We further use the individual gesture sequences to train a parallel HMM structure for identifying temporally recurring gesture patterns. Henceforth, each gesture sequence is partitioned into short segments corresponding to the gesture patterns. Finally, the dynamics of an entire gesture sequence are modeled by a statistical variation profile that is computed from the statistically salient segments with respect to each component of the UBM.

Fig. 2. The parallel HMMs for capturing gesture patterns. The number of branches, M, corresponds to the number of gesture clusters.

In the rest of this section, we present the construction of the background model of gesture features using GMM (Section 3.1), the details of gesture clustering using the parallel HMM structure (Section 3.2) and the description of the statistical gesture dynamics modeling (Section 3.3).

3.1 Universal Background Model Construction

To statistically describe the global variability profile of the gesture features from different subjects, we build a universal background model to represent the subject-independent distribution of gesture features. In this work, we use GMMs to learn the background model. GMMs have shown great success for universal background modeling in speaker verification [29] [30] and facial expression recognition [31] [32]. Each component of the GMM describes one class of universal variations of gesture features at the frame level. A GMM background model, Θ = {π_k, µ_k, Σ_k}_{k=1}^{K}, is constructed based on the feature vectors of gesture sequences {f_1, f_2, ..., f_T},

p(f | Θ) = ∑_{k=1}^{K} π_k N(f; µ_k, Σ_k),    (1)

where K is the number of mixtures, π_k, µ_k and Σ_k are the weight, mean vector and covariance matrix of the k-th component, and ∑_{k=1}^{K} π_k = 1. In this work, {Σ_k}_{k=1}^{K} are specified as diagonal covariance matrices. The GMM parameters Θ can be estimated based on the maximum likelihood criterion using Expectation Maximization (EM).
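As a rough illustration of this step, the following Python sketch fits a diagonal-covariance GMM on pooled frame-level gesture features with scikit-learn. The number of mixtures, the feature dimensionality and all variable names are illustrative assumptions; the paper does not specify K.

# Minimal sketch of UBM construction (Section 3.1), assuming frame-level
# gesture features have been pooled across all subjects and sequences.
# K, the feature dimension, and all names are illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(pooled_features, n_components=64):
    """Fit a diagonal-covariance GMM-UBM via EM, as in Eq. (1)."""
    ubm = GaussianMixture(
        n_components=n_components,   # K mixtures
        covariance_type="diag",      # diagonal Sigma_k
        max_iter=200,
        reg_covar=1e-6,              # numerical stability
        random_state=0,
    )
    ubm.fit(pooled_features)
    return ubm

# Example with stand-in data: 10,000 frames of a 24-D hand-gesture vector
# (4 joints x 6 angles/derivatives, matching Section 4.1).
frames = np.random.randn(10_000, 24)
ubm = train_ubm(frames)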

3.2 Extraction of Gesture Patterns

The elementary gesture patterns have not been well established quantitatively so far, due to the nature of the gesture structure: the high degree of variability across persons and contexts as well as along different temporal scales. In this work, we hence identify the recurring patterns in an unsupervised manner. We employ the parallel HMM model for the extraction of elementary phrases [27]. This model provides flexibility and efficiency in modeling the variations in the structure and durations of gesture patterns.

Fig. 1. An overview of the proposed framework for gesture dynamics modeling. Starting with a set of gesture sequences from the top of the diagram, we first build a universal background model (UBM) to capture the global gesture variability across individuals. We then employ a parallel HMM structure to extract elementary gesture patterns. Based on the constructed UBM and the segmentation model, a new gesture sequence can be segmented into gesture patterns, and the gesture dynamics are computed by probabilistically fitting the extracted patterns with the UBM. The fitting procedure helps to remove personal idiosyncrasies and to provide a robust dynamics representation.

The parallel HMM model, Λ, is composed of M parallel left-to-right HMMs {λ_m}_{m=1}^{M}, where each branch λ_m has Q states {s_{m,1}, s_{m,2}, ..., s_{m,Q}}, as shown in Fig. 2. M corresponds to the number of clusters, i.e., the number of gesture patterns. All the branches share the same starting and ending states s_s and s_e to ensure continuity at the boundaries between segments. We empirically select the number of states in each branch of the HMM model Λ as Q = 10, corresponding to a minimum gesture pattern duration of 10 frames (1/6 sec assuming 60 frames/sec). The sequence of gesture vectors S is used to train the model Λ. The segmentation and clustering are performed by maximizing the likelihood using Viterbi decoding:

{ε_l, m_l}_{l=1}^{L} = arg max_{ε_l, m_l} ∏_{l=1}^{L} P(ε_l | λ_{m_l}),    (2)

where {ε_1, ε_2, ..., ε_L} are the L gesture segments produced by the model Λ. Each segment ε_l, which is represented by a feature set F_{ε_l} = {f_{t_l}, f_{t_l+1}, ..., f_{t_{l+1}-1}}, is assigned to one of the M clusters with a cluster label m_l (m_l ∈ {1, 2, ..., M}). The segments ε_l associated with cluster labels m_l represent the recurring gesture patterns that are captured by the probabilistic structure Λ. We accordingly define the frame-level labels g_t based on this association, i.e., g_t = m_l if f_t ∈ ε_l. As a result, a sequence of gesture vectors S can be mapped into a sequence of cluster labels g = {g_1, g_2, ..., g_T}.
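The cluster-assignment part of Eq. (2) can be illustrated with the simplified sketch below. It assumes that M left-to-right branch HMMs have already been trained (here with hmmlearn's GaussianHMM) and that candidate segment boundaries are given; the joint Viterbi segmentation over the full parallel structure with shared start and end states is not reproduced here.

# Simplified sketch of segment-to-cluster assignment: each segment is
# labeled with the branch HMM that gives it the highest log-likelihood.
# Branch models and segment boundaries are assumed to be given.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def assign_segments(sequence, boundaries, branch_hmms):
    """sequence: (T, d) gesture features; boundaries: [(t_l, t_{l+1}), ...];
    branch_hmms: list of M trained GaussianHMM branch models.
    Returns a cluster label m_l for every segment epsilon_l."""
    labels = []
    for start, end in boundaries:
        segment = sequence[start:end]
        scores = [hmm.score(segment) for hmm in branch_hmms]  # log P(segment | lambda_m)
        labels.append(int(np.argmax(scores)))
    return labels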

3.3 Statistical Gesture Dynamics Modeling

The background model Θ obtained in Section 3.1 represents the global subject-independent distribution of gesture features, and the gesture patterns extracted in Section 3.2 provide a high-level semantic description of a gesture sequence. Given any gesture sequence, the probabilistic fitting of its gesture patterns to the background model could unify the information from all the sequences and provide robust dynamics characterization by removing subject-specific variations within the sequence. We therefore formulate the modeling of statistical gesture dynamics as follows.

As described in Section 3.2, a gesture sequence S is partitioned into short segments {ε_1, ε_2, ..., ε_L}, where ε_l is represented by a feature set F_{ε_l} = {f_{t_l}, f_{t_l+1}, ..., f_{t_{l+1}-1}}. From the k-th component in the background model, the probability, p^k_{ε_l}, of ε_l can be computed as,

p^k_{ε_l} = (1 / (t_{l+1} − t_l)) ∑_{t=t_l}^{t_{l+1}−1} π_k N(f_t; µ_k, Σ_k).    (3)

A segment with a higher probability p^k_{ε_l} is statistically salient within the sequence S. This salient gesture pattern reflects a subject's central gesture expression, which may play an important role in conveying one's internal mental state. Moreover, a salient pattern stands out through a statistical comparison with the universal model Θ, which unifies the information from all the sequences and removes the subject-specific idiosyncrasies. Hence, the highly probabilistic gesture patterns of a sequence could better provide both informative and robust characterization of gesture dynamics. For each sequence S, we consider the N salient segments with the highest probabilities with respect to the k-th Gaussian component. These selected segments form a new feature set F^k = {F^k_{k_1}, F^k_{k_2}, ..., F^k_{k_N}} associated with a probability set p^k = {p^k_{k_1}, p^k_{k_2}, ..., p^k_{k_N}}, where k_n indicates the segment with the n-th highest probability under the k-th GMM component.

To model the variations among the N salient patterns of S with respect to the k-th GMM component, we calculate the covariance matrix of the feature set F^k,

Cov^k = (1 / (N − 1)) ∑_{n=1}^{N} (F^k_{k_n} − F̄^k)(F^k_{k_n} − F̄^k)^T,    (4)

where F̄^k is the mean feature vector in F^k. The elements of Cov^k describe the variations and co-variations among individual gesture features within a sequence, i.e., characterizing the long-term statistical gesture changes. In addition, the associated probability set p^k captures the general likelihood of the salient patterns in S. The dynamics of a gesture sequence can therefore be represented by a statistical profile that consists of both the covariance and probability sets, {Cov^k}_{k=1}^{K} and {p^k}_{k=1}^{K}.
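A compact sketch of Eqs. (3) and (4) follows: score each segment under one UBM component, keep the N most probable segments, and summarize them with a covariance matrix. Representing each selected segment by its per-segment mean feature vector is an assumption made here to keep the example short.

# Sketch of Eqs. (3)-(4) using the scikit-learn UBM from Section 3.1.
import numpy as np
from scipy.stats import multivariate_normal

def segment_probability(segment, weight, mean, var):
    """Average weighted likelihood of one segment under component k, Eq. (3)."""
    densities = multivariate_normal.pdf(segment, mean=mean, cov=np.diag(var))
    return float(weight * np.mean(densities))

def covariance_profile(segments, ubm, n_salient=15):
    """Per-component covariance descriptors Cov^k and probability sets p^k."""
    cov_list, prob_list = [], []
    for k in range(ubm.n_components):
        probs = np.array([
            segment_probability(seg, ubm.weights_[k], ubm.means_[k], ubm.covariances_[k])
            for seg in segments
        ])
        top = np.argsort(probs)[::-1][:n_salient]                  # N most salient segments
        feats = np.stack([segments[i].mean(axis=0) for i in top])  # per-segment mean vectors (assumption)
        cov_list.append(np.cov(feats, rowvar=False))               # Eq. (4)
        prob_list.append(probs[top])
    return cov_list, prob_list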

The covariance descriptors, {Cov^k}_{k=1}^{K}, are symmetric positive definite (SPD) matrices which lie on a Riemannian manifold. We exploit the Log-Euclidean distance between two points on the manifold [33],

dist(Cov^k_i, Cov^k_j) = || log(Cov^k_i) − log(Cov^k_j) ||_F,    (5)

where log(·) is the matrix logarithm operator, and || · ||_F is the Frobenius norm. Given the eigenvalue decomposition of an SPD matrix, Cov = U Σ U^T, the matrix logarithm can be computed as,

log(Cov) = U log(Σ) U^T.    (6)

We further define the distance between the sets of covariance descriptors of sequences S_i and S_j as the sum of the individual covariance distances,

dist_cov(S_i, S_j) = ∑_{k=1}^{K} dist(Cov^k_i, Cov^k_j).    (7)

Accordingly, we define the distance between the sets of probability descriptors of S_i and S_j using the L2 norm,

dist_p(S_i, S_j) = ∑_{k=1}^{K} || p^k_i − p^k_j ||_2.    (8)

Both types of distance metrics in Eq. (7) and (8) measure the distance between the statistical dynamics of S_i and S_j. They can be further used to formulate a kernel function,

k(S_i, S_j) = exp{ −dist_p^2(S_i, S_j) / σ_p^2 − dist_cov^2(S_i, S_j) / σ_cov^2 }.    (9)

This kernel function, k(S_i, S_j), can be readily used to construct any kernelized classifier. In this work, we build an emotion classifier based on this kernel to predict the emotional state of a subject using the modeled gesture dynamics over an interaction.
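The kernel in Eqs. (5)-(9) can be sketched directly from the per-component descriptors of two sequences. The scale parameters and the small eigenvalue floor are placeholders, not values from the paper.

# Sketch of the Log-Euclidean kernel, Eqs. (5)-(9).
import numpy as np

def log_spd(cov):
    """Matrix logarithm of an SPD matrix via eigendecomposition, Eq. (6)."""
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvals = np.maximum(eigvals, 1e-10)          # guard against numerical issues
    return eigvecs @ np.diag(np.log(eigvals)) @ eigvecs.T

def sequence_kernel(covs_i, probs_i, covs_j, probs_j, sigma_cov=1.0, sigma_p=1.0):
    """k(S_i, S_j) from the per-component descriptors of two sequences."""
    d_cov = sum(np.linalg.norm(log_spd(ci) - log_spd(cj), ord="fro")
                for ci, cj in zip(covs_i, covs_j))               # Eqs. (5), (7)
    d_p = sum(np.linalg.norm(pi - pj)
              for pi, pj in zip(probs_i, probs_j))               # Eq. (8)
    return float(np.exp(-(d_p ** 2) / sigma_p ** 2
                        - (d_cov ** 2) / sigma_cov ** 2))        # Eq. (9)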

4 DATABASE DESCRIPTION

We use the USC CreativeIT database in this work, which is a freely-available multimodal database of dyadic theatrical improvisations [12] [13]. The interactions performed by the pairs of actors are either improvisations of scenes from theatrical plays or theatrical exercises where actors repeat sentences to express the interaction goals that feature specific emotions. The interactions were guided by a theater expert (professor/director), following the Active Analysis improvisation technique pioneered by Stanislavsky [34]. According to this technique, interactions are goal-driven; actors have predefined goals, e.g., to comfort or to avoid, which can drive and elicit natural realizations of emotions as well as expressive speech and body gesture behavior.

This database contains detailed full body Motion Capture (MoCap) data of the two interacting participants during a dyadic interaction. A Vicon motion capture system with 12 cameras was used to capture the (x, y, z) positions of the 45 markers over each actor at 60 fps, as shown in Fig. 3. There are 50 dyadic interactions in total performed by 16 actors (9 female), resulting in 100 actor-recordings. Each interaction has an average length of about 3 minutes.

    Fig. 3. The positions of the 45 Motion Capture markers over an actor.

4.1 Gesture Feature Extraction

After capturing the motion data, we manually mapped the 3D locations of the markers to the angles of different human body joints using MotionBuilder [35]. The mapped angles are then used as body gesture features. The joint angles are preferred over the 3D coordinates to describe gestures, because they are more suitable for animation purposes [24] [27] and the subject-dependent gesture characteristics (e.g., the arm length) have been removed through the mapping process. In this work, we focus on two types of body gestures which are expressive and emotion-informative in communication: hand gesture and head motion [36]. Figure 4 illustrates the Euler angles of the hand (arm and forearm) and head joints in the x, y and z directions.

To incorporate the temporal dynamics, we augment the gesture features with their 1st order derivatives. The gesture feature vector f^n_t of the joint n at frame t is:

f^n_t = [θ^n_t, φ^n_t, ψ^n_t, Δθ^n_t, Δφ^n_t, Δψ^n_t]^T,    (10)

where θ^n_t, φ^n_t and ψ^n_t are the Euler angles of the joint n respectively in the x, y and z directions (see Fig. 4), and Δθ^n_t, Δφ^n_t and Δψ^n_t are their corresponding 1st order derivatives. The hand gesture features include the information of the four hand joints, i.e., the left and right arms, as well as the left and right forearms. The hand gesture is then represented by [f^{left arm}_t; f^{right arm}_t; f^{left forearm}_t; f^{right forearm}_t]. Similarly, the head motion is related to only one joint, and is represented by a 6-D feature vector f^{head}_t.
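A short sketch of the feature construction in Eq. (10) is given below: Euler angles per joint, augmented with first-order differences and concatenated over joints. Array shapes, joint ordering and the boundary handling of the first frame are assumptions made for illustration.

# Sketch of Eq. (10) and the per-gesture concatenation.
import numpy as np

def joint_features(angles):
    """angles: (T, 3) Euler angles (theta, phi, psi) of one joint.
    Returns (T, 6) features [angles, first-order derivatives]."""
    deltas = np.diff(angles, axis=0, prepend=angles[:1])   # Delta at t=0 set to 0 (assumption)
    return np.hstack([angles, deltas])

def hand_features(left_arm, right_arm, left_forearm, right_forearm):
    """Concatenate the four hand-joint feature blocks into a (T, 24) matrix."""
    return np.hstack([joint_features(a)
                      for a in (left_arm, right_arm, left_forearm, right_forearm)])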

Fig. 4. The illustration of the Euler angles of the hand (arm and forearm) and head joints.

4.2 Emotion Annotation

We collected annotations of the global emotional content in each performance. The emotional content was rated in terms of perceived activation and valence for each actor on a 5-point scale by three or four annotators. Specifically, each annotator provided two global ratings for each actor in a recording. Each rating summarizes the overall perception of activation and valence. Rating 1 denotes the lowest possible activation level and the most negative valence level. Rating 5 indicates the highest possible activation level and the most positive valence level.

The consistency of the annotations was examined by computing the Cronbach's α coefficient over the global activation and valence ratings from all the different annotators. The Cronbach's α coefficient measures the internal consistency, i.e., how closely related a group of raters are. It is widely used in the cognitive sciences. The value of α varies from 0 to 1. A higher value indicates a higher level of inter-rater agreement. Overall, we notice that the annotator consistency is at an acceptable moderate level (0.72 for activation and 0.78 for valence).

Fig. 5. Resulting emotion classes in the activation-valence space for C = 3 and C = 4.

The rating values of activation and valence of each actor in an interaction are calculated by averaging the annotations across annotators. To provide richer and more expressive emotion varieties, we jointly consider activation and valence to create C emotional clusters in the activation-valence space using the k-means algorithm. We consider clusters with C = 3 and C = 4, which are conventional choices for emotion classification in the activation-valence space empirically established in the previous literature [37] [38] [39]. In addition, C = 3 and C = 4 are chosen to ensure there are sufficient data samples in each class for the emotion model training. Fig. 5 shows the corresponding clustering results. The attribute-based emotion labels have been shown to be related to the categorical emotions [37] [38] [39]. For example, the three emotion clusters generally represent categories of happiness or excitement, anger, and sadness or neutrality.
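The class construction can be sketched as follows: average the per-annotator activation/valence ratings and cluster the resulting 2-D points with k-means. The rating array layout is an assumption for illustration.

# Sketch of the emotion-class construction from averaged ratings.
import numpy as np
from sklearn.cluster import KMeans

def emotion_classes(ratings, n_classes=3):
    """ratings: (n_actor_recordings, n_annotators, 2) activation/valence on 1-5.
    Returns one class label per actor-recording."""
    mean_ratings = ratings.mean(axis=1)                    # average across annotators
    kmeans = KMeans(n_clusters=n_classes, n_init=10, random_state=0)
    return kmeans.fit_predict(mean_ratings)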

5 ANALYSIS OF GESTURE PATTERNS

An essential component of our framework is to extract meaningful gesture patterns as described in Section 3.2. In this section, we investigate the effect of the number of gesture patterns, M, on the dynamics modeling. We then verify the validity of the derived clusters for representing semantic gesture patterns.

    5.1 Segmentation of Gesture Patterns

The number of parallel HMMs, M, i.e., the number of gesture patterns, is an important parameter which affects gesture dynamics modeling. A small M provides a coarse-grained gesture description, while a large M leads to a noisy gesture representation. In order to identify a suitable number of gesture patterns, we employ a bigram model to capture the dynamical evolution of a gesture sequence with respect to different cluster numbers. A high-quality bigram model indicates an appropriate M.

A bigram model is a first-order Markov model, popular in modeling word sequences (here, sequences of gesture labels g) in language processing. We use the sequences g obtained in Section 3.2 to calculate the transition (bigram) probabilities of the gesture labels within a sequence. Given g_t at time t and g_{t−1} at the previous time t − 1, P(g_t | g_{t−1}) defines the bigram probability that the gesture g_t occurs if the previous gesture g_{t−1} has been observed. We use perplexity to evaluate the computed gesture bigram model. Perplexity is a popular measure to evaluate language (word sequence) models [40]. It quantifies the confusion of the current state, i.e., the average number of possible successors, from an information theoretic perspective. A lower perplexity indicates a better bigram model, and a higher perplexity implies more randomness in the derived sequences of gesture labels. The perplexity ppl of a bigram model is defined as:

ppl = P(g)^{−1/|g|},    (11)

where |g| is the length of a gesture sequence. The probability P(g) is computed using the bigram model as:

P(g) = P(g_1) ∏_{t=2}^{|g|} P(g_t | g_{t−1}).    (12)

However, this measure depends on the vocabulary size (the number of clusters M), i.e., a larger M leads to a higher perplexity. To alleviate this dependency, we apply the normalized perplexity p̄pl by taking the ratio between ppl and M [41],

p̄pl = ppl / M.    (13)

Fig. 6. Normalized perplexity of the bigram model of hand gesture or head motion varying with the number of clusters. The decreasing rate for hand gesture slows down from M = 50, and the changing rate for head motion falls off from M = 40.
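A compact sketch of the bigram model and the normalized perplexity of Eqs. (11)-(13) is given below. Add-one smoothing is an assumption made here so that unseen transitions stay finite; the paper does not state which smoothing, if any, was used.

# Sketch of the bigram model and normalized perplexity, Eqs. (11)-(13).
import numpy as np

def normalized_perplexity(label_seq, n_clusters):
    """label_seq: 1-D array of gesture cluster labels g. Returns ppl / M."""
    counts = np.ones((n_clusters, n_clusters))            # add-one smoothing (assumption)
    for prev, cur in zip(label_seq[:-1], label_seq[1:]):
        counts[prev, cur] += 1
    bigram = counts / counts.sum(axis=1, keepdims=True)   # P(g_t | g_{t-1})

    unigram = np.bincount(label_seq, minlength=n_clusters) / len(label_seq)
    log_p = np.log(unigram[label_seq[0]])                  # log P(g_1)
    for prev, cur in zip(label_seq[:-1], label_seq[1:]):
        log_p += np.log(bigram[prev, cur])                 # Eq. (12) in the log domain

    ppl = np.exp(-log_p / len(label_seq))                  # Eq. (11)
    return float(ppl / n_clusters)                         # Eq. (13)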

We extract gesture patterns using the number of clusters (M) ranging from 10 to 100. A gesture bigram model is then learned with respect to each cluster number. The normalized perplexity in Eq. (13) is adopted to evaluate each model. Fig. 6 (the square marker) shows the normalized perplexity of the bigram model of hand gesture as a function of M. Overall, we can observe that p̄pl decreases as the number of clusters increases. Specifically, p̄pl drops rapidly with an increasing M initially, then the decrease slows down for M = 50 or above. This result suggests that the transition dynamics of hand gesture patterns can be adequately captured by the bigram model computed using 50 clusters. Higher cluster numbers bring only minor variations, and even noise, to the computed structure, while greatly increasing the computational cost. Fig. 6 also presents the p̄pl-M relationship for head motion (the circle marker). We can observe a decreasing trend similar to that for hand gesture. The p̄pl for head motion, however, converges from M = 40. Fewer head motion varieties are found compared to the hand gesture categories, because of the smaller degree of freedom for head motion which leads to less variability (see Fig. 4). In the experiments that follow, we fix the number of hand gesture patterns at 50 and the number of head motion patterns at 40 accordingly.

    5.2 Visualization of Gesture Patterns

As described in Section 3.2, the gesture patterns are extracted in an unsupervised clustering manner. Herein, we aim at validating the effectiveness of the derived clusters for representing the elementary gesture patterns. Since some extracted patterns could be as short as 1/6 sec (see Section 3.2), it is impossible for human observers to manually validate them by watching these video clips. To approach this problem in a more systematic way, we visualize the variations of gesture segments in each cluster in a low-dimensional space. We mainly consider the top six clusters which contain the largest number of gesture segments (the average number

of segments in each of the top six clusters is around 350, while the mean segment number in each of the remaining clusters is around 80). To this end, we first extract statistical functionals of each gesture feature within a segment, e.g., mean, standard deviation, maximum, minimum, median, range, kurtosis and skewness. Hence, the variations of gesture segments are represented by vectors of gesture statistics. We further apply the parametric t-SNE, an unsupervised dimensionality reduction technique, to map the high-dimensional gesture variation space to a 2-dimensional latent space. The parametric t-SNE learns the parametric mapping by optimally preserving the local data structure in the low-dimensional latent space [42]. Fig. 7 visualizes the 2-D representations of the segment-level gesture dynamics in the top 6 clusters with respect to hand gesture and head motion. In general, the variations of gesture segments in different clusters are clearly separated. Such distinguishability between distinct clusters visually verifies the validity of the derived clusters for representing semantic gesture patterns.
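The visualization step can be sketched as follows: statistical functionals per segment, then a 2-D embedding. Standard t-SNE from scikit-learn is used here as a stand-in for the parametric t-SNE of [42], so this is only an approximation of the mapping described above.

# Sketch of segment-level functionals and a 2-D embedding for plotting.
import numpy as np
from scipy.stats import kurtosis, skew
from sklearn.manifold import TSNE

def segment_functionals(segment):
    """Per-feature functionals of one (n_frames, feature_dim) segment."""
    return np.concatenate([
        segment.mean(axis=0), segment.std(axis=0),
        segment.max(axis=0), segment.min(axis=0),
        np.median(segment, axis=0), np.ptp(segment, axis=0),   # range
        kurtosis(segment, axis=0), skew(segment, axis=0),
    ])

def embed_segments(segments):
    """Map segment statistics to 2-D points (standard t-SNE as a stand-in)."""
    stats = np.stack([segment_functionals(s) for s in segments])
    return TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(stats)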

In addition, we manually examined the gesture segments (longer than 1 sec) in the major six clusters with respect to hand gesture and head motion by watching the video clips. The frequent hand gesture patterns include moving both hands to the front, pointing with one's left hand, pointing with one's right hand, crossing arms in front of the chest, opening arms and waving arms by the side. The common head motion patterns are raising one's head, lowering one's head, keeping still, turning to the left slightly, turning to the right slightly, and turning to the left mightily. Some of these patterns are congruent with the emotionally expressive gestures identified in social psychology studies [6] [7]. For example, Wallbott found that lateralized hand movements are frequent during the anger emotion and moving the head downward is most typical for the disgust emotion [6].

Fig. 7. Visualization of the dynamics of gesture segments of the top six clusters in the 2-dimensional latent space for hand gesture (left) and head motion (right). Each circle represents the dynamics of one gesture segment, and the colors of the circles indicate the corresponding clusters.

6 BASELINE MODELS

We evaluate our UBM-based method for gesture dynamics modeling by comparing it with three baselines on automatic emotion recognition tasks. Two of the baselines are also developed based on the gesture patterns from Section 3.2.

6.1 Low-level Physical Dynamics

The relationship between low-level physical dynamic cues and emotions has been extensively investigated by researchers. Such low-level dynamics have demonstrated their effectiveness for emotion recognition in some simple scenarios [18] [36] [43] [44]. A gesture feature vector f in our work (see Section 4.1) is represented by a series of 3D joint angles and their 1st order derivatives (velocities). To compute the low-level dynamics, we also include their 2nd order derivatives to describe acceleration. At the gesture sequence level, we consider the mean, standard deviation, kurtosis, and skewness of the joint angles, velocities and accelerations.
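This baseline can be sketched briefly: add second-order derivatives (accelerations) and pool sequence-level functionals over the angles, velocities and accelerations. Shapes are illustrative; the functional set follows the text above.

# Sketch of the low-level physical dynamics descriptor.
import numpy as np
from scipy.stats import kurtosis, skew

def low_level_descriptor(angles):
    """angles: (T, D) joint angles of a whole sequence."""
    velocity = np.gradient(angles, axis=0)        # 1st order derivatives
    acceleration = np.gradient(velocity, axis=0)  # 2nd order derivatives
    feats = np.hstack([angles, velocity, acceleration])
    return np.concatenate([feats.mean(axis=0), feats.std(axis=0),
                           kurtosis(feats, axis=0), skew(feats, axis=0)])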

6.2 Markov-based Dynamics

In Section 5.1, we apply the bigram ("language") model to evaluate how the transition dynamics of gesture sequences depend on the number of clusters. Herein, we employ these Markov-based evolution dynamics as one of our baselines. Besides the bigram dynamics, we compute the unigram from each sequence of gesture labels. A unigram describes a probability (or frequency) distribution of clusters within a sequence, and a bigram captures the local dependency between adjacent gesture events. We further reduce the dimensionality of the bigram features using Principal Component Analysis (PCA) by preserving 90% of the total variance. Such Markov-based dynamics have also been explored for attitude recognition [10].
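A minimal sketch of this baseline is given below: unigram and flattened bigram descriptors from a sequence of cluster labels, with PCA keeping 90% of the variance. Fitting PCA across all training sequences, and the normalization of the bigram counts, are assumptions for illustration.

# Sketch of the Markov-based baseline features.
import numpy as np
from sklearn.decomposition import PCA

def unigram_bigram(label_seq, n_clusters):
    """Return the unigram (M,) and flattened bigram (M*M,) features of g."""
    unigram = np.bincount(label_seq, minlength=n_clusters) / len(label_seq)
    bigram = np.zeros((n_clusters, n_clusters))
    for prev, cur in zip(label_seq[:-1], label_seq[1:]):
        bigram[prev, cur] += 1
    bigram /= max(len(label_seq) - 1, 1)
    return unigram, bigram.ravel()

def reduce_bigrams(bigram_matrix):
    """bigram_matrix: (n_sequences, M*M). Keep 90% of the total variance."""
    return PCA(n_components=0.90).fit_transform(bigram_matrix)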

6.3 Graph-based Dynamics

A graph-based framework for modeling gesture dynamics has been proposed in [11]. In this framework, an undirected graph is constructed for each sequence with each graph node representing a gesture segment. The graph Fourier transform (GFT) is subsequently applied to produce a description of the gesture variability within each sequence. Similarly to the classic Fourier transform, the graph-based description is represented at different frequencies. Low-frequency representations describe the smoothness of a gesture sequence, whereas the high-frequency ones capture the oscillations. Frequencies are further grouped into low- and high-frequency subbands. Similar statistical functionals, such as mean, median, maximum, or minimum, are extracted from the graph-based representations in each subband. Such statistical features in the low- and high-frequency subbands define the graph-based dynamics. As shown in [11], compared to Markov-based dynamics, graph-based measures can better capture long-term gesture variability.
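A rough sketch of a graph Fourier transform over segment-level features is shown below, in the spirit of this baseline. The graph topology used here (a simple path graph connecting consecutive segments) is an assumption; [11] may construct the graph differently.

# Rough GFT sketch over segment-level features (graph topology assumed).
import numpy as np

def path_graph_gft(segment_features):
    """segment_features: (n_segments, d), one row per gesture segment.
    Returns (graph frequencies, GFT coefficients)."""
    n = segment_features.shape[0]
    adjacency = np.zeros((n, n))
    idx = np.arange(n - 1)
    adjacency[idx, idx + 1] = adjacency[idx + 1, idx] = 1.0   # consecutive segments
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    freqs, basis = np.linalg.eigh(laplacian)       # graph frequencies / eigenvectors
    coeffs = basis.T @ segment_features            # GFT of each feature dimension
    return freqs, coeffs

# Low- and high-frequency subband statistics (e.g., mean, median, max, min
# of the coefficients) can then be pooled as the baseline descriptor.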

7 EMOTION RECOGNITION EXPERIMENTS

In this section, we evaluate our method on emotion recognition tasks. We conduct two experiments: intra-subject emotion evaluation, i.e., classifying the emotion label (see Section 4.2) of an actor over an interaction using one's own gesture information (hand or head gesture); and inter-subject emotion evaluation, i.e., classifying the emotion label of an actor using the gesture information of one's interaction partner. The first experiment focuses on showing the effectiveness of our method for gesture dynamics modeling; and the second one aims at demonstrating the complementary nature of the cross-subject gesture dynamics captured in the UBM-based approach. We use the leave-one-actor-out scheme, i.e., 16-fold cross validation since there are 16 actors in the database (see Section 4). We report the performance averaged over all the folds. The classification experiments are performed using the multi-class SVM classifier with the defined kernel in Eq. (9) for the UBM-based model and with an RBF kernel for the three baselines. In each of the 16 folds, the parameters of the SVMs are tuned exclusively on the training set by leaving one actor out. Specifically, the soft margin parameter c of the SVMs is evaluated at 0.0001, 0.001, 0.01, 0.1 and 1. The scale parameters of the kernels, i.e., σ_cov and σ_p in the UBM-based method and σ in the baseline methods, are tuned as 2^q, where q ∈ {−5, −4, ..., 4, 5}. In the Markov-based method, we investigated the performance using the unigram, the bigram and their combination. The unigram features always perform the best. Similarly, in the Graph-based approach, the performance using low- and high-frequency descriptions as well as their combination was examined. The best results were achieved by the high-frequency features. In addition, we examine the performance of the UBM-based method respectively using the parallel HMM structure and the aligned component analysis (ACA) [26] for the extraction of gesture patterns. In order to perform the ACA approach on the long-term gesture sequences, we reduce the number of frames in each sequence following the technique in [26]. To be consistent with the HMM structure, we use 50 hand gesture patterns and 40 head motion patterns in the ACA approach. The maximum length of a segment in ACA is set to 200, which is also the maximum segment length obtained using the HMM structure.

TABLE 1
Intra-subject accuracies (%) for recognizing 3-class (Chance: 33.3%) and 4-class (Chance: 25%) emotions using hand gesture. N is the number of salient gesture patterns used in the UBM-based approach.

Method                      C = 3            C = 4
Baseline    Low-level       55.5             30.7
            Markov-based    47.7             39.0
            Graph-based     54.3             45.4
UBM-based   ACA             59.0 (N = 13)    47.1 (N = 9)
            HMMs            63.7 (N = 13)    48.9 (N = 15)

TABLE 2
Intra-subject accuracies (%) for recognizing 3-class (Chance: 33.3%) and 4-class (Chance: 25%) emotions using head motion. N is the number of salient gesture patterns used in the UBM-based approach.

Method                      C = 3            C = 4
Baseline    Low-level       45.8             35.8
            Markov-based    51.8             37.3
            Graph-based     56.5             41.2
UBM-based   ACA             58.4 (N = 5)     46.4 (N = 5)
            HMMs            61.0 (N = 17)    50.5 (N = 7)
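The cross-validation protocol described above (a multi-class SVM on the precomputed kernel of Eq. (9) under leave-one-actor-out evaluation) can be sketched as follows. The fixed soft-margin value is a placeholder, and the per-fold parameter tuning on the training actors is omitted for brevity.

# Sketch of leave-one-actor-out evaluation with a precomputed kernel.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneGroupOut

def leave_one_actor_out_accuracy(kernel_matrix, labels, actor_ids, c=0.1):
    """kernel_matrix: (n, n) values of k(S_i, S_j); labels: emotion classes;
    actor_ids: actor identity per sample, used as the held-out group."""
    accuracies = []
    for train, test in LeaveOneGroupOut().split(labels, labels, groups=actor_ids):
        clf = SVC(C=c, kernel="precomputed")
        clf.fit(kernel_matrix[np.ix_(train, train)], labels[train])
        preds = clf.predict(kernel_matrix[np.ix_(test, train)])
        accuracies.append(np.mean(preds == labels[test]))
    return float(np.mean(accuracies))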

7.1 Experimental Results

Intra-Subject Emotion Recognition

Table 1 presents the results of classifying the emotion label of a participant into three and four emotional clusters using one's own hand gesture over an interaction. For the 3-cluster classification, we obtain the best baseline accuracy of 55.5% when using the low-level dynamic features. Our method (HMMs) improves the performance to 63.7%. For the 4-cluster classification, the performance has been upgraded to 48.9% by the UBM-based model (HMMs), from the best baseline result of 45.4% with the Graph-based method. Similar results using head motion are shown in Table 2. The UBM-based model (HMMs) achieves the best performance of 61.0% and 50.5% for the 3-cluster and 4-cluster classification respectively.

The experimental results first show that the statistical dynamics derived by the UBM-based model outperform the baselines in all the cases. Our framework explicitly selects statistically salient patterns by fitting each sequence to the subject-independent variability model. Therefore, the statistical variation profile that is computed among the salient patterns can robustly characterize gesture variations by excluding individual idiosyncrasies and concentrating on the essential dynamics within a sequence. Furthermore, hand gesture generally exhibits a higher discriminative power in distinguishing distinct emotions, compared to rigid head motion. This may be due to the greater degree of hand gesture variability as analyzed in Section 5.1. The richer expressiveness of hand gesture provides more information for discriminating distinct emotion categories. In contrast, the lower degree of head motion variability could restrict the emotion expression to some extent. Compared to hand gesture, it is more likely that the same head motion is used for expressing different emotions, which may bring confusion for emotion discrimination. We can also observe that the Graph-based descriptors mostly outperform the other two baseline features using either hand gesture or head motion, which may result from the fact that the Graph-based model can better capture the long-term sequence variations. Note that the Markov-based and Graph-based baselines as well as the UBM-based model are all based on the high-level gesture patterns derived in Section 3.2. They generally exceed the low-level physical dynamical cues in terms of recognition performance. One implication might be that the low-level physical dynamics are not sufficient to capture the great spatial-temporal variability of human gestures, especially in a long-term interpersonal interaction.

    Inter-subject Emotion Recognition

Tables 3 and 4 present the results of classifying the emotion label of a subject using the hand gesture or head motion of one's conversational partner. Generally, we can observe a degraded recognition performance using cross-subject cues, compared to using one's own. However, all the methods show a certain level of effectiveness in cross-subject emotion recognition. This indicates the complementary nature of cross-subject gesture behavior, i.e., the gesture behavior of an interacting individual provides information about the emotional state of the other to some extent, due to the inherent coupling of the dyad's mental and cognitive states during an interaction. In this experiment, our method still outperforms the baselines in all the cases. For example, it (HMMs) achieves the best accuracy of 60.6% and 48.8% using hand gesture respectively in the 3-cluster and 4-cluster classification. This improvement suggests that the gesture dynamics from the UBM-based model are informative of both intra-subject and cross-subject emotional states.

TABLE 3
Inter-subject accuracies (%) for recognizing 3-class (Chance: 33.3%) and 4-class (Chance: 25%) emotions using hand gesture. N is the number of salient gesture patterns used in the UBM-based approach.

Method                      C = 3            C = 4
Baseline    Low-level       47.6             38.8
            Markov-based    45.4             36.6
            Graph-based     57.3             43.5
UBM-based   ACA             57.6 (N = 7)     45.0 (N = 5)
            HMMs            60.6 (N = 5)     48.8 (N = 5)

TABLE 4
Inter-subject accuracies (%) for recognizing 3-class (Chance: 33.3%) and 4-class (Chance: 25%) emotions using head motion. N is the number of salient gesture patterns used in the UBM-based approach.

Method                      C = 3            C = 4
Baseline    Low-level       43.1             36.4
            Markov-based    44.3             38.6
            Graph-based     56.6             43.6
UBM-based   ACA             56.5 (N = 3)     43.2 (N = 7)
            HMMs            57.7 (N = 5)     44.0 (N = 5)

It is interesting to observe that the baseline models in some cases achieve even higher performance in inter-subject emotion recognition compared to that in the intra-subject tasks. On the one hand, the better performance in the inter-subject tasks supports the existence of interpersonal coordination of body gestures during dyadic interactions. On the other hand, this observation may also reveal a weakness in the baseline approaches for body gesture modeling. The baseline representations contain both intra-subject and cross-subject characteristics simultaneously. For example, the same low-level physical dynamics are applied in each emotion recognition task. As a result, the two types of information are mutually influenced such that the modeled dynamics are even more informative regarding the interlocutor's state than regarding one's own. In contrast, the UBM-based method selects gesture patterns separately with respect to each evaluation task. The task-specific selection procedure may help disentangle the intra-subject and cross-subject factors, resulting in task-informative dynamical representations.

In both the intra-subject and inter-subject experiments, the performance of the UBM-based method using ACA is lower, compared to using the HMM structure. This result implicitly demonstrates the effectiveness of the parallel HMM model for extracting semantic gesture patterns. In contrast to the parallel HMM model, ACA performs segmentation and clustering separately for each individual sequence, which may generate sequence-specific rather than generic gesture segments. Moreover, some important dynamical cues may be removed in the down-sampling process before applying ACA.


7.2 The Effect of Salient Gesture Patterns

In the UBM-based framework, there is one key parameter impacting gesture dynamics modeling, i.e., the number of salient gesture segments, N, chosen for dynamics construction. Investigating the influence of N on conveying the emotional state of an individual could help us understand the way that human annotators summarize the locally perceived gesture events to produce an overall emotional judgement about an interaction. The study of the perception mechanism of annotators is essential for behavioral science, where human assessment is the main approach for various research analyses [45] [46]. Herein, we examine how the emotion recognition performance is related to N.

Fig. 8 presents the mean intra-subject recognition accuracy in relation to the number of selected segments N, using hand gesture and head motion respectively. We can observe that the performance generally increases in the beginning and then decreases as N rises. A better performance is usually achieved when N is around 15. This changing trend indicates that a few local salient events, rather than the entire gesture sequence, are sufficiently informative to trigger the human summarization of the global emotion perception.

Fig. 9 presents the mean inter-subject recognition accuracy in relation to the number of selected segments N, using hand gesture and head motion respectively. In contrast to Fig. 8, the inter-subject performance generally increases more rapidly in the beginning and starts falling from a relatively smaller value of N. The best performance is usually achieved at around N = 5. It is interesting to observe that fewer salient segments are needed to summarize the global rating of the cross-subject emotion, compared to those used for abstracting the intra-subject emotion. Published studies have already shown that individuals tend to adjust their communication behavior, such as speech and lexical content, by leveraging the mental state and the behavior of one's conversational partner [47] [48] [49] [50], which is also validated by the experimental results in Section 7.1. This observation further brings us the insight that the adaptation of one's body gestures to the emotional states of the corresponding conversational partner may occur occasionally, instead of frequently and continuously, in an interaction.

8 CONCLUSION AND FUTURE WORK

In this work, we proposed a statistical framework for robustly modeling body gesture dynamics in interpersonal interactions. The proposed framework is composed of three stages. First, we construct a universal background model (UBM) using a Gaussian mixture model (GMM) to represent subject-independent gesture variability. Next, each gesture sequence is described as a concatenation of semantic gesture patterns using a parallel HMM structure. We then fit the segments of each gesture sequence to the UBM in order to select statistically prominent gesture patterns for the sequence. The dynamics of each gesture sequence are represented by a statistical variation profile computed from the prominent segments, and are further described in a well-defined kernel space. The framework is flexible and general; each of its components could be individually modified to satisfy the needs of more complex tasks.

Fig. 8. The intra-subject recognition performance varying with the number of selected segments N. [Figure: four panels: Hand gesture (C = 3), Hand gesture (C = 4), Head motion (C = 3), Head motion (C = 4); x-axis: the number of selected segments N; y-axis: intra-subject recognition accuracy.]

Fig. 9. The inter-subject recognition performance varying with the number of selected segments N. [Figure: four panels: Hand gesture (C = 3), Hand gesture (C = 4), Head motion (C = 3), Head motion (C = 4); x-axis: the number of selected segments N; y-axis: inter-subject recognition accuracy.]

For example, more advanced techniques, such as deep neural networks [51], could be used for the global gesture variability modeling when a large amount of data is available.
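As a loose illustration of that suggestion (an assumed substitution, not the authors' method), the sketch below replaces the GMM-UBM score with the reconstruction error of a small autoencoder trained on pooled frames, implemented here with scikit-learn's MLPRegressor for brevity; all names and sizes are invented for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Assumed substitution: an autoencoder's reconstruction error plays the role
# of the (negative) UBM likelihood when ranking gesture segments.
rng = np.random.default_rng(2)
frames = rng.normal(size=(5000, 12))                 # pooled frame-level features
autoenc = MLPRegressor(hidden_layer_sizes=(32, 4, 32), max_iter=300, random_state=0)
autoenc.fit(frames, frames)                          # train to reconstruct its input

def segment_score(segment):
    """Lower reconstruction error ~ better fit to the global gesture model."""
    recon = autoenc.predict(segment)
    return -np.mean((recon - segment) ** 2)

segments = [rng.normal(size=(30, 12)) for _ in range(40)]
ranked = sorted(segments, key=segment_score, reverse=True)[:15]   # top-15 salient segments
print(len(ranked))
```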

We evaluated our model in emotion recognition experiments and compared it with three baseline models. We conducted two sets of experiments: intra-subject and inter-subject emotion evaluation. In the experiments, we considered two types of expressive gestures: head motion and hand gesture. According to the experimental results, the proposed UBM-based framework shows superiority over the baseline models in all cases. This could be attributed to the fitting of each gesture sequence to the UBM, which unifies the information of different sequences and concentrates the essential dynamics.


The statistical fitting may robustly characterize gesture variations by removing individual-specific idiosyncrasies within a sequence. We also observed that the gesture-pattern-based models generally outperformed the low-level physical gesture dynamics. Although human gestures are highly variable and complex in terms of both spatial and temporal structure, this observation corroborates the flexibility and advantage of representing a complex gesture sequence with elementary gesture patterns. In addition, hand gesture demonstrates a greater ability to express emotions than head motion. As shown in Fig. 4, head motion is determined by three degrees of freedom, while hand gesture is conditioned on 12 independent variables. The higher degree of hand gesture variability may introduce richer expressiveness during communication. Inspired by this finding, future work on expressive gesture animation could focus especially on synthesizing expressive hand gestures. Furthermore, it is interesting to observe that our framework is also effective in the cross-subject emotion evaluation, owing to the complex behavioral coordination between the interacting partners.

Analysis of the relationship between the emotion recognition performance and the number of salient gesture patterns sheds light on the underlying structure of the human summarization process for global emotion perception. We found that better recognition performance can be achieved with a few gesture patterns than with the entire gesture sequence. Hence, a few local salient events are sufficiently informative to drive the human summarization of the global emotion perception over an interaction. Another interesting observation is that fewer salient segments are needed to summarize the cross-subject emotion than to abstract the intra-subject emotion. Since cross-subject emotion evaluation implies the behavioral adaptation of an individual towards the emotional state of the conversational partner, this observation suggests that such adaptation may occur sporadically, rather than frequently and continuously, in an interaction.

It is worth noting that the recognition performance in the experiments may not be as high as one would expect for typical acted interactions. However, unlike other acted data [18] [22], the design of the CreativeIT database is based on the theatrical improvisation technique of Active Analysis pioneered by Stanislavsky. The key element of Active Analysis is that actors keep a verb (e.g., to persuade or to approach) in mind, which drives their actions during the performance. As a result, different communication manifestations, such as emotions, attitudes, speech and body gestures, are naturally elicited through the course of the interaction. The acted interactions in our database are therefore closer to natural interpersonal communication. The elicited naturalness of the actors' behavior makes body gesture modeling more difficult and hence leads to lower performance than typically expected.

The proposed framework is especially suitable for representing long-term body gestures with high dynamical complexity, such as the body gestures occurring in the interaction context of our case. However, it is also applicable to modeling the dynamics of general body gestures occurring in any other context.

As a future direction, it would be interesting to extend our model specifically for body gestures in interpersonal communication by incorporating interaction-context factors. For example, we could simultaneously consider the dynamics of the interlocutor while modeling the body gestures of an interaction participant.

Another future direction based on the proposed model could focus on analyzing the role of different body parts in expressing emotions. This direction could be approached by studying the emotion recognition performance with the dynamics of joint-body (e.g., head-hand) gesture patterns or with combined dynamics of different body parts, which could aid the development of automatic emotion recognition systems as well as human-machine interfaces.

The effectiveness of body gestures for emotion recognition implies the possibility of expressive gesture animation. One long-term goal for future work is to animate body motion for a virtual agent that could express an emotional state towards the human interlocutor. Since the proposed framework for gesture dynamics modeling is based on discrete gesture patterns, it could be readily incorporated into traditional pattern-based animation approaches, where a complex gesture is generated by continuously concatenating selected patterns. We have also demonstrated that the human global perception of emotions is triggered by a few events over an interaction. Therefore, to achieve a pleasant and natural human-agent conversation, it is also possible to create an intelligent agent that can instantly sense the emotional state of the human user from a few prominent gesture events at the beginning of the conversation, adapt the dialog strategy accordingly, and respond appropriately.
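To indicate how discrete gesture patterns could feed such a pattern-based animation pipeline, here is a minimal sketch that concatenates selected pattern segments with a linear cross-fade at each boundary; the segment contents, dimensionality and blend length are invented for illustration and are not taken from the paper.

```python
import numpy as np

def concatenate_patterns(patterns, blend=5):
    """Join a list of (frames x dofs) pattern segments, linearly
    cross-fading `blend` frames at each boundary (illustrative only)."""
    out = patterns[0]
    for nxt in patterns[1:]:
        w = np.linspace(0.0, 1.0, blend)[:, None]           # blend weights
        overlap = (1 - w) * out[-blend:] + w * nxt[:blend]  # cross-fade region
        out = np.vstack([out[:-blend], overlap, nxt[blend:]])
    return out

rng = np.random.default_rng(3)
patterns = [rng.normal(size=(rng.integers(20, 40), 12)) for _ in range(4)]
motion = concatenate_patterns(patterns)
print(motion.shape)   # (total frames after blending, 12)
```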

    REFERENCES

[1] J. Harrigan, R. Rosenthal, and K. Scherer, The New Handbook of Methods in Nonverbal Behavior Research. Oxford Univ. Press, 2005.

[2] A. Kleinsmith and N. Bianchi-Berthouze, "Affective body expression perception and recognition: A survey," Affective Computing, IEEE Transactions on, vol. 4, no. 1, pp. 15–33, 2013.

[3] J. Wachs, M. Kölsch, H. Stern, and Y. Edan, "Vision-based hand-gesture applications," Communications of the ACM, vol. 54, no. 2, pp. 60–71, 2011.

[4] H. K. Meeren, C. C. van Heijnsbergen, and B. de Gelder, "Rapid perceptual integration of facial expression and emotional body language," Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 45, pp. 16518–16523, 2005.

[5] D. Bernhardt and P. Robinson, "Detecting affect from non-stylised body motions," in Affective Computing and Intelligent Interaction. Springer, 2007, pp. 59–70.

[6] H. G. Wallbott, "Bodily expression of emotion," European Journal of Social Psychology, vol. 28, no. 6, pp. 879–896, 1998.

[7] M. De Meijer, "The contribution of general features of body movement to the attribution of emotions," Journal of Nonverbal Behavior, vol. 13, no. 4, pp. 247–268, 1989.

[8] C. Conati, "Probabilistic assessment of user's emotions in educational games," Applied Artificial Intelligence, vol. 16, no. 7-8, pp. 555–575, 2002.

[9] A. Kendon, "Gesticulation and speech: Two aspects of the process of utterance," The Relationship of Verbal and Nonverbal Communication, vol. 25, pp. 207–227, 1980.

[10] Z. Yang, A. Metallinou, E. Erzin, and S. Narayanan, "Analysis of interaction attitudes using data-driven hand gesture phrases," in ICASSP, 2014, pp. 699–703.

[11] Z. Yang, A. Ortega, and S. Narayanan, "Gesture dynamics modeling for attitude analysis using graph based transform," in IEEE International Conference on Image Processing, 2014, pp. 1515–1519.

[12] A. Metallinou, C.-C. Lee, C. Busso, S. Carnicke, and S. Narayanan, "The USC CreativeIT database: A multimodal database of theatrical improvisation," in Proc. of Multimodal Corpora: Advances in Capturing, Coding and Analyzing Multimodality (MMC), 2010.

[13] A. Metallinou, Z. Yang, C.-C. Lee, C. Busso, S. Carnicke, and S. Narayanan, "The USC CreativeIT database of multimodal dyadic interactions: From speech and full body motion capture to continuous emotional annotations," Language Resources and Evaluation, 2015.

[14] A. Mehrabian and J. T. Friar, "Encoding of attitude by a seated communicator via posture and position cues," Journal of Consulting and Clinical Psychology, vol. 33, no. 3, p. 330, 1969.

[15] A. Melinger and W. J. Levelt, "Gesture and the communicative intention of the speaker," Gesture, vol. 4, no. 2, pp. 119–141, 2004.

[16] D. McNeill, Gesture and Thought. University of Chicago Press, 2008.

[17] M. Karg, K. Kuhnlenz, and M. Buss, "Recognition of affect based on gait patterns," Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, vol. 40, no. 4, pp. 1050–1061, 2010.

[18] A. Kapur, V.-B. Naznin, G. Tzanetakis, and P. F. Driessen, "Gesture-based affective computing on motion capture data," in Affective Computing and Intelligent Interaction, 2005, pp. 1–7.

[19] A. Metallinou, A. Katsamanis, and S. Narayanan, "Tracking continuous emotional trends of participants during affective dyadic interactions using body language and speech information," Image and Vision Computing, vol. 31, no. 2, pp. 137–152, 2013.

[20] M. A. Nicolaou, H. Gunes, and M. Pantic, "Continuous prediction of spontaneous affect from multiple cues and modalities in valence-arousal space," Affective Computing, IEEE Transactions on, vol. 2, no. 2, pp. 92–105, 2011.

[21] N. Savva and N. Bianchi-Berthouze, "Automatic recognition of affective body movement in a video game scenario," in Intelligent Technologies for Interactive Entertainment, 2012, pp. 149–159.

[22] D. Bernhardt and P. Robinson, "Detecting emotions from connected action sequences," in Visual Informatics: Bridging Research and Practice. Springer, 2009, pp. 1–11.

[23] A. Camurri, B. Mazzarino, M. Ricchetti, R. Timmers, and G. Volpe, "Multimodal analysis of expressive gesture in music and dance performances," in Gesture-Based Communication in Human-Computer Interaction. Springer, 2004, pp. 20–39.

[24] S. Levine, C. Theobalt, and V. Koltun, "Real-time prosody-driven synthesis of body language," in ACM Transactions on Graphics, vol. 28, no. 5, 2009, p. 172.

[25] J. Barbic, A. Safonova, J.-Y. Pan, C. Faloutsos, J. Hodgins, and N. Pollard, "Segmenting motion capture data into distinct behaviors," in Proc. of Graphics Interface, 2004, pp. 185–194.

[26] F. Zhou, F. De la Torre, and J. Hodgins, "Hierarchical aligned cluster analysis for temporal clustering of human motion," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 3, pp. 582–596, 2013.

[27] M. Sargin, Y. Yemez, E. Erzin, and A. Tekalp, "Analysis of head gesture and prosody patterns for prosody-driven head-gesture animation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 8, pp. 1330–1345, 2008.

[28] F. Ofli, E. Erzin, Y. Yemez, and A. M. Tekalp, "Learn2Dance: Learning statistical music-to-dance mappings for choreography synthesis," IEEE Transactions on Multimedia, vol. 14, no. 4, pp. 747–759, 2012.

[29] W. M. Campbell, D. E. Sturim, and D. A. Reynolds, "Support vector machines using GMM supervectors for speaker verification," Signal Processing Letters, IEEE, vol. 13, no. 5, pp. 308–311, 2006.

[30] D. A. Reynolds, "An overview of automatic speaker recognition," in ICASSP, 2002, pp. 4072–4075.

[31] H. Li, G. Hua, Z. Lin, J. Brandt, and J. Yang, "Probabilistic elastic matching for pose variant face verification," in Computer Vision and Pattern Recognition, 2013, pp. 3499–3506.

[32] M. Liu, S. Shan, R. Wang, and X. Chen, "Learning expressionlets on spatio-temporal manifold for dynamic facial expression recognition," in Computer Vision and Pattern Recognition, 2014, pp. 1749–1756.

[33] V. Arsigny, P. Fillard, X. Pennec, and N. Ayache, "Geometric means in a novel vector space structure on symmetric positive-definite matrices," SIAM Journal on Matrix Analysis and Applications, vol. 29, no. 1, pp. 328–347, 2007.

[34] S. M. Carnicke, Stanislavsky in Focus: An Acting Master for the Twenty-First Century. Routledge, UK, 2008.

[35] I. Guide, "Autodesk®," 2008.

[36] D. Glowinski, A. Camurri, G. Volpe, N. Dael, and K. Scherer, "Technique for automatic emotion recognition by body gesture analysis," in Computer Vision and Pattern Recognition Workshops, 2008, pp. 1–6.

[37] S. Mariooryad and C. Busso, "Exploring cross-modality affective reactions for audiovisual emotion recognition," Affective Computing, IEEE Transactions on, vol. 4, no. 2, 2013.

[38] A. Metallinou, M. Wollmer, A. Katsamanis, F. Eyben, B. Schuller, and S. Narayanan, "Context-sensitive learning for enhanced audiovisual emotion classification," Affective Computing, IEEE Transactions on, vol. 3, no. 2, pp. 184–198, 2012.

[39] M. Wöllmer, F. Eyben, B. Schuller, E. Douglas-Cowie, and R. Cowie, "Data-driven clustering in emotional space for affect recognition using discriminatively trained LSTM networks," in INTERSPEECH, 2009, pp. 1595–1598.

[40] S. Young, G. Evermann, D. Kershaw, G. Moore, J. Odell, D. Ollason, V. Valtchev, and P. Woodland, "The HTK book," Cambridge University Engineering Department, vol. 3, p. 175, 2002.

[41] U.-V. Marti and H. Bunke, "On the influence of vocabulary size and language models in unconstrained handwritten text recognition," in Proc. of ICDAR, 2001, pp. 260–265.

[42] L. Maaten, "Learning a parametric embedding by preserving local structure," in International Conference on Artificial Intelligence and Statistics, 2009, pp. 384–391.

[43] D. Glowinski, N. Dael, A. Camurri, G. Volpe, M. Mortillaro, and K. Scherer, "Toward a minimal representation of affective gestures," Affective Computing, IEEE Transactions on, vol. 2, no. 2, pp. 106–118, 2011.

[44] J. Sanghvi, G. Castellano, I. Leite, A. Pereira, P. W. McOwan, and A. Paiva, "Automatic analysis of affective postures and body motion to detect engagement with a game companion," in Human-Robot Interaction (HRI), 2011, pp. 305–311.

[45] G. Humphrey, "The psychology of the gestalt," Journal of Educational Psychology, vol. 15, no. 7, p. 401, 1924.

[46] K. M. Lindahl, P. Kerig, and K. Lindahl, "Methodological issues in family observational research," Family Observational Coding Systems: Resources for Systemic Research, pp. 23–32, 2001.

[47] R. Levitan, A. Gravano, and J. Hirschberg, "Entrainment in speech preceding backchannels," in Association for Computational Linguistics: Human Language Technologies, 2011, pp. 113–117.

[48] B. Xiao, P. G. Georgiou, B. Baucom, and S. Narayanan, "Head motion modeling for human behavior analysis in dyadic interaction," IEEE Transactions on Multimedia, vol. 17, no. 7, pp. 1107–1119, 2015.

[49] C.-C. Lee, A. Katsamanis, M. P. Black, B. R. Baucom, A. Christensen, P. G. Georgiou, and S. S. Narayanan, "Computing vocal entrainment: A signal-derived PCA-based quantification scheme with application to affect analysis in married couple interactions," Computer Speech & Language, vol. 28, no. 2, pp. 518–539, 2014.

[50] B. Xiao, B. Baucom, P. Georgiou, and S. Narayanan, "Modeling head motion entrainment for prediction of couples' behavioral characteristics," in Affective Computing and Intelligent Interaction, Xian, China, 2015.

[51] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., "Deep neural networks for acoustic modeling in speech recognition," Signal Processing Magazine, IEEE, vol. 29, no. 6, pp. 82–97, 2012.

Zhaojun Yang is a Ph.D. candidate in Electrical Engineering at the University of Southern California (USC). She received her B.E. degree in Electrical Engineering from the University of Science and Technology of China (USTC) in 2009 and her M.Phil. degree in Systems Engineering and Engineering Management from the Chinese University of Hong Kong (CUHK) in 2011. Her research interests include multimodal emotion recognition and analysis, interaction modeling, and spoken dialog systems. She was awarded the USC Annenberg Fellowship (2011-2015).


Shrikanth S. Narayanan (S'88-M'95-SM'02-F'09) is the Andrew J. Viterbi Professor of Engineering at the University of Southern California (USC), and holds appointments as Professor of Electrical Engineering, Computer Science, Linguistics, Psychology, Neuroscience and Pediatrics, and as the founding director of the Ming Hsieh Institute. Prior to USC he was with AT&T Bell Labs and AT&T Research from 1995 to 2000. At USC he directs the Signal Analysis and Interpretation Laboratory (SAIL). His research focuses on human-centered signal and information processing and systems modeling, with an interdisciplinary emphasis on speech, audio, language, multimodal and biomedical problems and applications with direct societal relevance. [http://sail.usc.edu]

Prof. Narayanan is a Fellow of the Acoustical Society of America and the American Association for the Advancement of Science (AAAS), and a member of Tau Beta Pi, Phi Kappa Phi, and Eta Kappa Nu. He is also an Editor in Chief for the IEEE Journal of Selected Topics in Signal Processing, an Editor for the Computer Speech and Language Journal, and an Associate Editor for the IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, IEEE TRANSACTIONS ON SIGNAL AND INFORMATION PROCESSING OVER NETWORKS, APSIPA TRANSACTIONS ON SIGNAL AND INFORMATION PROCESSING and the Journal of the Acoustical Society of America. He was previously an Associate Editor of the IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING (2000-2004), IEEE SIGNAL PROCESSING MAGAZINE (2005-2008) and the IEEE TRANSACTIONS ON MULTIMEDIA (2008-2011). He is a recipient of a number of honors, including Best Transactions Paper awards from the IEEE Signal Processing Society in 2005 (with A. Potamianos) and in 2009 (with C. M. Lee), and selection as an IEEE Signal Processing Society Distinguished Lecturer for 2010-2011 and an ISCA Distinguished Lecturer for 2015-2016. Papers co-authored with his students have won awards including the 2014 Ten-Year Technical Impact Award from ACM ICMI and at the Interspeech 2015 Nativeness Detection Challenge, 2014 Cognitive Load Challenge, 2013 Social Signal Challenge, Interspeech 2012 Speaker Trait Challenge, Interspeech 2011 Speaker State Challenge, InterSpeech 2013 and 2010, InterSpeech 2009 Emotion Challenge, IEEE DCOSS 2009, IEEE MMSP 2007, IEEE MMSP 2006, ICASSP 2005 and ICSLP 2002. He has published over 650 papers and has been granted seventeen U.S. patents.