
    FACIAL ANIMATION PARAMETERS EXTRACTION AND EXPRESSION

    RECOGNITION USING HIDDEN MARKOV MODELS

Montse Pardàs, Antonio Bonafonte

Universitat Politècnica de Catalunya, C/ Jordi Girona 1-3, D-5

08034 Barcelona, Spain

    email: {montse,antonio}@gps.tsc.upc.es

    This work has been supported by the European project InterFace and TIC2001-0996 of the Spanish

    Government

    The video analysis system described in this paper aims at facial expression recognition

    consistent with the MPEG4 standardized parameters for facial animation, FAP. For this

    reason, two levels of analysis are necessary: low level analysis to extract the MPEG4

    compliant parameters and high level analysis to estimate the expression of the sequence

    using these low level parameters.

    The low level analysis is based on an improved active contour algorithm that uses high

    level information based on Principal Component Analysis to locate the most significant

    contours of the face (eyebrows and mouth), and on motion estimation to track them.

    The high level analysis takes as input the FAP produced by the low level analysis tool

    and, by means of a Hidden Markov Model classifier, detects the expression of the

    sequence.

    1. INTRODUCTION

    The critical role that emotions play in rational decision-making, in perception and in

human interaction has generated interest in introducing the ability to recognize and


    reproduce emotions in computers. In [26] many applications which could benefit from

    this ability are explored. The importance of introducing non verbal communication in

    automatic dialogue systems is highlighted in [4] and [5]. However, the research that has

    been carried out on understanding human facial expressions cannot be directly applied

    to a dialogue system, as it has mainly worked on static images or on image sequences

    where the subjects show a specific emotion but there is no speech. Psychological studies

    have indicated that at least six emotions are universally associated with distinct facial

    expressions: happiness, sadness, surprise, fear, anger and disgust [32]. Several other

    emotions and many combinations of emotions have been studied but remain

    unconfirmed as universally distinguishable. Thus, most of the research up to now has

    been oriented towards detecting these six basic expressions.

Some research has been conducted on pictures that capture the subject's expression at

    its peak. These pictures allow detecting the presence of static cues (such as wrinkles) as

    well as the position and shape of the facial features [32]. For instance, in [23] and [7]

    classifiers for facial expressions were based on neural networks. Static faces were

    presented to the network as projections of blocks from feature regions onto the principal

    component space generated from the image data set. They obtain around 86% accuracy

    in distinguishing the six basic emotions.

    Research has also been conducted on the extraction of facial expressions from video

    sequences. Most works in this area develop a video database from subjects making

    expressions on demand. Their aim has been to classify the six basic expressions that we

    have mentioned above, and they have not been tested in a dialogue environment. Most

    approaches in this area rely on the Ekman and Friesen Facial Action Coding System

    [10]. The FACS is based on the enumeration of all Action Units of a face that cause

    facial movements. The combination of these actions units results in a large set of


    possible facial expressions. In [31] the directions of rigid and non-rigid motions that are

    caused by human facial expressions are identified computing the optical flow at the

    points with high gradient at each frame. Then they propose a dictionary to describe the

    facial actions of the FACS through the motion of the features and a rule-based system to

    recognize facial expressions from these facial actions. A similar approach is presented

    in [3], where a local parameterized model of the image motion in specific facial areas is

    used for recovering and recognising the non-rigid and articulated motion of the faces.

    The parameters of this detected motion are related to the motion of facial features

    during facial expressions. In [28] a radial basis function network architecture is

    developed that learns the correlation between facial feature motion patterns and human

    emotions. The motion of the features is also obtained computing the optical flow at the

    points with high gradient. The accuracy of all these systems is also around 85%.

    Other approaches employ physically-based models of heads including skin and

    musculature [11], [12]. They combine this physical model with registered optical flow

    measurements from human faces to estimate the muscle actuations. They propose two

    approaches, the first one creates typical patterns of muscle actuation for the different

    facial expressions. A new image sequence is classified by its similarity to the typical

    patterns of muscle actuation. The second one builds on the same methodology to

    generate, from the muscle actuation, the typical pattern of motion energy associated

    with each facial expression. In [19] a system is presented that uses facial feature point

    tracking, dense flow tracking and high gradient component analysis in the spatio-

    temporal domain to extract the FACS Action Units, from which expressions can be

derived. A complete review of these techniques can be found in [24].

    In this work we have developed an expression recognition technique consistent with the

    MPEG4 standardized parameters for facial definition and animation, FDP and FAP.


Thus, the expression recognition process can be divided into two steps: facial parameter extraction and facial parameter analysis, which we also refer to as low level and high level

    analysis. The facial parameter extraction process is based on a feature point detection

    and tracking system which uses active contours. Conventional active contours (snakes)

    approaches [16] find the position of the snake by finding a minimum of its energy,

    composed of internal and external forces. The external forces pull the contours toward

    features such as lines and edges. However, in many applications this minimization leads

    to contours that do not represent correctly the feature we are looking for. We propose in

    this paper to introduce some higher level information by a statistical characterization of

    the snaxels that should represent the contour. From the automatically produced

    initialization, the MPEG-4 compliant FAP are computed. These FAP will be used for

    training of the expressions extraction system. These techniques are developed in Section

    2.

Using the MPEG4 parameters for the analysis of facial emotions has several advantages. First, the developed high level analysis can benefit from already existing low level analysis techniques for FDP and FAP extraction, as well as from any advances made in this area in the coming years. Besides, the low-level FAP

    constitute a concise representation of the evolution of the expression of the face. From

    the training database, and using the available FAP, spatio-temporal patterns for

    expressions will be constructed. Our approach for the interpretation of facial

    expressions will use Hidden Markov Models (HMM) to recognize different patterns of

    FAP evolution. For every defined facial expression, a HMM will be trained with the

    extracted feature vectors. In the recognition phase, the HMM system will use as

    classification criterion the probability of the input sequence with respect to all the

    models. Tests are also performed in sequences composed of different emotions and with


    periods of speech, using connected HMM. This process will be explained in Section 3.

Finally, Section 4 will summarize the conclusions of this work.

    2. LOW LEVEL ANALYSIS

    This Section describes the low level video processing required for the extraction of

Facial Animation Parameters (FAPs). The first step in this analysis is face detection.

    Different techniques can be used for this aim, like [17] or [30], and we will not go into

    details of this part. After the face is detected, facial features have to be located and

tracked. Two different approaches are possible: contour based or model based. The first approach localizes the most important contours of the face and tracks them in 2D (that is, tracks their projection in the image plane). The second approach consists

    in the adaptation, in each frame, of a 3D wireframe model of a face to the image.

Examples of this second approach can be found in [1], [6], [8] and [9]. In this work we will use a technique belonging to the first class because, as will be explained later on,

    we will work on a restricted environment. Facial Animation Parameters (FAPs) will be

computed from the tracked contours or from the adapted model. First, in section 2.1 we will review the meaning of these parameters. The following subsections are devoted to presenting a 2D technique for the extraction of these parameters, which uses active contours. Section 2.2 will review the general framework of active contours, section 2.3 will apply this technique to facial feature detection and section 2.4 to facial feature

    tracking.

    2.1 FACIAL ANIMATION PARAMETERS

    Facial Animation Parameters (FAPs) are defined in the ISO MPEG-4 standard [21],

    together with the Facial Definition Parameters (FDPs), to allow the definition of a facial


    shape and its animation reproducing expressions, emotions, and speech pronunciation.

FDPs are illustrated in Figure 1 (figures 1 and 2 have been provided by Roberto Pockaj, from the University of Genova); they represent key points in a human face. All the feature points can be used for the calibration of a model to a face, while only those represented by black dots in the image are also used for the animation (that is, there are FAPs describing their motion).

    The FAPs are based on the study of minimal facial actions and are closely related to

    muscle actions. They represent a complete set of basic facial actions, such as squeeze or

    raise eyebrows, open or close eyelids, and therefore allow the representation of most

    natural expressions. All FAPs involving translational movement are expressed in terms

    of the Facial Animation Parameter Units (FAPUs). These units aim at allowing

    interpretation of the FAPs on any facial model in a consistent way, producing

    reasonable results in terms of expression and speech pronunciation. FAPUs are

    illustrated in Figure 1 and correspond to fractions of distances between some key facial

    features.

    We will be interested in those FAPs which a) convey important information about the

    emotion of the face and b) can be reliably extracted from a natural video sequence. In

particular, we have focused on those FAPs related to the motion of the contour of the

    eyebrows and the mouth. These FAPs are specified in Table 1.


    Figure 1. Facial Definition Parameters

FAPU     Value
IRISD    IRISD0 / 1024
ES       ES0 / 1024
ENS      ENS0 / 1024
MNS      MNS0 / 1024
MW       MW0 / 1024
AU       10^-5 rad

Figure 2. Facial Animation Parameter Units

#   FAP name               FAP description                                                                      Unit
31  raise_l_i_eyebrow      Vertical displacement of left inner eyebrow                                          ENS
32  raise_r_i_eyebrow      Vertical displacement of right inner eyebrow                                         ENS
33  raise_l_m_eyebrow      Vertical displacement of left middle eyebrow                                         ENS
34  raise_r_m_eyebrow      Vertical displacement of right middle eyebrow                                        ENS
35  raise_l_o_eyebrow      Vertical displacement of left outer eyebrow                                          ENS
36  raise_r_o_eyebrow      Vertical displacement of right outer eyebrow                                         ENS
37  squeeze_l_eyebrow      Horizontal displacement of left eyebrow                                              ES
38  squeeze_r_eyebrow      Horizontal displacement of right eyebrow                                             ES
51  lower_t_midlip_o       Vertical top middle outer lip displacement                                           MNS
52  raise_b_midlip_o       Vertical bottom middle outer lip displacement                                        MNS
53  stretch_l_cornerlip_o  Horizontal displacement of left outer lip corner                                     MW
54  stretch_r_cornerlip_o  Horizontal displacement of right outer lip corner                                    MW
55  lower_t_lip_lm_o       Vertical displacement of midpoint between left corner and middle of top outer lip    MNS
56  lower_t_lip_rm_o       Vertical displacement of midpoint between right corner and middle of top outer lip   MNS
57  raise_b_lip_lm_o       Vertical displacement of midpoint between left corner and middle of bottom outer lip MNS
58  raise_b_lip_rm_o       Vertical displacement of midpoint between right corner and middle of bottom outer lip MNS
59  raise_l_cornerlip_o    Vertical displacement of left outer lip corner                                       MNS
60  raise_r_cornerlip_o    Vertical displacement of right outer lip corner                                      MNS

Table 1. FAPs used for the eyebrows and the mouth
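As a concrete illustration of how the FAPUs are used, the following sketch (our own minimal Python example with hypothetical pixel values, not the authors' extraction tool) converts a tracked feature-point displacement into a FAP value. It assumes the neutral-frame distances ENS0, MNS0, MW0 and ES0 are measured in pixels on the first frame of the sequence.

```python
# Minimal sketch: express a feature-point displacement in FAPU units.
# The MPEG-4 standard fixes the FAPU definitions (e.g. ENS = ENS0 / 1024);
# the distances and displacements below are hypothetical pixel values.

def compute_fapus(ens0, mns0, mw0, es0):
    """FAPUs from the neutral-frame distances measured in pixels."""
    return {
        "ENS": ens0 / 1024.0,   # eye-nose separation unit
        "MNS": mns0 / 1024.0,   # mouth-nose separation unit
        "MW": mw0 / 1024.0,     # mouth width unit
        "ES": es0 / 1024.0,     # eye separation unit
    }

def fap_value(displacement_px, fapu):
    """A translational FAP is the pixel displacement expressed in its FAPU."""
    return displacement_px / fapu

fapus = compute_fapus(ens0=60.0, mns0=30.0, mw0=50.0, es0=90.0)
# FAP 31 (raise_l_i_eyebrow): vertical motion of the left inner eyebrow point
# between the neutral frame and the current frame, here 4 pixels.
print(round(fap_value(4.0, fapus["ENS"])))   # about 68 FAPU
```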

    To extract these FAPs in frontal faces, the tracking of the 2-D contour of the eyebrows

    and mouth is sufficient. If the sequences present rotation and translation of the head,

then a system based on 3D face model tracking would be more robust. In our case the

    FAPs are used to train the HMM classifier. For this reason, it is better to constrain the

    kind of sequences that are going to be analyzed. We will use frontal faces and a

    supervised tracking procedure, in order not to introduce any errors in the training of the

    system. Once the system has been trained, any FAP extraction tool can be used. If the

    results it produces are accurate, the high level analysis will have a higher probability of

    success. The low level analysis system that we propose for training can also be used for

    testing with other sequences, as long as the faces are frontal or we previously apply a

    global motion estimation for the head.


    The detection and tracking algorithm that will be explained in the next section has been

    integrated in a Graphical User Interface (GUI) that supports corrections to the results

    being produced by the automatic algorithm. From the detection of facial features

    produced, it computes the FAP units and then from the results of the automatic tracking

    it writes the MPEG-4 compliant FAP files. These FAP files will be used for training of

    the expressions extraction system.

The algorithm is applied to the tracking of the eyebrows and the mouth, thus producing the MPEG4 FAPs for the eyebrows (31-38) and for the mouth (51-60), described in Table 1.

    2.2 ACTIVE CONTOURS

    2.2.1 Introduction

    Active contours were first introduced by Kass et al. [16]. They proposed energy

    minimization as a framework where low-level information (such as image gradient or

    image intensity) can be combined with higher-level information (such as shape,

    continuity of the contour or user interactivity). In their original work the energy

    minimization problem was solved using a variational technique. In [2] Amini et al.

    proposed Dynamic Programming (DP) as a different solution to the minimization

problem. The use of the Dynamic Programming approach will allow us to introduce the concept of candidates for the control points of the contour, thus avoiding the risk of falling into local minima located near the initial position.

    In [25], we proposed a first approach for tracking facial features, based on dynamic

    programming, introducing a new term in the energy of the snake, and selecting the

    candidate pixels for the contour (snaxels) using motion estimation. Currently the DP


approach has been extended to automatically initialize the facial contours. In

    this case, the candidate snaxels are selected by a statistical characterization of the

    contour based on Principal Component Analysis (PCA). In this subsection we will

    review the basic formulation of active contours, while the next subsections explain how

    we apply them for facial feature detection (2.3) and for facial feature tracking (2.4).

    2.2.2 Active contours formulation

In the discrete formulation of active contour models a contour is represented as a set of snaxels v_i = (x_i, y_i) for i = 0, ..., N-1, where x_i and y_i are the x and y coordinates of the snaxel v_i. The energy of the contour, which is going to be minimized, is defined by:

$$E_{snake} = \sum_{i=0}^{N-1} \left( E_{int}(v_i) + E_{ext}(v_i) \right) \qquad (1)$$

We can use a discrete approximation of the second derivative to compute E_int:

$$E_{int}(v_i) = \left| v_{i-1} - 2 v_i + v_{i+1} \right|^2 \qquad (2)$$

    This is an approximation to the curvature of the contour at snaxel i, if the snaxels are

    equidistant. Minimizing this energy will produce smooth curves. This is only

appropriate for snaxels that are not corners of the contour. More appropriate

    definitions for the internal energy will be proposed in Section 2.3.2 for the initialisation

    of the snake and in Section 2.4.2 for tracking.

The purpose of the term E_ext is to attract the snake to desired feature points or contours in the image. In this work we have used the gradient of the image intensity I(x,y) along the contour from v_i to v_{i+1}. Thus, the E_ext at snaxel v_i will depend only on the position of the snaxels v_i and v_{i+1}. That is,


$$E_{ext}(v_i) = E_{cont}(v_i, v_{i+1}) = f\big(\nabla I,\; v_i,\; v_{i+1}\big) \qquad (3)$$

    However, a new term will also be added to the external energy, in order to be able to

    track contours that have stronger edges nearby.
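As a rough illustration of Eqs. (1)-(3), the sketch below (our own minimal NumPy interpretation, not the authors' code; sampling the gradient magnitude at a fixed number of points along the segment is an assumption) evaluates the two energy terms for a single snaxel.

```python
import numpy as np

def internal_energy(v_prev, v, v_next):
    """Discrete curvature term of Eq. (2): |v_{i-1} - 2 v_i + v_{i+1}|^2."""
    return float(np.sum((v_prev - 2.0 * v + v_next) ** 2))

def external_energy(grad_mag, v, v_next, n_samples=10):
    """Eq. (3): a function of the image gradient along the segment v -> v_next.
    Here we use minus the mean gradient magnitude, so that strong edges along
    the contour yield a low (attractive) energy."""
    ts = np.linspace(0.0, 1.0, n_samples)
    pts = np.rint((1.0 - ts)[:, None] * v[None, :] + ts[:, None] * v_next[None, :]).astype(int)
    cols = np.clip(pts[:, 0], 0, grad_mag.shape[1] - 1)   # x coordinates
    rows = np.clip(pts[:, 1], 0, grad_mag.shape[0] - 1)   # y coordinates
    return -float(grad_mag[rows, cols].mean())
```

These two terms are what Eq. (4) below groups into a single cost per snaxel, which the dynamic programming recursion then minimises over the candidate positions.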

    2.2.3 Dynamic programming

We will use the DP approach to minimize the energy in Eq. (1). Let us express the energy of the snake making the dependencies of its terms explicit:

$$E_{snake} = \sum_{i=0}^{N-1} E(v_{i-1}, v_i, v_{i+1}) = \sum_{i=0}^{N-1} \left[ E_{int}(v_{i-1}, v_i, v_{i+1}) + E_{ext}(v_i, v_{i+1}) \right] \qquad (4)$$

    Although snakes can be open or closed, the DP approach can be applied directly only to

    open snakes. To apply DP to open snakes, the limits of Eq. 4 are adjusted to 1 and N-2

    respectively.

Now, as described in [2], this energy can be minimized via discrete DP by defining a two-element vector of state variables in the i-th decision stage: (v_{i+1}, v_i). The optimal value function S_i(v_{i+1}, v_i) is a function of two adjacent points on the contour, and can be calculated, for every pair of possible candidate positions for snaxels v_{i+1} and v_i, as:

$$S_i(v_{i+1}, v_i) = \min_{v_{i-1}} \left[ S_{i-1}(v_i, v_{i-1}) + E(v_{i-1}, v_i, v_{i+1}) \right] \qquad (5)$$

S_0(v_1, v_0) is initialized to E_ext(v_0, v_1) for every possible candidate pair (v_0, v_1), and from this, S_i can be computed iteratively from i = 1 up to i = N-2 for every candidate position for v_i. The total energy of the snake will be

$$E_{snake} = \min_{v_{N-1},\, v_{N-2}} S_{N-2}(v_{N-1}, v_{N-2}) \qquad (6)$$

Besides, we have to store at every step i a matrix M_i with the position of v_{i-1} that minimizes Eq. (5), that is,


$$M_i(v_{i+1}, v_i) = v_{i-1} \quad \text{such that } v_{i-1} \text{ minimizes (5)}.$$

By backtracking from the final energy of the snake and using the matrices M_i, the optimal

    position for every snaxel can be found.

In the case of a closed contour, the solution proposed in [13] is to impose that the first and last snaxels coincide, fixing them to a given candidate for this position. The application of the DP algorithm will produce the best result under this restriction. Then this initial (and final) snaxel is successively changed to all the possible candidates, and the one that produces the smallest energy is selected. We use an approximation proposed

    in [14] that requires only two open contour optimisation steps.
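The recursion of Eqs. (4)-(6) for an open snake with a short candidate list per snaxel can be sketched as follows (a simplified illustration under our own assumptions, not the authors' implementation; `energy(v_prev, v, v_next)` stands for the combined internal plus external term, and S_0 is initialised to zero here rather than to E_ext(v_0, v_1)).

```python
import numpy as np

def dp_open_snake(candidates, energy):
    """candidates[i]: array of shape (m_i, 2) with candidate positions for snaxel i.
    energy(v_prev, v, v_next) -> float is the per-snaxel term of Eq. (4).
    Returns one chosen position per snaxel.  Cost is O(N * m^3), as in section 2.2.4."""
    N = len(candidates)
    m = [len(c) for c in candidates]
    # S[i][a, b]: optimal partial cost given v_i = candidates[i][a], v_{i+1} = candidates[i+1][b]
    S = [np.zeros((m[i], m[i + 1])) for i in range(N - 1)]
    back = [np.zeros((m[i], m[i + 1]), dtype=int) for i in range(N - 1)]
    for i in range(1, N - 1):
        for a, v in enumerate(candidates[i]):
            for b, v_next in enumerate(candidates[i + 1]):
                costs = [S[i - 1][p, a] + energy(v_prev, v, v_next)
                         for p, v_prev in enumerate(candidates[i - 1])]
                back[i][a, b] = int(np.argmin(costs))
                S[i][a, b] = float(np.min(costs))
    # Eq. (6): best final pair, then backtrack through the stored indices (matrix M_i).
    a, b = np.unravel_index(np.argmin(S[N - 2]), S[N - 2].shape)
    idx = [0] * N
    idx[N - 2], idx[N - 1] = int(a), int(b)
    for i in range(N - 2, 0, -1):
        idx[i - 1] = int(back[i][idx[i], idx[i + 1]])
    return [candidates[i][idx[i]] for i in range(N)]
```

For a closed contour, this routine would be run with a fixed first/last snaxel as described above, or twice following the approximation of [14].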

    2.2.4 Selection of candidates

    Up to now, we have assumed that for every snaxel vi there is a finite (and hopefully

    small) number of candidates, but we have omitted how to select these candidates. The

computational complexity of each optimisation step is O(nm^3), where n is the number of

    snaxels and m the number of candidates for every snaxel. Thus, it is very important to

    maintain m low.

    In [2] only a small neighbourhood around the previous position of the snaxel was

    considered. However, the algorithm was iteratively applied starting from the obtained

    solution until there was no change in the total energy of the snake. This method has

    several disadvantages. First, like in the approaches which use variational techniques for

    the minimization, the snake can fall into a local minimum. Second, the computational

    time can be very high if the initialisation is far from the minimum.

    In [13] and [14] a different set of candidates is considered for every snaxel. In

    particular, [13] establishes uncertainty lists for the high curvature points and defines a


    search space between these uncertainty lists. In [14] the search zone is defined with two

    initial concentric contours. Each contour point is constrained to lie on a line joining

    these two initial contours. This approach gives very good results if the two concentric

    contours that contain the expected contour are available and the contour being tracked is

the absolute minimum in this area. However, these concentric contours are not always

    available.

    In the next sections we will describe how we can select these candidates and how the

    snake energy is defined for facial feature point detection and tracking, respectively.

    2.3 FACIAL FEATURE POINT DETECTION

    2.3.1 Selection of candidates

We propose a new method whose first step is to fix the topology of the snakes. In our case, we use a 16-snaxel equally spaced snake for the mouth and an 8-snaxel equally spaced snake for the eyebrows. To select the best candidates for each of these

    snaxels we compute what we call the vi-eigen-snaxel, by extracting samples of them

    from a database. That is, after resizing the faces from our database to the same size

    (200x350), we extract for each snaxel vi the 16x16 area around the snaxel in every

image of the database. After illumination normalization using a histogram

    specification for every snaxel, the extracted sub-images are used to form the training set

    of vectors for the snaxel vi, and from them, the eigenvectors (eigen-snaxels) are

    computed by classical PCA techniques.

    The first step for initialising the snakes in a new image is to roughly locate the face.

    Different techniques can be used for this aim [17], [30]. After size normalization of the

    face area, a large area around the rough position for every snaxel vi is examined by


    computing the reconstruction error with the corresponding vi-eigen-snaxel. Those pixels

leading to the smallest reconstruction errors are considered as candidates for the snaxel vi.

    Principal Component Analysis (PCA)

    The principle of PCA is to construct a low dimensional space with decorrelated features

    which preserve the majority of the variation in the training set [20]. PCA has been used

    in the past to detect faces (by means of eigen-faces) or facial features (by means of

    eigen-features). In this paper, we extend its use to detect the snake control points for

    specific facial features (by using eigen-snaxels).

    For each snaxel vi, the vectors {xt} are constructed by lexicographic ordering of the

    pixel elements of each sub-image from the training set. A partial KLT is performed on

    these vectors to identify the largest-eigenvalue eigenvectors.

    Distance measure

To evaluate the feasibility of a given pixel being a given snaxel, we first construct the vector x using the corresponding sub-image. Then we obtain a principal component feature vector $y = \Phi_M^T \tilde{x}$, where $\tilde{x} = x - \bar{x}$ is the mean-normalized image vector, $\Phi$ is the eigenvector matrix for snaxel v_i and $\Phi_M$ is a sub-matrix of $\Phi$ that contains the M principal eigenvectors. The reconstruction error is calculated as:

$$\epsilon^2 = \sum_{i=M+1}^{N} y_i^2 = \|\tilde{x}\|^2 - \sum_{i=1}^{M} y_i^2 \qquad (7)$$

We have used M=5 in our experiments. The pixels producing the smallest reconstruction errors are selected as candidate locations for the snaxel vi.
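A minimal sketch of this candidate selection, assuming the eigen-snaxel basis (mean patch and the M principal eigenvectors) has already been computed; all names are our own illustration, not the authors' code.

```python
import numpy as np

def reconstruction_error(patch, mean, eig_M):
    """Distance-from-feature-space of Eq. (7) for one 16x16 patch.
    mean: mean training patch (flattened, length 256);
    eig_M: (256, M) matrix with the M principal eigen-snaxels (orthonormal columns)."""
    x_tilde = patch.ravel().astype(float) - mean
    y = eig_M.T @ x_tilde                      # principal component features
    return float(x_tilde @ x_tilde - y @ y)    # ||x~||^2 - sum_{i<=M} y_i^2

def select_candidates(image, search_positions, mean, eig_M, n_candidates=5):
    """Return the search positions whose surrounding 16x16 patch is best
    reconstructed by the eigen-snaxel subspace (smallest error).
    Assumes every (row, col) position is at least 8 pixels from the border."""
    errors = []
    for (r, c) in search_positions:
        patch = image[r - 8:r + 8, c - 8:c + 8]
        errors.append(reconstruction_error(patch, mean, eig_M))
    order = np.argsort(errors)
    return [search_positions[k] for k in order[:n_candidates]]
```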

    2.3.2. Snake energy


    External energy

    A new term is added to the external Energy function in Eq (3) to take into account the

    result obtained in the PCA process. The energy will be lower in those positions where

    the reconstruction error is lower.

$$E_{ext}(v_i) = \alpha \, E_{cont}(v_i, v_{i+1}) + (1 - \alpha)\, \epsilon^2(v_i) \qquad (8)$$

In our experiments the value of α has been set to 0.5.

    Internal energy

    As mentioned, the Energy term defined in (2) is only appropriate for snaxels that are not

    corners of the contour. In the case of corners the energy has to be low when the second

    derivative is high. We use, for the energy of the snaxels belonging to the eyebrows and

    the mouth:

$$E_{int}(v_i) = \gamma_i \left| v_{i-1} - 2 v_i + v_{i+1} \right|^2 + (1 - \gamma_i)\left[ B - \left| v_{i-1} - 2 v_i + v_{i+1} \right|^2 \right] \qquad (9)$$

where γ_i is set to 1 if v_i is not a corner point and to 0 if it is (note that we work with a fixed topology for the snakes). B represents the maximum value that the approximation of the second derivative can take.
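Putting Eqs. (8) and (9) together, the per-snaxel energy used during initialisation might look like the following sketch (our own reading of the equations, not the authors' code; α = 0.5, and the corner flag and the bound B are inputs assumed known from the fixed snake topology).

```python
import numpy as np

ALPHA = 0.5  # weight between the gradient term and the PCA reconstruction error

def detection_energy(v_prev, v, v_next, is_corner, B, grad_term, recon_error):
    """Combined internal (Eq. 9) + external (Eq. 8) energy for one snaxel.
    grad_term: E_cont(v, v_next) computed along the segment (lower on edges).
    recon_error: eps^2(v), the eigen-snaxel reconstruction error at v."""
    curvature = float(np.sum((v_prev - 2.0 * v + v_next) ** 2))
    gamma = 0.0 if is_corner else 1.0
    e_int = gamma * curvature + (1.0 - gamma) * (B - curvature)
    e_ext = ALPHA * grad_term + (1.0 - ALPHA) * recon_error
    return e_int + e_ext
```

This is the kind of `energy` callable that the DP sketch in section 2.2.3 would minimise over the PCA-selected candidates.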

    2.4 FEATURE POINT TRACKING

    2.4.1. Selection of candidates

    In the initialisation of the snake, the candidates for every snaxel were selected on the

    basis of a statistical characterization of the texture around them. However, in the case of


    tracking, we have a better knowledge of this texture if we use the previous frame. We

    propose to find the candidates for every snaxel in the tracking process using motion

    estimation in order to select the search space for every snaxel.

    A small region around every snaxel is selected as basis for the motion estimation. The

    shape of this region is rectangular and its size is application dependent. However, the

    region should be small enough so that its motion can be approximated by a translational

    motion. The motion compensation error (MCE) for all the possible displacements

    (dx,dy) of the block in a given range is computed as:

$$MCE_{v_{i0}}(dx, dy) = \sum_{i=-R_x}^{R_x} \sum_{j=-R_y}^{R_y} \left[ I_{t-1}(x_0 + i,\; y_0 + j) - I_t(x_0 + dx + i,\; y_0 + dy + j) \right]^2 \qquad (10)$$

where (x0, y0) are the x and y coordinates of the snaxel vi in the previous frame, which we have called vi0. The region under consideration is centred at the snaxel and has size 2Rx in the horizontal dimension and 2Ry in the vertical dimension.

The range for (dx, dy) determines the maximum displacement that a snaxel can undergo. The M positions with the minimum MCE_vi0(dx, dy) are selected as possible new locations for snaxel vi.
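A compact sketch of this candidate selection by block matching (illustrative only; the window size, search range and number of candidates are assumptions):

```python
import numpy as np

def tracking_candidates(prev_frame, cur_frame, x0, y0, Rx=8, Ry=8, search=6, M=5):
    """Select the M displacements (dx, dy) with the smallest motion compensation
    error of Eq. (10) for the block centred on snaxel position (x0, y0) in the
    previous frame.  Returns candidate (x, y) positions in the current frame.
    Assumes the block and the search window stay inside both frames."""
    block = prev_frame[y0 - Ry:y0 + Ry + 1, x0 - Rx:x0 + Rx + 1].astype(float)
    scored = []
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            shifted = cur_frame[y0 + dy - Ry:y0 + dy + Ry + 1,
                                x0 + dx - Rx:x0 + dx + Rx + 1].astype(float)
            mce = float(np.sum((block - shifted) ** 2))
            scored.append((mce, (x0 + dx, y0 + dy)))
    scored.sort(key=lambda t: t[0])
    return [pos for _, pos in scored[:M]]
```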

    2.4.2 Snake energy

    External energy

    We use for the external energy in the tracking procedure the same principle as for the

    initialisation. That is, it will be composed, as in (8), of two terms. The first one is the

    gradient along the contour, but the second one is slightly different, as in this case we

    can use the texture provided by the previous frame instead of the statistical

    characterization. Thus, the second term corresponds to the compensation error obtained

    in the motion estimation. In this way preference is given to those positions with the


    smaller compensation error. That is, the energy will be lower in those positions whose

    texture is most similar to the texture around the position of the corresponding snaxel in

    the previous frame. Therefore, the expression for the external energy will be:

$$E_{ext}(v_i) = \alpha \, E_{cont}(v_i, v_{i+1}) + (1 - \alpha)\, MCE_{v_{i0}}(v_i) \qquad (11)$$

The constant α is also set to 0.5.

    Internal energy

The internal energy we use to track feature points has the same formulation as in section 2.3.2. As before, we assume here that we know the geometry of the different face features to correctly set γ_i to 1 or 0 in Eq. (9).

    To solve problems of snaxel grouping we also add, as in [18], another term to the

    internal energy that forces snaxels to preserve the distance between them from one

    frame to the next one:

$$E_{dist,i} = \left| u_i^{t+1} - u_i^{t} \right| + \left| u_{i+1}^{t+1} - u_{i+1}^{t} \right| \qquad (12)$$

where $u_i^t = v_i - v_{i-1}$ in frame t.

    So when the distance is altered the energy increases proportionally.
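A small sketch of this distance-preservation term (our own reading of Eq. (12); snaxel arrays are assumed to be ordered consistently between frames, and i is an interior snaxel index):

```python
import numpy as np

def distance_energy(prev_snake, cur_snake, i):
    """Eq. (12): penalise changes of the vectors between snaxel i and its
    neighbours from frame t (prev_snake) to frame t+1 (cur_snake)."""
    u_prev = prev_snake[i] - prev_snake[i - 1]       # u_i in frame t
    u_cur = cur_snake[i] - cur_snake[i - 1]          # u_i in frame t+1
    w_prev = prev_snake[i + 1] - prev_snake[i]       # u_{i+1} in frame t
    w_cur = cur_snake[i + 1] - cur_snake[i]          # u_{i+1} in frame t+1
    return float(np.linalg.norm(u_cur - u_prev) + np.linalg.norm(w_cur - w_prev))
```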

    2.5 RESULTS

    In Figures 3 and 4 we show some examples of automatic initialisation and tracking. To

    perform the tests, the algorithms have been introduced in a Graphical User Interface

    (GUI) that allows manual correction of the snaxels position. In the tests performed, the


    contours obtained with the automatic initialisation always correspond to the facial

    features, if the face has been located accurately. However, as shown in Figure 4,

    sometimes some points should be manually modified if more accuracy is required. The

    tracking has been performed on 276 sequences from the Cohn-Kanade facial expression

database [15]. From these sequences, 180 needed no manual correction, 48 needed corrections of 1, 2 or 3 points along the sequence, and 40 sequences had one major problem in at least one frame for the tracking of one feature.

From the initialization, the Facial Animation Parameter Units (FAPUs) are computed. This means

    that the first frame needs to present a neutral expression. Then, from the output of the

    tracking, the FAPs for the sequence are extracted. These FAPs are going to be used in

    the second part of the system for the expression recognition.


    Figure 3. First row: Initialization examples. Second and third row: tracking examples.

    Figure 4. Automatic initialisation examples with minor errors

    3. HIGH LEVEL ANALYSIS

    The second part of the system is based on the modeling of the expressions by means of

    Hidden Markov Models. The observations to model are the MPEG-4 standardized

    Facial Animation Parameters (FAPs) computed in the previous step. The FAPs of a

    video sequence are first extracted and then analyzed using semi-continuous HMM.

    The Cohn-Kanade facial expression database [15] has been selected as basis for doing

    the training of the models and the evaluation of the system. The whole database has

    been processed in order to extract a subset of the Low Level FAPs and perform the

    expression recognition experiments. Different experiments have been carried out to


    determine the best topology of the HMM to recognize expressions from the Low Level

    FAPs.

HMM are one of the basic probabilistic tools used for time series modelling. They are by far the most widely used model in speech recognition, and they are beginning to be used in vision, mainly because they can be learned from data and they implicitly handle time-varying signals by dynamic time warping.

    classify the mouth shape in video sequences [22] and to combine different information

    sources (optical flow, point tracking and furrow detection) for expression intensity

    estimation [19]. HMM were also used in [29] for expression recognition. In this work a

    spatio-temporal dependent shape/texture eigenspace is learned in a HMM like structure.

    We will extend the use of HMM creating the feature vectors from the available low-

    level FAP. In the following sections we review the basic concepts of HMM (3.1), define

    the topology of the models that we will use (3.2) and explain how we estimate the

    parameters of these models (3.3).

This first approach tries to recognize isolated expressions. That is, we take a sequence which contains the transition from a neutral face to a given expression and we have to decide which expression it is. All the sequences from the Cohn-Kanade facial

    expression database belong to this type. We will evaluate the system using this

    approach in section 3.4. In section 3.5 we will introduce a new model which will allow

    us to use the system in more complex environments. It will model the parts of the

    sequences where the person is talking. Finally, in section 3.6 a more complex

    experiment is presented, where the system classifies the different parts of the sequences

    using a one-stage dynamic programming algorithm.

    3.1 HIDDEN MARKOV MODELS DEFINITION


    HMM model temporal series assuming a (hidden) temporal structure. For each emotion

    a HMM can be estimated from training examples. Once these models are trained we

    will be able to compute, for a given sequence of observations (the FAPs of the video

    sequence in our case), the probability that this sequence was produced by each of the

    models. Thus, this sequence of observations will be assigned to the expression whose

HMM gives the highest probability of generating it.

    The HMM [27] are defined by a number N of states connected by transitions. The

outputs of these states are the observations, which belong to a set of symbols. The time

    variation is associated with transitions between states.

To completely define the models, the following parameters are used:

- The set of states S = (s_1, s_2, ..., s_N).

- The set of symbols that can be observed from the states, (O_1, ..., O_M); in our case, the FAPs from each frame.

- The probability of transition from state i to state j, defined by the transition matrix A, where a_ij = p(q_{t+1} = s_j | q_t = s_i), 1 ≤ i, j ≤ N.

- The probability distribution function of the observations in state j, B, where b_j(O_i) = P(O_i | q = s_j), 1 ≤ j ≤ N.

- The initial state probabilities π_i = p(q_1 = s_i), 1 ≤ i ≤ N.

To train an HMM for an emotion (that is, to adjust its parameters), we will use as training

    sequences all those sequences that we select as representatives of the given emotion. In

    our case, we use all those sequences from the Cohn-Kanade facial expression database

    manually classified as belonging to this emotion.
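To make the classification criterion concrete, the sketch below (a generic log-domain forward algorithm, not the authors' implementation; it assumes each frame's per-state observation log-likelihoods are already available, e.g. from the mixture model of section 3.3) scores a FAP sequence against each emotion model and picks the most likely one.

```python
import numpy as np

def log_likelihood(log_A, log_pi, log_b):
    """Forward algorithm in the log domain.
    log_A:  (N, N) log transition matrix.
    log_pi: (N,) log initial state probabilities.
    log_b:  (T, N) log observation likelihoods log b_j(O_t) per frame and state.
    Returns log p(O_1 .. O_T | model)."""
    log_alpha = log_pi + log_b[0]
    for t in range(1, log_b.shape[0]):
        # log-sum-exp over previous states for each current state j
        log_alpha = np.logaddexp.reduce(log_alpha[:, None] + log_A, axis=0) + log_b[t]
    return float(np.logaddexp.reduce(log_alpha))

def classify(emotions, model_params):
    """emotions: list of emotion names; model_params[k] = (log_A, log_pi, log_b)
    for emotion k, with log_b evaluated on the same FAP sequence."""
    scores = [log_likelihood(*p) for p in model_params]
    return emotions[int(np.argmax(scores))]
```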


    3.2 SELECTION OF THE HMM TOPOLOGY

The Baum-Welch algorithm [27] can estimate the values of A, B and π. However, better

    results can be obtained if the topology of the Markov chain is restricted using a priori

    knowledge. In our case, each considered emotion (sad, anger, fear, joy, disgust and

    surprise) reflects a temporal structure: let's name it start, middle and end of the emotion.

This structure can be modeled using left-to-right HMM. This topology is appropriate for signals whose properties change over time in a successive manner. It implies a_ij = 0 if i > j, and π_i = 1 for i = 1, π_i = 0 for i ≠ 1. As time increases, the state generating the observations in each sequence either stays the same or advances in a successive manner. In our case, we have defined for each emotion a three-state HMM.

To select this topology, different configurations have been tested. The experiments show that using HMM with only 2 states, the recognition rate is 4% lower. Using more than 3

    states increases the complexity of the model without producing any improvement in the

    recognition results. The selected topology is represented in the following figure.

    Figure 5. Topology of Hidden Markov Models
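The constraints of this topology can be written directly as a transition-matrix template; the sketch below (illustrative values, not the trained parameters) shows a 3-state left-to-right structure with π = (1, 0, 0) and no backward transitions.

```python
import numpy as np

# 3-state left-to-right HMM: a_ij = 0 for j < i, start always in state 1.
# The non-zero values are just a uniform starting point; Baum-Welch
# re-estimates them from the training sequences.  (A skip transition
# a_13 could also be allowed; it is set to 0 here for simplicity.)
pi = np.array([1.0, 0.0, 0.0])
A = np.array([
    [0.5, 0.5, 0.0],   # start  -> start / middle
    [0.0, 0.5, 0.5],   # middle -> middle / end
    [0.0, 0.0, 1.0],   # end    -> end
])
```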

    3.3 ESTIMATION OF THE OBSERVATION PROBABILITIES

    Each state of each emotion models the observed FAPs using a probability function,

    bj(O) which can be either a discrete or a continuous one. In the first case, each set of


FAPs for a given image has to be discretized using vector quantization. In the

    continuous case, a parametric pdf is defined for each state. Typically, multivariate

Gaussian or Laplacian pdf's are used. In the case of sparse data, as is our case, better results can be achieved if all the states of all the emotion models share the

    Gaussian mixtures (mean and variance). Then, the estimation of the pdf's for each state

    is reduced to the estimation of the contribution of each mixture to the pdf.

Each set of FAPs has been divided into two subsets corresponding to eyebrows and

    mouth, following the MPEG4 classification. At each frame, the probability of the whole

    set of FAPs is computed as the product of the probability of each subset (independence

    assumption). For each of these subsets, a set of 32 Gaussian mixtures has been

    estimated (mean and variance) applying a clustering algorithm on the FAPs of the

    training database. In order to select the number of Gaussian mixtures, different

experiments have been performed. Using fewer than 16 mixtures produces a recognition rate more than 10% lower than that obtained with 16 mixtures. The use of 32 mixtures

    produces a slightly better recognition rate than 16, while more than 32 do not produce

    any improvement.
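The shared-mixture (semi-continuous) observation model described above can be sketched as follows: a single pool of Gaussians is estimated once by clustering the training FAPs, and each state only learns its own mixture weights. The code is our own illustration of this idea with hypothetical names; the eyebrow/mouth split enters as a product of two such models (independence assumption).

```python
import numpy as np

def log_gaussian_diag(o, means, variances):
    """Log density of observation o under each shared diagonal Gaussian.
    means, variances: (K, D) shared pool of K Gaussians; o: (D,)."""
    diff2 = (o[None, :] - means) ** 2 / variances
    return -0.5 * np.sum(diff2 + np.log(2 * np.pi * variances), axis=1)

def log_state_likelihood(o, weights_j, means, variances):
    """Semi-continuous b_j(o): mixture of the shared Gaussians with
    state-specific weights weights_j (K,), summing to 1."""
    log_comp = np.log(weights_j + 1e-12) + log_gaussian_diag(o, means, variances)
    return float(np.logaddexp.reduce(log_comp))

def log_frame_likelihood(o_eyebrows, o_mouth, state_params):
    """Product (sum of logs) of the eyebrow and mouth subsets, as in the paper."""
    (w_eb, mu_eb, var_eb), (w_m, mu_m, var_m) = state_params
    return (log_state_likelihood(o_eyebrows, w_eb, mu_eb, var_eb)
            + log_state_likelihood(o_mouth, w_m, mu_m, var_m))
```

The per-frame values produced by `log_frame_likelihood` correspond to the log b_j(O_t) entries consumed by the forward-algorithm sketch in section 3.1.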

    3.4 EVALUATION OF THE SYSTEM

To evaluate the system we use the Cohn-Kanade facial expression database, which consists of recordings of 90 subjects, each one with several basic expressions. The recording was done with a standard VHS video camera at a rate of 30 frames per second, with constant illumination, and only full-face frontal views were captured.

    Although the subjects were not previously trained in displaying facial expressions, they

    practiced the expressions with an expert prior to video recording. Each posed expression


    begins from neutral and ends at peak expression. The number of sequences for each

    expression is the following:

           F    Sa   Su   J    A    D
# Seq      33   52   60   61   33   37

    Table 2. Number of available sequences for each expression (F: Fear, Sa: Sad, Su:

    Surprise, J: Joy, A: Anger, D: Disgust).

    The FAPs of these sequences were first extracted using the FAP extraction tool

    described in Section 2. The system will train, for every defined facial expression (sad,

    anger, fear, joy, disgust and surprise), a HMM with the extracted FAPs. In the

    recognition phase, the HMM system computes the maximum probability of the input

    sequence with respect to all the models and assigns to the sequence the expression with

    the highest probability.

    The system has been evaluated by first training the six models with all the subjects

    except one, and then testing the recognition with this subject which has not participated

    in the training. This process has been repeated for all the subjects, obtaining the

    following recognition results:

        F    Sa   Su   J    A    D    %Cor
F      26    3    0    3    0    1    78,7%
Sa      8   31    2    2    6    3    59,6%
Su      0    0   60    0    0    0    100%
J       4    0    0   57    0    0    93,4%
A       0    2    0    1   24    6    72,7%
D       0    0    0    0    1   36    97,3%

    Table 3. Number of sequences correctly detected and number of confusions.

    The overall recognition rate obtained is 84%.

These results are consistent with our expectations: surprise, joy and disgust have the highest recognition rates, because they involve clear motion of the mouth or the eyebrows, while sadness, anger and fear are more easily confused. Experiments have been


    repeated involving only three emotions, obtaining 98% recognition rate when joy,

    surprise and anger are involved, and 95% with joy, surprise and sadness.

    We have also studied which one of the facial features conveys more information about

    the facial expressions, by obtaining the recognition results using only the FAPs related

    with the motion of the eyebrows, or only the FAPs related with the motion of the

    mouth. The overall recognition rate in the first case is 50%, while in the second it is

    78%. This result shows that the mouth conveys most of the information, although the

    information of the eyebrows helps to improve the results.

Although it is difficult to compare, due to the use of different databases, the recognition rates obtained using all the extracted information are of the same order as

    those obtained by other systems which use dense motion fields combined with

    physically-based models, showing that the MPEG4 FAPs convey the necessary

    information for the extraction of these emotions.

    3.5 INTRODUCTION OF THE TALK MODEL

    An important objective of this work is to use sequences that represent a conversation

    with an agent, in order to consider a real situation where the system could be applied.

    The system developed should be able to classify expressions in silence frames as well as

    to detect important clues in speech frames. A first step has been to work on silence

    frames. However, we also want to separate the motion due to the expressions from the

    motion due to speech. One possibility is to combine audio analysis with video analysis.

The other possibility, which we have developed, consists in the generation of a new model, in addition to those of the six emotions, to classify the speech sequences.

In order to train this additional model, which we have designated as talk, we have selected 28 sequences that contain speech from the available data sequences. A few more


    sequences corresponding to different expressions have also been added to our dataset.

These sequences have been processed with the FAP extraction tool, in the same way as the sequences previously used for the emotion models. Then, the recognition tests were repeated including these sequences. The performance of the recognition algorithm in this case is shown in the next table. The overall recognition rate is 81%, showing that it is possible to distinguish talk from emotion with this methodology.

        F    Sa   Su   J    A    D    T    %Cor
F      27    2    1    3    0    1    0    79,4%
Sa      6   29    3    2    7    1    5    54,7%
Su      0    0   64    0    0    0    0    100%
J       4    0    1   57    0    0    0    91,9%
A       0    1    0    2   22    7    2    64,7%
D       1    0    0    0    1   34    0    91,9%
T       0    3    0    0    4    0   21    75,0%

Table 4. Number of sequences correctly detected and number of confusions when adding the talk (T) sequences. The overall recognition rate obtained is 81%.

    3.6 CONNECTED SEQUENCES RECOGNITION

    As discussed before, the first experiments performed were oriented to the extraction of

    an expression from the input sequence. However, in real situations, we will have a

    continuous video sequence where the person is talking or just looking at the camera and,

    at a given point, an expression will be performed by the person. So, we want to explore

    the capability of the system to separate, from a video sequence, different parts where

    different emotions occur.

    In this case the emotions can be decoded using the same HMM by means of a one-stage

    dynamic programming algorithm. This algorithm is widely used in connected and

continuous speech recognition. The basic idea is to allow, during the decoding, a transition from the final state of each HMM (associated with each emotion) to the first state of the other HMMs. This can be interpreted as a large HMM composed from the


emotion HMMs. The transitions along this large HMM indicate both the decoded emotions and the frames associated with each emotion. In order to recover the transitions, the algorithm needs to save some backtracking information, as is usually the case in dynamic programming.
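One way to picture the composite model is to stack the per-emotion (and talk) HMMs into a single large transition matrix, with loop-back transitions from each model's final state to every model's entry state, and then run one Viterbi pass over it. The sketch below builds only that composite transition structure; it is our own illustration (the inter-model transition probability p_switch and the choice of entry state are assumptions), not the authors' one-stage DP code.

```python
import numpy as np

def build_composite(As, pis, p_switch=0.05):
    """As: list of (n_k, n_k) transition matrices, one per emotion/talk model.
    pis: list of (n_k,) initial distributions.  Returns the transition matrix
    and initial distribution of the large connected-recognition HMM."""
    sizes = [A.shape[0] for A in As]
    offsets = np.cumsum([0] + sizes[:-1])
    total = sum(sizes)
    big_A = np.zeros((total, total))
    big_pi = np.zeros(total)
    for A, pi, off in zip(As, pis, offsets):
        n = A.shape[0]
        big_A[off:off + n, off:off + n] = A          # block-diagonal copy of each model
        big_pi[off:off + n] = pi / len(As)           # any model may start the sequence
    for A, off in zip(As, offsets):
        last = off + A.shape[0] - 1                  # final state of this model
        big_A[last, :] *= (1.0 - p_switch)           # keep most mass inside the model
        for pi2, off2 in zip(pis, offsets):
            entry = off2 + int(np.argmax(pi2))       # entry state of the target model
            big_A[last, entry] += p_switch / len(As)
    return big_A, big_pi
```

A standard Viterbi decoding over (big_A, big_pi) then yields both the sequence of emotion labels and, through backtracking, the frame boundaries between them.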

    Because of the lack of a more appropriate database, the connected recognition algorithm

    has been tested by concatenating sequences of six FAP files extracted from the

emotions and talk sequences. Results are given in the next table. The recognition rate for these sequences is 64%, showing lower performance than isolated sequence recognition.

        F    Sa   Su   J    A    D    T    %Cor
F      25    4    1    3    0    1    0    73,5%
Sa      5   25    4    2    8    2    7    47,1%
Su      2    2   37    0    6    4    9    57,8%
J       2    2    1   56    0    0    0    91,8%
A       0    2    1    2   16    4    8    47,0%
D       2    0    9    0    3   19    3    51,3%
T       1    8    4    3    6    1   56    70,9%

    Table 5. Number of sequences correctly detected and number of confusions in the

    connected sequences experiment. The overall recognition rate obtained is 64%.

    4. CONCLUSIONS

    We have presented a video analysis system that aims at expression recognition in two

    steps. In the first step, the Low Level Facial Animation Parameters are extracted. The

    second step analyses these FAPs by means of HMM to estimate the expressions. For

    training purposes, a supervised active contour based algorithm is applied to extract these

Low Level FAPs from a facial expression database. This algorithm introduces a criterion for the selection of the candidate snaxels in a dynamic programming implementation of the active contours algorithm. Besides, an additional term is introduced in the external energy, related to the specific texture around the snaxels, in order to prevent the contour from falling on stronger edges that might lie near the one we are aiming at. This technique allows us to extract the Low Level FAPs that we will use for training the recognition system. Once the system has been trained, the recognition procedure can be applied to the FAPs extracted by any other method. The results obtained show that the FAPs convey the necessary information for the extraction of emotions. The system shows

    good performance for distinguishing isolated expressions and can also be used, with

    lower accuracy, to extract the expressions in long video sequences where speech is

    mixed with silence frames.

    5. REFERENCES

    [1] J. Ahlberg, An Active Model for Facial Feature Tracking. Accepted for publication in

    EURASIP Journal on Applied Signal processing, to appear in June 2002.

    [2] A. Amini, T. Weymouth and R. Jain, Using Dynamic Programming for Solving

Variational Problems in Vision, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 12, No. 9, September 1990.

    [3] M. Black, Y. Yacoob, Recognizing facial expressions in image sequences using local

    parameterized models of image motion, Int. Journal of Computer Vision, 25 (1), 1997,

    23-48.

    [4] J. Cassell, Embodied Conversation: Integrating Face and Gesture into Automatic Spoken

    Dialogue Systems, In Luperfoy (ed.), Spoken Dialogue Systems, Cambridge, MA: MIT

    Press.

[5] J. Cassell, K.R. Thorisson, The power of a nod and a glance: Envelope vs. Emotional

    Feedback in Animated Conversational Agents, Applied Artificial Intelligence 13: 519-

    538, 1999.

[6] T.F. Cootes, G.J. Edwards, C.J. Taylor, Active appearance models, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 23, No. 6, pp. 681-685, June 2001.


    [7] G.W. Cottrell and J. Metcalfe, EMPATH: Face, emotion, and gender recognition using

    holons, in Neural Information Processing Systems, vol.3, pp. 564-571, 1991.

    [8] D. DeCarlo, D. Metaxas, Adjusting shape parameters using model-based optical flow

residuals, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 6, pp. 814-823, June 2002.

    [9] P. Eisert, B. Girod, Analyzing Facial Expression for Virtual Conferencing, IEEE

    Computer Graphics and Applications, Vol. 18, N. 5, 1998.

    [10] P. Ekman and W. Friesen, Facial Action Coding System. Consulting Psychologists Press

    Inc., 577 College Avenue, Palo Alto, California 94306, 1978.

    [11] I. Essa and A. Pentland, Facial Expression Recognition using a Dynamic Model and

    Motion Energy, Proc. of the International Conference on Computer Vision 1995,

    Cambridge, MA, May 1995.

    [12] I. Essa, Analysis, Interpretation and Synthesis of Facial Expressions, Ph.D. Thesis,

    Massachusetts Institute of Technology (MIT Media Laboratory).

    [13] D. Geiger, A. Gupta, L. Costa and J. Vlontzos, Dynamic Programming for Detection,

    Tracking, and Matching Deformable Contours, IEEE Transactions on Pattern Analysis

    and Machine Intelligence, Vol. 17, No. 3, March 1995.

[14] S. Gunn and M. Nixon, Global and Local Active Contours for Head Boundary Extraction, International Journal of Computer Vision 30(1), pp. 43-54, 1998.

[15] T. Kanade, J.F. Cohn, Y. Tian, Comprehensive Database for Facial Expression

    Analysis, Proceedings of the Fourth IEEE International Conference on Automatic Face

and Gesture Recognition, Grenoble, France, 2000.

    [16] M. Kass, A. Witkin and D. Terzopoulos, Snakes: Active Contour Models, International

    Journal of Computer Vision, Vol. 1, No. 4, pp. 321-331, 1988.

    [17] C. Kervrann, F. Davoine, P. Perez, R. Forchheimer and C. Labit, Generalized likelihood

    ratio-based face detection and extraction of mouth features, Pattern Recognition letters

    18, pp. 899-912, 1997.

    [18] C. L. Lam, S. Y. Yuen, An unbiased active contour algorithm for object tracking,

    Pattern Recognition Letters 19, pp. 491-498, 1998.

[19] J. Lien, T. Kanade, J. Cohn, C. Li, Subtly Different Facial Expression Recognition and Expression Intensity Estimation, in Proc. of the IEEE Int. Conference on Computer Vision and Pattern Recognition, pp. 853-859, Santa Barbara, CA, June 1998.


    [20] B. Moghaddam, A. Pentland, Probabilistic Visual Learning for Object Detection, in 5th

International Conference on Computer Vision, Cambridge, MA, June 1995.

    [21] ISO/IEC MPEG-4 Part 2 (Visual)

    [22] N. Oliver, A. Pentland, F. Berard, LAFTER: A Real-time Lips and Face Tracker with

    Facial Expression Recognition, in Proc. of IEEE Conf. on Computer Vision, Puerto

    Rico, 1997.

[23] C. Padgett, G. Cottrell, Identifying emotion in static face images, in Proc. of the 2nd Joint

    Symposium on Neural Computation, Vol.5, pp.91-101, La Jolla, CA, University of

    California, San Diego.

    [24] M. Pantic , L. Rothkrantz, Automatic Analysis of Facial Expressions: The State of the

    Art, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, N. 12,

    pp. 1424-1445, 2000.

[25] M. Pardàs, E. Sayrol, Motion estimation based tracking of active contours, Pattern Recognition Letters, Vol. 22, No. 13, pp. 1447-1456, Elsevier Science, November 2001.

    [26] R.W. Picard, Affective Computing, MIT Press, Cambridge 1997.

    [27] L. Rabiner, A tutorial on Hidden Markov Models and selected applications in Speech

Recognition, Proceedings of the IEEE, pp. 257-284, February 1989.

    [28] M. Rosenblum, Y. Yacoob, L.S. Davis, Human Expression Recognition from Motion

    Using a Radial Basis Function Network Architecture, IEEE Trans. On Neural Networks, 7

    (5), 1121-1138, 1996.

    [29] F. de la Torre, Y. Yacoob, L. Davis, A probabilistic framework for rigid and non-rigid

    appearance based tracking and recognition, Int. Conf. on Automatic Face and Gesture

    Recognition, pp. 491-498, 2000.

[30] V. Vilaplana, F. Marqués, P. Salembier, L. Garrido, Region Based Segmentation and

    Tracking of Human Faces, Proceedings of European Signal Processing Conference

    (EUSIPCO-98), pp. 311-314, 1998.

[31] Y. Yacoob, L.S. Davis, Computing Spatio-Temporal Representation of Human Faces, IEEE Trans. on PAMI, 18 (6), pp. 636-642, 1996.

    [32] A. Young and H. Ellis (eds.), Handbook of Research on Face Processing, Elsevier

    Science Publishers 1989.