
Fully Automatic Upper Facial Action Recognition

Ashish Kapoor    Yuan Qi    Rosalind W. Picard

MIT Media Laboratory, Cambridge, MA 02139

Abstract

This paper provides a new fully automatic framework to analyze facial action units, the fundamental building blocks of facial expression enumerated in Paul Ekman's Facial Action Coding System (FACS). The action units examined in this paper include upper facial muscle movements such as inner eyebrow raise, eye widening, and so forth, which combine to form facial expressions. Although prior methods have obtained high recognition rates for recognizing facial action units, these methods either use manually pre-processed image sequences or require human specification of facial features; thus, they have exploited substantial human intervention. This paper presents a fully automatic method, requiring no such human specification. The system first robustly detects the pupils using an infrared-sensitive camera equipped with infrared LEDs. For each frame, the pupil positions are used to localize and normalize eye and eyebrow regions, which are analyzed using PCA to recover parameters that relate to the shape of the facial features. These parameters are used as input to classifiers based on Support Vector Machines to recognize upper facial action units and all possible combinations of them. On a completely natural dataset with many head movements, pose changes and occlusions, our framework achieved a recognition accuracy of 69.29% for each individual AU and an accuracy of 62.5% for all possible AU combinations. On the Cohn-Kanade AU-coded facial expression database, which has been previously used to evaluate facial action recognition systems, our framework achieves a high recognition accuracy for individual AUs and for all possible AU combinations.

1 Introduction

A very large percentage of our communication is nonverbal, and among these nonverbal cues a large fraction is in the form of facial actions. A system that could analyze facial actions in real time without any human intervention would have applications in a number of different fields: for example, computer vision, affective computing, computer graphics and psychology. Such a system would be an important component in a machine that is socially and emotionally intelligent and is expected to interact naturally with people.

While a lot of research has been directed towards systems that recognize faces corresponding to prototypic expressions like joy, anger and surprise, few approaches exist that try to recognize facial actions such as eye-squint and frown. Table 1 compares some of the previous facial expression analysis techniques. The state of the art systems have severe limitations, as they either require human intervention or do not recognize more than prototypic expressions. This paper describes a fully automatic framework that requires no manual intervention to analyze facial activity. The work is focused on recognizing upper action units (AUs), which are a subset of all the AUs enumerated in Paul Ekman's Facial Action Coding System (FACS) [9] and correspond to the regions of the eyes and eyebrows (Table 2). The Facial Action Coding System (FACS) developed by Ekman and Friesen [9] is a method of measuring facial activity in terms of facial muscle movements. FACS consists of over 45 distinct AUs, each corresponding to a distinct muscle or muscle group; they are essentially facial phonemes, which can be assembled to form facial expressions. Finally, most researchers have reported results on clean datasets, which are videos and images of the frontal face of subjects deliberately making facial actions in front of a camera. We evaluate our framework on a test dataset which is completely natural and therefore has many pose changes, head movements, occlusions and very subtle facial activity. To the best of our knowledge this is the only work that is evaluated on a completely natural database, demonstrating how computer vision and machine learning can be integrated to build real-world applications.

2 Previous Work

Researchers in the past have used a number of classification techniques to recognize action units and their combinations. Tian et al [18] have developed a system to recognize sixteen action units and any combination of those.



Table 1: Comparison of various face analysis systems

System                     Fully automatic   Recognizes more than prototypic expressions
Black & Yacoob, 1995 [2]   No                No
Essa et al, 1997 [10]      Yes               No
Tian et al, 2000 [18]      No                Yes

Table 2: The upper facial action units

AU number   Facial action
1           Inner brow raiser
2           Outer brow raiser
4           Brow lowerer
5           Upper eyelid raiser
7           Lid tightener

The shape of facial features like the eyes, eyebrows, mouth and cheeks is described by multistate templates. The parameters of these multistate templates are used by a Neural Network based classifier to recognize the action units. This system requires that the templates be initialized manually in the first frame of the sequence, which prevents it from being fully automatic. In an earlier work, Lien et al [15] describe a system that recognizes various action units based on dense flow, feature point tracking and edge extraction.

Donato et al [7] compared several techniques, which included optical flow, principal component analysis, independent component analysis, local feature analysis and Gabor wavelet representation, to recognize eight single action units and four action unit combinations using image sequences that were manually aligned and free of head motions. They showed 95.5% recognition accuracy using Independent Component Analysis and Gabor wavelet representations. They used a nearest neighbor classifier and template matching for the purpose of recognition. Each facial action combination that they try to recognize is treated as a separate AU. As there are a large number of AU combinations, modeling each AU combination separately is not appropriate. Bartlett et al [1] achieve 90.9% accuracy in recognizing 6 single action units by combining holistic facial analysis and optical flow with local feature analysis. Both of the above mentioned approaches [7, 1] report their results on manually pre-processed image sequences of individuals deliberately making facial actions in front of a camera.

[Figure 1: The overall system. An infrared camera with IR LEDs captures video; pupil tracking produces tracked pupils, which drive facial feature tracking to produce tracked templates; a pattern analyzer based on SVMs outputs the recognized action units (AU 1, 2, 4, 5, 6, 7).]

Cowie et al [6] describe a system to recognize facial expressions by identifying Facial Animation Parameter Units (FAPUs), defined in the MPEG-4 standard, by feature tracking of Facial Definition Parameter (FDP) points, also defined in the MPEG-4 framework. The system is not fully automatic and requires human assistance to accurately detect FDP points.

A lot of research has been directed at the problem of recognizing 5-7 classes of prototypic emotional expressions on groups of people from their facial expressions [10, 2, 19, 20]. Although prototypic expressions, like happy, surprise and fear, are natural, they occur infrequently in everyday life. A person might communicate more with subtle facial actions like frequent frowns or smiles. Further, there are emotions like confusion, boredom and frustration for which no prototypic expression might exist. Thus, a system that aims to be socially and emotionally intelligent needs to do more than just recognize prototypic expressions.

3 Overall Framework

Figure 1 gives an overview of the system. The red-eye effect [11] is a physiological property of the eye, and the first part of the system uses it to robustly track the pupils. Once the pupil positions are known, they are used to normalize the images and extract parameters that describe the facial features and can be used to recognize the facial actions. Finally, the upper facial action units are recognized using Support Vector Machines. A separate SVM classifier is trained for each facial action unit. The extracted parameters are used as input features to the support vector machines to detect the occurrence of facial actions. Since we use a separate classifier for each action unit, the system can detect action unit combinations.


[Figure 2: Pupil tracking using the infrared camera. Panels: the image captured by the IR camera; the de-interlaced sampled images when the on-axis LEDs are on and when they are off; and the difference image.]

3.1 Pupil Detection

The pupil detection system detects the pupils using the red-eye effect. The system's robustness to occlusions and head motions makes it ideal for automatic facial action analysis. As the pupil positions can be recovered very efficiently and robustly, it eliminates the need for manual labeling or pre-processing of the images, a required step that plagues a number of previous approaches.

Although the red-eye effect has been known for quite some time, it is in recent years that it has grabbed a lot of attention for vision applications. Morimoto et al [16] have described a system to detect and track pupils using the red-eye effect. Haro et al [11] have extended this system to detect and track the pupils using a Kalman filter and probabilistic PCA. We use an infrared camera equipped with infrared LEDs, which is used to highlight and track pupils and is an in-house built version of the IBM Blue Eyes camera (http://www.almaden.ibm.com/cs/blueeyes). The whole unit is placed under the monitor pointing towards the user's face. The system has an infrared sensitive camera coupled with two concentric rings of infrared LEDs. One set of LEDs is on the optical axis and produces the red-eye effect. The other set of LEDs, which are off axis, keeps the scene at about the same illumination. The two sets of LEDs are synchronized with the camera and are switched on and off to generate two interlaced images for a single frame. The image where the on-axis LEDs are on has white pupils, whereas the image where the off-axis LEDs are on has black pupils. These two images are subtracted to get a difference image, which is used to track the pupils. Figure 2 shows a sample image, the de-interlaced images and the difference image obtained using the system.
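As a concrete illustration of this differencing step, the following sketch assumes each captured frame arrives as a single interleaved grayscale array whose alternate scan lines come from the on-axis and off-axis illumination; the actual field order and camera interface are not specified here and may differ.

```python
import numpy as np

def pupil_difference_image(frame: np.ndarray) -> np.ndarray:
    """Split an interlaced frame into its two fields and subtract them.

    Assumes even scan lines were captured with the on-axis LEDs on
    (bright pupils) and odd scan lines with the off-axis LEDs on
    (dark pupils); swap the slices if the camera uses the opposite order.
    """
    on_axis = frame[0::2, :].astype(np.int16)    # field with white pupils
    off_axis = frame[1::2, :].astype(np.int16)   # field with dark pupils
    diff = on_axis - off_axis                    # pupils remain as bright blobs
    return np.clip(diff, 0, 255).astype(np.uint8)
```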

The pupils are detected and tracked using the difference image, which is noisy due to interlacing and motion artifacts. Also, objects like glasses and earrings can show up as bright spots in the difference image due to their specularity. To remove this noise we first threshold the difference image using an adaptive thresholding algorithm [11]. First, the algorithm computes the histogram and then thresholds the image, keeping only the brightest 0.1% of pixels. All the non-zero pixels in the resulting image are set to 255 (maxval). The thresholded image is used to detect and to track the pupils. More details on the pupil tracking system can be found in [14].

3.2 Feature Extraction

For the purpose of facial action analysis, we need to track the facial features robustly and efficiently. Also, rather than just tracking the positions of facial features, we need to recover the parameters that drive the shape of each feature. The appearance of facial features varies due to pose, lighting, facial expressions, etc., making the task difficult and complex. Even harder is the task of tracking the facial features robustly in real time, without any manual alignment or calibration. Many previous approaches have focused just on tracking the location of the facial features or require some manual initialization/intervention. In this section, we describe how we can robustly recover the shape of facial features in detail using templates in real time without requiring any manual intervention.

Our system exploits the fact that it can estimate the location of the pupils very robustly in the image. Once the pupils are located, regions of interest corresponding to the eyes and eyebrows are cropped out and analyzed to recover the shape description. For the purpose of facial action analysis, the fiducial points of the templates describing the eyes and eyebrows are considered as shape parameters. Our goal is then to recover these fiducial points in a new image. Assume that we have a training set of image vectors {I_1, ..., I_n}, where each image vector I_i is pre-annotated with a corresponding vector of shape parameters s_i. For the purpose of facial action analysis, the images I_i are cropped images of eyes and eyebrows, and the vector of shape parameters s_i is a stack of the x, y coordinates of the fiducial points.

To recover the shape parameters in a test image, say I_test, a very naive approach would be to find the image I_closest from the training set of pre-annotated images that most resembles I_test. The shape parameters of I_test could then be approximated by the shape parameters s_closest, which correspond to I_closest. This approach cannot generalize well, as there can be only a finite number of example images in the training database. A more general approach is to represent the test image as a linear combination of example images. The same linear combination can then be applied to the corresponding shape parameters of the example images to recover the shape in the new image.


Principal component analysis (PCA) can be used to find the representation of the test image in terms of the linear combination of example images. Given n example images I_i, let s_i (i = 1, ..., n) be the vectors corresponding to the marked control points on each image. If Ī is the mean image, then the covariance matrix of the training images can be expressed as:

    C = X X^T,   where X = [I_1 − Ī, I_2 − Ī, ..., I_n − Ī]

The eigenvectors of C can be computed by first computing the eigenvectors of X^T X. If V = [v_1, v_2, ..., v_n], where the v_i are the eigenvectors of X^T X, then the eigenvectors e_i of C can be computed as:

    E = [e_1, e_2, ..., e_n] = X V

As the eigenvectors are expressed as a linear combination of example images, we can express the shape parameters corresponding to the eigen-images using the same linear combination. Let s̄ be the mean of the vectors corresponding to the control points in the example images and let S = [s_1 − s̄, ..., s_n − s̄] be the matrix of unbiased shape parameters. Then the shape parameters ŝ_i (i = 1, ..., n) corresponding to an eigenvector e_i can be computed as:

    [ŝ_1, ŝ_2, ..., ŝ_n] = S V

To recover the shape parameters in the test image, we first express the new image as a linear combination of the eigenvectors by projecting it onto the top few eigenvectors:

    I_new = Σ_i a_i e_i + Ī        (1)

where a_i = (I_new − Ī)^T e_i and e_i is the i-th eigenvector. The same linear combination is applied to the shape parameters of the corresponding eigenvectors to recover the new shape:

    s_new = Σ_i a_i ŝ_i + s̄        (2)

In brief, the training set is first processed offline to compute the required eigenvectors. During real-time tracking, the cropped images of the eyes and eyebrows are projected onto the corresponding top few eigenvectors. Experiments showed that the first 40 eigenvectors were good enough for the task, and in our implementation we use those. These projections are used to recover the control points as explained above. This approach worked particularly well on the subjects who had their images in the training database. This strategy is a simplification of the approach used by Covell et al [4, 5] and performs well for the purpose of facial feature tracking, particularly on the subjects who had images in the training set. Note that there is no initialization step, which was very critical in many template matching approaches. Further, the non-iterative nature of the approach makes it ideal for use in a real-time system.

This approach is very efficient, runs in real time at 30 fps, and is able to track the upper facial features robustly in the presence of large head motions and occlusions. One limitation of our implementation is that it is not invariant to large zooming in or out, as fixed-size images of the facial features are cropped. Further, our training set did not have samples with scale changes. Also, in a few cases with some new subjects, the system did not work well, as the training images were not able to span the whole range of variations in the appearance of the individuals. A training set which captures these variations in appearance should be able to overcome these problems.

Figure 3 shows tracking results for some sequences. The first subject appearing in Figure 3 was in the training database. The system is able to track the features very well. Note that in the first sequence of Figure 3 the left eyebrow is not tracked in frames 67, 70 and 75, as it is not present in the image. Similarly, all the templates are lost in frame 29 in the second sequence of Figure 3 when the pupils are absent, as the subject blinks. The templates are recovered as soon as the features reappear. The second sequence in Figure 3 shows the tracking results for subjects not in the training set. Again, note that the second frame in the first sequence does not show any eyes or eyebrows, due to the fact that the subject blinked and hence no pupils were detected. The tracking is recovered in the very next frame when the pupils are visible again.

Despite various advantages, this strategy has some shortcomings. It assumes a linear relationship between the image and the shape parameters, which might not always be the case. Also, it uses principal component analysis to recover the shape, hence it inherently assumes that the top eigenvectors capture the shape variations, which can be erroneous: there may be variations due to lighting which would contribute highly to the principal components. This strategy is a special case of a Bayesian estimation framework, and we can come up with a method to recover shapes that does not have these problems. More details can be found in [14, 13].

3.3 Action Unit Classification

Once the parameters are extracted, the next step is to identify the facial actions they correspond to. There are many classifiers that could be used for AU recognition. Support Vector Machines (SVMs) have been shown to perform well on a number of classification tasks. The SVM is an optimal discriminant method based on Bayesian learning theory and generalizes well. Classifiers based on the support vector machine (SVM) perform binary classification by first projecting the data points onto a linearly separable feature space and then using a hyperplane that is maximally separated from the nearest positive and negative data points.


[Figure 3: Upper facial feature tracking. First sequence: frames 57, 67, 70, 75, 88. Second sequence: frames 59, 60, 61, 68, 87.]

Mathematically, given a set of N training data points (x_i, y_i), where x_i ∈ R^d with corresponding label y_i ∈ {+1, −1}, the support vector machine classifies a data point x using

    f(x) = Σ_{i=1}^{N} α_i y_i k(x, x_i) + b

Here k(x, x_i) is a positive definite kernel function and specifies an inner product between x and x_i in the linearly separable feature space. The x_i corresponding to non-zero α_i are the support vectors. The α_i and the bias b can be obtained by solving an optimization problem. For the purpose of classifying facial actions, x is the vector of relevant shape parameters and the sign of f(x) determines whether an AU has been recognized or not.
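A direct transcription of this decision function, as a sketch only: the support vectors x_i, weights α_i, labels y_i and bias b would come from whatever SVM trainer is used, and the RBF kernel here is just one admissible choice of positive definite kernel.

```python
import numpy as np

def rbf_kernel(x, xi, gamma=1.0):
    """One possible positive definite kernel k(x, x_i); the text does not fix a choice."""
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(xi)) ** 2))

def svm_decision(x, support_vectors, alphas, labels, bias, kernel=rbf_kernel):
    """f(x) = sum_i alpha_i * y_i * k(x, x_i) + b; sign(f(x)) says whether the AU is present."""
    return sum(a * y * kernel(x, xi)
               for a, y, xi in zip(alphas, labels, support_vectors)) + bias
```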

There are over 40 different facial action units enumerated in FACS [9], and more than 7,000 different AU combinations have been observed. A system that aims to analyze faces should recognize not only single AUs but combinations of AUs as well. AU combinations can be additive or non-additive. The appearance of the individual AUs does not change when an additive combination of AUs occurs, whereas in non-additive combinations the appearance of the individual AUs does change. In our system a separate SVM for each AU is trained using examples. We have used Cawley's SVM toolbox [3] to train the SVMs to separate the facial feature parameters that correspond to an occurrence of a particular AU from the ones that don't. During the recognition phase, the facial feature parameters in each frame are first normalized to account for variability due to changes in pose, zoom and personal variations. These normalized parameters are then subtracted from the normalized parameters corresponding to a neutral frame. This difference is used as the input features to the SVMs to figure out which AUs are present. Also, rather than using all the shape parameters, we use only those that are most indicative of the action unit that we are trying to recognize. We use the parameters that describe the eyebrows to recognize AUs 1, 2 and 4, and the eye parameters for AUs 5 and 7. The next section describes the evaluation of the system and the results.
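The per-AU setup could be sketched as below, using scikit-learn's SVC as a stand-in for the MATLAB toolbox the paper actually uses; the feature grouping (eyebrow parameters for AUs 1, 2, 4 and eye parameters for AUs 5, 7) and the neutral-frame subtraction follow the text, while the index slices, kernel and data layout are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

# Which normalized shape parameters feed each AU classifier (slices are hypothetical).
GROUP_SLICES = {"brows": slice(0, 16), "eyes": slice(16, 32)}
AU_GROUPS = {1: "brows", 2: "brows", 4: "brows", 5: "eyes", 7: "eyes"}

def au_features(params, neutral, au):
    """Normalized shape parameters minus the subject's neutral-frame parameters."""
    sl = GROUP_SLICES[AU_GROUPS[au]]
    return params[sl] - neutral[sl]

def train_au_classifiers(frame_params, frame_aus, neutral):
    """One binary SVM per AU; a frame may trigger several SVMs at once,
    which is how AU combinations are detected."""
    classifiers = {}
    for au in AU_GROUPS:
        X = np.array([au_features(p, neutral, au) for p in frame_params])
        y = np.array([au in aus for aus in frame_aus])   # True if the AU is present
        classifiers[au] = SVC(kernel="rbf").fit(X, y)
    return classifiers

def detect_aus(params, neutral, classifiers):
    return {au for au, clf in classifiers.items()
            if clf.predict(au_features(params, neutral, au).reshape(1, -1))[0]}
```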

4 Experimental Evaluation Results

The system was evaluated on two datasets. The first dataset is completely natural, with many head motions, occlusions and pose changes. The results are indicative of how a system like this would work in real-world applications. The second database we use is the Cohn-Kanade facial expression database [12], which has been previously used to test facial action recognition systems. The results on this database indicate how our results compare to the methods that have been trained and tested on the Cohn-Kanade database.

The natural facial action database has 8 children in a real learning situation. These children were asked to play a game called Fripple Place [8]. The game has a number of puzzles that require mathematical reasoning. Each kid worked on these puzzles for about 20 minutes. Videos of their faces were recorded by two cameras.


A vision camera was placed on top of the monitor and an IBM Blue Eyes camera was placed under the monitor. A FACS-trained expert coded the videos of the face for various action units, and 80 frames were selected from these FACS-coded videos of the kids. These frames were selected manually to ensure that there were an equal number of samples of the different facial action units from all the kids. The Cohn-Kanade database is a comprehensive database collected and coded by a team of researchers and consists of adults performing a series of facial expressions in front of a camera. Our training and testing database (CMU database) has video sequences of 25 individuals, where each video sequence starts with a neutral frame. The details of both datasets are shown in Tables 3 and 4.

Table 3: Details of instances of AUs in the dataset

Action Unit   # of instances in our database   # of instances in CMU database
1             35                               37
2             33                               27
4             16                               58
5             24                               21
7             13                               27
Neutral       19                               0
Total         140                              170

Table 4: Details of AU combinations in the dataset

AU Combination   # of samples in our database   # of samples in CMU database
1+2              12
1+2+5            19
1+2+7            2
1+4              2
4                10
5                5
7                7
4+7              4
Neutral          19
Total            80

The system is evaluated for recognition accuracy using leave-one-subject-out cross validation. The classifiers are trained using the data from all but one subject, reserving that subject for testing; this is repeated for all subjects in the database. The system could recognize each individual AU with an accuracy of 69.29%, whereas an accuracy of 62.5% was obtained for all AU combinations.
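This protocol is ordinary leave-one-group-out cross-validation with subjects as the groups; a sketch with scikit-learn, where the feature matrix, labels and subject ids are placeholders:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC

def leave_one_subject_out_accuracy(X, y, subject_ids):
    """Train on all subjects but one, test on the held-out subject, repeat for every subject."""
    correct = total = 0
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=subject_ids):
        clf = SVC(kernel="rbf").fit(X[train_idx], y[train_idx])
        correct += int(np.sum(clf.predict(X[test_idx]) == y[test_idx]))
        total += len(test_idx)
    return correct / total
```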

Table 5: Results for individual AUs in our database

Action Unit   # of Samples   Correct Recognition   Misses   % Correct Recognition
AU 1          35             26                    9        74.3%
AU 2          33             26                    7        78.8%
AU 4          16             9                     7        56.2%
AU 5          24             16                    8        66.7%
AU 7          13             6                     7        46.1%
Neutral       19             14                    5        73.3%
Total         140            97                    43       69.29%

Table 6: Results for individual AUs in the CMU database

Action Unit   # of Samples   Correct Recognition   Misses   % Correct Recognition
AU 1          37             27                    10       73.0%
AU 2          27             25                    2        92.6%
AU 4          58             47                    11       81.0%
AU 5          21             14                    7        66.7%
AU 7          27             27                    0        100.0%
Total         170            140                   30       82.35%

Table 5 shows how well each individual AU was recognized, and Table 7 shows how well each AU combination was recognized. Although the results might not sound exceedingly good, we need to keep in mind that these results are reported on a natural dataset; this set is very different from the datasets used to evaluate earlier systems. The videos have a lot of occlusion and head movement, which makes the problem much harder than on datasets where the images are pre-processed and manually normalized.

On the CMU database the system could recognize each individual AU with an accuracy of 82.35% and all AU combinations with an accuracy of TBD%. Tables 6 and 8 show the complete results. The results are comparable to results previously reported on the same database by Tian et al [17]. A lot of earlier work in face analysis reported very high recognition results, and at first glance the results reported here on the natural database might seem insignificant. But we have to keep in mind that most of the earlier work has focused on frontal video of the face shot in ideal conditions. Those systems were trained and tested at the apex of the emotional expression and required human intervention. Considering that an accuracy of 75% among human FACS coders is considered good, our system performance is comparable to that of humans.


Table 7: Results for AU combinations in our database

Actual AUs   # of Samples   Fully Recognized   Partially Recognized   % Fully Correct
1+2          12             9                  1                      75%
1+2+5        19             11                 3                      57.9%
1+2+7        2              1                  1                      50%
1+4          2              0                  2                      0%
4            10             5                  0                      50%
5            5              5                  0                      100%
7            7              3                  0                      42.9%
4+7          4              2                  1                      50%
Neutral      19             14                 0                      73.7%
Total        80             50                 8                      62.5%

Table 8: Results for AU combinations in the CMU database

Actual AUs   # of Samples   Fully Recognized   Partially Recognized   % Fully Correct
1+2          12             TBD                TBD                    TBD
1+2+5        19             TBD                TBD                    TBD
1+2+6+7      2              TBD                TBD                    TBD
1+4          2              TBD                TBD                    TBD
4            10             TBD                TBD                    TBD
5            5              TBD                TBD                    TBD
7            6              TBD                TBD                    TBD
4+7          4              TBD                TBD                    TBD
6+7          1              TBD                TBD                    TBD
Total        80             TBD                TBD                    TBD

In real-world applications the face analysis system should be fully automatic and should not require any human intervention, which is challenging due to the presence of head movements, pose variations and occlusions in a natural scenario. This system is evaluated under these challenging conditions, and hence the results reported are the state of the art for natural human-computer interaction.

5 Conclusion and Future Work

This paper demonstrates a fully automatic framework that can recognize upper facial action units. This framework can be used in scenarios where the machine needs the perceptual ability to recognize, model and analyze facial activity in real time without any manual intervention. The system first tracks the pupil positions robustly using the red-eye effect; these positions are then used to localize the eyes and eyebrows. The shape parameters corresponding to these facial features are recovered using Principal Component Analysis (PCA). Once the parameters describing the facial features are recovered, they are used to recognize the facial actions. Support vector machines (SVMs) are used to recognize the facial actions, and a recognition accuracy of 69.29% for each individual AU is reported.

The system can correctly identify all possible AU combinations with an accuracy of 62.5% on a real and fully natural dataset. The dataset used for evaluation is completely natural, and the paper demonstrates how computer vision and machine learning can be integrated to build real-world applications.

The framework suggested in this paper has some limitations. The system depends upon robust pupil tracking, which currently breaks when the subjects are wearing glasses. The pattern recognition used to find the pupils can be further refined to track the pupils even for subjects with glasses. Since the system uses infrared LEDs, it is unusable in the presence of a bright infrared source (like the sun), and alternate pupil tracking techniques that do not rely on infrared should be explored. It is also possible to refine the shape parameter extraction by taking into account zoom and variations due to pose changes. The system can also be extended to track lower facial features, like the lips and nose, and to recognize lower facial action units as well. The face is a very important channel that emits signals related to the internal state, and a lot of effort is being devoted to unraveling this relationship. Besides being used as a man-machine interface, this framework would hopefully be useful to many of these research efforts as well.

References

[1] M. A. Bartlett, J. C. Hager, P. Ekman, and T. Sejnowski. Measuring facial expressions by computer image analysis. Psychophysiology, 36(2):253–263, March 1999.

[2] M. Black and Y. Yacoob. Tracking and recognizing rigid and non-rigid facial motions using local parametric model of image motion. In Proceedings of the International Conference on Computer Vision, pages 374–381, Cambridge, MA, 1995. IEEE Computer Society.

[3] G. C. Cawley. MATLAB support vector machine toolbox (v0.50β), http://theoval.sys.uea.ac.uk/~gcc/svm/toolbox. University of East Anglia, School of Information Systems, Norwich, Norfolk, U.K. NR4 7TJ, 2000.

[4] M. Covell. Eigen-points. In Proceedings of the International Conference on Image Processing, September 1996.

[5] M. Covell. Eigen-points: control-point location using principal component analyses. In Proceedings of the Conference on Automatic Face and Gesture Recognition, October 1996.

[6] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. G. Taylor. Emotion recognition in human-computer interaction. IEEE Signal Processing Magazine, 18(1):33–80, January 2001.

[7] G. Donato, M. Bartlett, J. Hager, P. Ekman, and T. Sejnowski. Classifying facial actions. IEEE Pattern Analysis and Machine Intelligence, 21(10):974–989, October 1999.

[8] Edmark. Fripple Place. http://www.riverdeep.net/edconnect/softwareactivities/critical thinking/fripple place.jhtml.


[9] P. Ekman and W. V. Friesen. The Facial Action Coding System: A Technique for Measurement of Facial Movement. Consulting Psychologists Press, San Francisco, CA, 1978.

[10] I. Essa and A. Pentland. Coding, analysis, interpretation and recognition of facial expressions. Pattern Analysis and Machine Intelligence, 7:757–763, July 1997.

[11] A. Haro, I. Essa, and M. Flickner. Detecting and tracking eyes by using their physiological properties. In Proceedings of the Conference on Computer Vision and Pattern Recognition, June 2000.

[12] T. Kanade, J. F. Cohn, and Y. Tian. Comprehensive database for facial expression analysis. In Proceedings of the Conference on Automatic Face and Gesture Recognition, 2000.

[13] A. Kapoor. Automatic facial action analysis. Master's thesis, Massachusetts Institute of Technology, 2002.

[14] A. Kapoor and R. W. Picard. Real-time, fully automatic upper facial feature tracking. In Proceedings of the Conference on Automatic Face and Gesture Recognition, May 2002.

[15] J. Lien, T. Kanade, J. Cohn, and C. C. Li. Detection, tracking and classification of action units in facial expression. Journal of Robotics and Autonomous Systems, 31:131–146, 2000.

[16] C. Morimoto, D. Koons, A. Amir, and M. Flickner. Pupil detection and tracking using multiple light sources. Technical report, IBM Almaden Research Center, 1998.

[17] Y. Tian, T. Kanade, and J. F. Cohn. Recognizing upper face action units for facial expression analysis. In Proceedings of the Conference on Computer Vision and Pattern Recognition, June 2000.

[18] Y. Tian, T. Kanade, and J. F. Cohn. Recognizing action units for facial expression analysis. Pattern Analysis and Machine Intelligence, 23(2), February 2001.

[19] Y. Yacoob and L. Davis. Computing spatio-temporal representation of human faces. In CVPR, pages 70–75, Seattle, WA, June 1994.

[20] Z. Zhang. Feature-based facial expression recognition: Sensitivity analysis and experiments with a multilayer perceptron. International Journal of Pattern Recognition and Artificial Intelligence, 13(6):893–911, 1999.
