Group Emotion Recognition with Individual Facial Emotion CNNs and Global Image based CNNs

Lianzhi Tan, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Kaipeng Zhang, National Taiwan University
Kai Wang, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Xiaoxing Zeng, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Xiaojiang Peng, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Yu Qiao, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
ABSTRACT
This paper presents our approach for group-level emotion recognition in the Emotion Recognition in the Wild Challenge 2017. The task is to classify an image into one of three group emotion categories: positive, neutral, or negative. Our approach is based on two types of Convolutional Neural Networks (CNNs), namely individual facial emotion CNNs and global image based CNNs. For the individual facial emotion CNNs, we first extract all the faces in an image and assign the image label to all of them for training. In particular, we utilize a large-margin softmax loss for discriminative learning, and we train two CNNs, one on aligned and one on non-aligned faces. For the global image based CNNs, we compare several recent state-of-the-art network architectures and data augmentation strategies to boost performance. For a test image, we average the scores from all faces and the whole image to predict the final group emotion category. We win the challenge with accuracies of 83.9% and 80.9% on the validation set and testing set respectively, which improve the baseline results by about 30%.
CCS CONCEPTS
• Computing methodologies → Image representations;
KEYWORDS
Emotion recognition, group-level emotion recognition, deep learning, Convolutional Neural Networks, large-margin softmax

ACM Reference Format:
Lianzhi Tan, Kaipeng Zhang, Kai Wang, Xiaoxing Zeng, Xiaojiang Peng, and Yu Qiao. 2017. Group Emotion Recognition with Individual Facial Emotion CNNs and Global Image based CNNs. In Proceedings of ICMI '17, Glasgow, United Kingdom, November 13-17, 2017, 5 pages.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from email@example.com.
ICMI '17, November 13-17, 2017, Glasgow, United Kingdom
© 2017 Association for Computing Machinery.
ACM ISBN 978-1-4503-5543-8/17/11. $15.00
https://doi.org/10.1145/3136755.3143008
1 INTRODUCTION
Facial emotion recognition has wide applications in human-computer interaction, virtual reality, and entertainment. It is a challenging problem due to the variance of actors, large head poses, illumination, occlusion, etc. Recently, group-level emotion recognition in images has attracted increasing attention; it is even more challenging because of cluttered backgrounds and low-resolution faces. Group-level emotion refers to the moods, emotions, and dispositional affects of a number of people. The task in this paper is to classify a group's perceived emotion as positive, neutral, or negative. It is especially useful for social image analysis and user emotion prediction.
Related work. Huang et al. proposed a Riesz-based volume local binary pattern and a continuous conditional random fields model. Mou et al. proposed group-level arousal and valence recognition from the views of face, body, and context. Unaiza et al. used a Hybrid-CNN to infer the image sentiment of social events. In the Group based Emotion Recognition in the Wild (EmotiW) 2016 challenge, happiness is measured on a level from 0 to 5. The winner proposed a scene feature extractor and a series of face feature extractors with LSTM-based regression. The second place proposed a bottom-up approach using geometric features and Partial Least Squares regression. The third place proposed an LSTM for dynamic emotion and group emotion recognition in the wild. As a baseline, the organizers refer to global and local image features as top-down and bottom-up components. In their work, global features contain scene features related to factors external to the group members' characteristics, while local features contain facial expressions and face attributes related to the intrinsic characteristics of the individuals in the group. However, their method relies on LBQ, PHOG, and CENTRIST features, whose capacity to capture face and scene representations is limited. Therefore, we propose an overall deep group emotion recognition framework that combines deep face-level representation and deep scene-level representation to infer group emotion.
Figure 1: The system pipeline of our approach. It contains two kinds of CNNs, namely the individual facial emotion CNNs and the global image based CNNs. The final prediction is made by averaging all the scores of the CNNs from all faces and the image.
2 OUR APPROACH

2.1 System pipeline
Our system pipeline is shown in Figure 1. It contains two types of convolutional neural networks, one based on whole images and the other based on faces. In particular, we train two face-based CNNs, one with face alignment and one without. We fuse the scores of the CNNs from all faces and the image to predict the final group emotion category.
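As an illustration of this fusion step, averaging per-class scores from all models might look like the following sketch (the function name and the use of plain per-class score vectors are our assumptions, not details fixed by the paper):

```python
import numpy as np

def fuse_scores(face_scores, image_scores,
                classes=("positive", "neutral", "negative")):
    """Average per-class scores from all face CNNs and the global image
    CNNs, then pick the class with the highest mean score."""
    # face_scores: one score vector per detected face (may be empty);
    # image_scores: one score vector per global image based CNN.
    all_scores = [np.asarray(s, dtype=float)
                  for s in list(face_scores) + list(image_scores)]
    mean = np.mean(all_scores, axis=0)
    return classes[int(np.argmax(mean))], mean
```

When no face is detected, the same function degrades gracefully to using only the global image scores.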
2.2 Individual facial emotion CNNs
We consider the facial emotion of each face to be the most important cue for group emotion recognition. To this end, we utilize two complementary CNNs to exploit this part, namely the aligned facial emotion CNN and the non-aligned facial emotion CNN. We first briefly introduce the face detector and the large-margin softmax loss used, as follows.
Face Detection. We use MTCNN to detect faces in the images. MTCNN is a CNN-based face detection method. It uses three cascaded CNNs for fast and accurate detection, and jointly learns face detection and face alignment (detection of five facial landmarks, i.e. two eyes, two mouth corners, and the nose). It builds an image pyramid from the input image and then feeds it into the three-stage cascaded framework. Candidate regions are produced in the first stage and refined in the latter two stages. The final detection results and the corresponding facial landmark locations are produced in the third stage.
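As an illustration of the pyramid construction, the scale factors can be computed as below (the 0.709 shrink factor and the 12-pixel first-stage input are the commonly used MTCNN defaults, assumed here rather than stated in the paper):

```python
def pyramid_scales(height, width, min_face=20, factor=0.709, net_input=12):
    """Scale factors for an MTCNN-style image pyramid: first rescale so a
    face of size `min_face` maps to the first-stage network input size,
    then shrink repeatedly until the image falls below that input size."""
    scales = []
    m = net_input / min_face            # map min_face px faces to net_input px
    min_side = min(height, width) * m
    while min_side >= net_input:        # stop once the image is too small
        scales.append(m)
        m *= factor
        min_side *= factor
    return scales
```

Each scale yields one resized copy of the image that the first-stage network scans for 12 x 12 face candidates.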
Large-margin softmax loss (L-Softmax). L-Softmax is introduced for discriminative learning and can alleviate the overfitting problem. It encourages intra-class compactness and inter-class separability between learned features through an angular margin constraint. In the fine-tuning, for a face feature $x_i$, the loss is computed by:
$$L_i = -\log \frac{e^{\|w_{y_i}\| \|x_i\| \psi(\theta_{y_i})}}{e^{\|w_{y_i}\| \|x_i\| \psi(\theta_{y_i})} + \sum_{j \neq y_i} e^{\|w_j\| \|x_i\| \cos\theta_j}} \quad (1)$$

where $y_i$ is the label of $x_i$, $w_j$ is the weight vector of the $j$-th class in a fully-connected layer, and

$$\cos(\theta_j) = \frac{w_j^T x_i}{\|w_j\| \|x_i\|}, \quad (2)$$
Table 1: The architecture of our aligned facial emotion CNN.

Type     | Filter Size/Stride                 | Output Size
Conv1.x  | 3×3/2, then [3×3/1; 3×3/1] × 3     | 48 × 56 × 64
Conv2.x  | 3×3/2, then [3×3/1; 3×3/1] × 8     | 24 × 28 × 128
Conv3.x  | 3×3/2, then [3×3/1; 3×3/1] × 16    | 12 × 14 × 256
Conv4.x  | 3×3/2, then [3×3/1; 3×3/1] × 3     | 6 × 7 × 512
FC1_bn   | 512                                | 512
FC2      | 3                                  | 3
$$\psi(\theta) = (-1)^k \cos(m\theta) - 2k, \quad \theta \in \left[\frac{k\pi}{m}, \frac{(k+1)\pi}{m}\right], \quad (3)$$

where $m$ is the pre-set angular margin constraint and $k \in [0, m-1]$ is an integer.
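As a quick check of this piecewise margin function, $\psi(\theta)$ can be evaluated directly; a small Python sketch (θ in radians, restricted to [0, π]):

```python
import math

def psi(theta, m):
    """Piecewise margin function psi(theta): equal to cos(m*theta) on each
    interval [k*pi/m, (k+1)*pi/m] up to sign, shifted by -2k so that the
    pieces join continuously and the whole function decreases on [0, pi]."""
    # k is the integer with theta in [k*pi/m, (k+1)*pi/m]
    k = min(int(theta * m / math.pi), m - 1)  # clamp at theta == pi
    return (-1) ** k * math.cos(m * theta) - 2 * k
```

For m = 1 this reduces to the ordinary cos θ, recovering the standard softmax; larger m enforces a stricter angular margin for the target class.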
2.2.1 Aligned facial emotion CNN. Table 1 shows the architecture of our aligned facial emotion CNN. The model takes 96 × 112 RGB aligned face images as input and feeds them into four stages of convolutional layers followed by two fully-connected layers. We use a similarity transform based on the five detected facial landmarks to align the faces. To alleviate overfitting and enhance generalization, we pre-train the model on a face recognition dataset and then fine-tune it on the EmotiW 2017 training dataset using the L-Softmax loss.
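The similarity transform between the five detected landmarks and a fixed face template can be estimated in closed form with Umeyama's least-squares method; a minimal NumPy sketch (the landmark coordinates used in the test are illustrative, not the template from the paper):

```python
import numpy as np

def estimate_similarity(src, dst):
    """Least-squares similarity transform (scale s, rotation R, translation
    t) mapping 2-D points src onto dst, via Umeyama's closed-form method."""
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    mu_s, mu_d = src.mean(0), dst.mean(0)
    sc, dc = src - mu_s, dst - mu_d
    cov = dc.T @ sc / len(src)                 # cross-covariance matrix
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U) * np.linalg.det(Vt))
    D = np.diag([1.0, d])                      # guard against reflections
    R = U @ D @ Vt
    var_s = (sc ** 2).sum() / len(src)
    s = np.trace(np.diag(S) @ D) / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t
```

Given the estimated (s, R, t), each face is warped so its landmarks land on the template positions before being fed to the aligned CNN.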
2.2.2 Non-aligned facial emotion CNN. Following previous work, we train a non-aligned facial emotion CNN, which is complementary to the aligned facial emotion CNN in our case. This model is based on ResNet34 and takes 48 × 48 gray-scale face images as input.
Inspired by previous work, we pre-train the model on the FERPlus expression dataset, where all the faces are cropped to 48 × 48. Some samples of this dataset are shown in Figure 2. We use MTCNN to detect the faces and crop them to 48 × 48.
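A minimal sketch of this cropping step with Pillow (cropping the detected box without extra margin is an assumption here; the paper does not specify any padding):

```python
from PIL import Image

def crop_gray_face(img, box, size=48):
    """Crop a detected face box (x1, y1, x2, y2) from a PIL image, convert
    it to gray-scale, and resize it to size x size for the non-aligned CNN."""
    x1, y1, x2, y2 = [int(round(v)) for v in box]
    face = img.convert("L").crop((x1, y1, x2, y2))
    return face.resize((size, size), Image.BILINEAR)
```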
Figure 2: Some samples of the FERPlus dataset.
2.3 Global image based CNNs
In some sense, the global appearance of an image also reflects the group-level emotion. For example, an image taken at a wedding party is likely to be positive, while one from a funeral is likely to be negative. To this end, we train global image based CNNs with several state-of-the-art network architectures, namely VGG19, BN-Inception, and ResNet101.
3 EXPERIMENTS
In this section, we first present the EmotiW 2017 dataset and the implementation details, then show the evaluations of the aforementioned CNNs, and finally evaluate score combinations.