Facial Expression Recognition in Static Images

Ting-Yen Wang and Ting-You Wang

Advised by Jiun-Hung Chen

Abstract

With the advent of the Viola-Jones face detection algorithm, the computer vision community has a fairly good method for detecting faces. However, there is currently very little published research on detecting facial expressions in still images, as most publications focus on detecting facial expressions in video. This paper describes our three-week research project on facial expression detection in still images using different combinations of image processing methods and machine learners. We processed images in two ways, using the raw pixels and using eigenfaces; two different machine learners, K-Nearest Neighbors and Support Vector Machines, classified the processed images. Our results indicate that detecting facial expressions in still images appears to be possible: using raw images with Support Vector Machines produced very promising results that should lead to further research and development.

Introduction

There have been many advances in face detection; however, expression detection is still in its early stages. A good deal of work has been done in this area, along with some applications of it. For example, Sony cameras have a Smile Detection feature that is supposed to detect when a person in the image is smiling (http://www.gadgetbb.com/2008/02/27/sony-dsc-t300-first-camera-with-smile-detection/). Others who have worked in this field include CMU (http://www.pitt.edu/~emotion/research.html) and BMW (Bimodal Fusion of Emotional Data in an Automotive Environment, S. Hoch, F. Althoff, G. McGlaun, G. Rigoll). Such research has focused on detecting when a face changes into a particular expression. That is, a face is tracked over a video sequence, and changes in the image from a neutral state are used to determine whether the face has moved into another state. Note that there are generally seven categories of expression: Anger, Disgust, Fear, Happy, Neutral, Sadness, and Surprise.

We do not always have the luxury of a sequence of images starting from a person's neutral state; in this paper we discuss our research into the feasibility of detecting an expression from a single still image. We combine various techniques for finding and describing a face, such as Viola-Jones, machine learners, and eigenfaces. In the following sections, we discuss the process of creating our classifiers, the testing of the classifiers along with results and conclusions, and we end with some future work.

Process

The first step was to detect the faces within an image, since these are what we hope to classify. To do this we leverage the Viola-Jones face detection algorithm in OpenCV. The OpenCV repository contains a Haar cascade classifier that finds frontal faces. Each detected face is then saved to a file to be processed by a classifier.
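As a rough illustration of this step, here is a minimal sketch using OpenCV's modern Python bindings (the project itself used the OpenCV C API); the cascade is the stock frontal-face Haar classifier, and the file names are placeholders:

import cv2

# Load the stock frontal-face Haar cascade that ships with OpenCV.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

image = cv2.imread("input.jpg")                       # placeholder file name
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Viola-Jones detection; each hit is an (x, y, w, h) bounding box.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

# Crop each detected face and save it for the classification stage.
for i, (x, y, w, h) in enumerate(faces):
    cv2.imwrite("face_%02d.jpg" % i, gray[y:y + h, x:x + w])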

The next step was to create classifiers for the expressions we wanted to recognize. We began with a simple Smile versus No smile binary classification. To see whether it was even remotely possible to classify images, we started with the small set of class images from CSE 576 project 3. This image database contains 17 smiling images and 17 non-smiling images. The next step was to move to a much larger database, provided by CMU, with over 8000 images spanning 7 classifications: Happy, Sad, Anger, Fear, Disgust, Surprise, and Neutral.

The classifiers we chose were K-Nearest Neighbors and Support Vector Machines, and we trained each of them with two types of features. The first feature vector was simply the raw image, grayscaled and resized to 25 by 25 (resulting in a feature vector of length 625). This provides a baseline for all other classification methods. The second feature vector is based on eigenfaces: a face is projected onto a set of eigenfaces, and the resulting coefficients are used as the feature vector. For this method, we used about 70 images (10 from each expression) to create the eigenfaces and kept the top 30.
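To make the two feature extractors concrete, here is a minimal sketch assuming NumPy and the cv2 PCA helpers; the 25x25 size and the 30 eigenfaces come from the description above, while everything else (input format, helper names) is illustrative:

import cv2
import numpy as np

def raw_feature(face):
    # Grayscale face resized to 25x25 and flattened into a length-625 vector.
    return cv2.resize(face, (25, 25)).astype(np.float32).reshape(-1)

def fit_eigenfaces(train_faces, num_components=30):
    # PCA over the training faces (the project used about 70 of them, 10 per
    # expression); the top eigenvectors are the "eigenfaces".
    data = np.stack([raw_feature(f) for f in train_faces])
    mean, eigenvectors = cv2.PCACompute(data, mean=None, maxComponents=num_components)
    return mean, eigenvectors

def eigenface_feature(face, mean, eigenvectors):
    # Project the face onto the eigenfaces; the coefficients are the feature vector.
    return cv2.PCAProject(raw_feature(face).reshape(1, -1), mean, eigenvectors).ravel()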

To keep images consistent for classification, we ran the Viola-Jones face finder on the training images and cropped out the faces, so that training faces would generally have the same kind of bounding box as faces found in the test images. One issue with this method is that Viola-Jones finds many false positives, so we had to manually delete all of the false positives it identified.

After creating all the training images, the next step was to actually train the classifiers. To test the classifiers, we also wrote a script that, for every 10 images, removes 1 from the training set and saves it for cross-validation testing. The classifiers were built using libraries from OpenCV. The provided KNN library is fairly straightforward: we simply input each feature vector along with a classification number. The SVM library was more complicated and had many configuration options that had to be set before it could be run. For the purposes of our project, we used some basic defaults. This myriad of configuration options leaves a lot of room to fine-tune the SVM approach and improve the results (for example, we had to change the weights for the different classes to account for classes that are less represented).
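A minimal sketch of this training and hold-out setup, using the cv2.ml module (the modern counterpart of the OpenCV machine-learning library we used); the feature/label file names, the linear kernel, and the class-weight values are all placeholders rather than the project's actual settings:

import cv2
import numpy as np

samples = np.load("train_features.npy").astype(np.float32)  # one feature vector per row
labels = np.load("train_labels.npy").astype(np.int32)       # expression codes, e.g. 0..6

# Hold out every 10th image for cross-validation, as described above.
holdout = np.arange(len(samples)) % 10 == 9
train_x, train_y = samples[~holdout], labels[~holdout]
test_x, test_y = samples[holdout], labels[holdout]

# KNN: simply feed each feature vector with its classification number.
knn = cv2.ml.KNearest_create()
knn.train(train_x, cv2.ml.ROW_SAMPLE, train_y.reshape(-1, 1).astype(np.float32))
_, knn_pred, _, _ = knn.findNearest(test_x, 10)              # K = 10, as in the project

# SVM: needs more configuration; per-class weights can compensate for the
# over-represented neutral class (the weight values below are made up).
svm = cv2.ml.SVM_create()
svm.setType(cv2.ml.SVM_C_SVC)
svm.setKernel(cv2.ml.SVM_LINEAR)
svm.setClassWeights(np.array([2, 2, 2, 2, 0.5, 2, 2], dtype=np.float32))
svm.train(train_x, cv2.ml.ROW_SAMPLE, train_y.reshape(-1, 1))
_, svm_pred = svm.predict(test_x)

print("KNN accuracy:", np.mean(knn_pred.ravel() == test_y))
print("SVM accuracy:", np.mean(svm_pred.ravel() == test_y))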

After setting up the classifiers, the final step was to run tests on some test images: our first test used the images we had reserved for cross-validation, and our second test used images completely unrelated to any images in the training set. We go into further detail in the Results section, but we did notice that certain classifications were much more likely than others; in particular, the neutral face seemed to dominate the rest. Therefore, we had to change certain weights to avoid misclassifying the non-neutral classes. The general workflow, shown in the diagram at the end of this report, is: the Viola-Jones face detector locates and crops the face, and the cropped face is passed to a classifier (trained with roughly 1500 images) that outputs an expression label such as Happy.

We eventually found that, with seven classes, using raw images with SVM produced the fewest misclassifications, while using eigenfaces was less accurate. Since our best method still made some misclassifications, we then focused on improving the Smile/No Smile classification in the time we had. One problem, we believe, was that neutral faces were overrepresented in the CMU database by an order of magnitude compared with any other class. We therefore reduced the number of neutral training faces and combined them with the angry, fear, disgust, sad, and surprise faces to form the non-smiling class. This continued to exhibit the same problem as before, classifying nearly everything as neutral. The underlying issue was that certain classes of faces are too similar to happy faces, particularly when the mouth is open with teeth showing. This led us to remove anger and fear from the non-smiling class, which gave better results.

Finally, we looked into the use of contours, since this would allow us to focus on the curve of the mouth. However, we determined that it might not be very suitable: in order to find the mouth, we needed to lower the gradient threshold, but lowering the threshold allowed many undesired edges to show up in the image, including teeth, creases on the face, shadows, and hair, making classification fairly difficult. At the other extreme, a fairly high threshold caused us to miss most of the mouth while still picking up noise from the image (creases around the mouth and shadows).

This is not to say that the method will not work, but it would require careful tuning and possibly more image processing to get good results.
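For reference, here is a rough sketch of this kind of threshold experiment. The report does not say exactly which edge detector was used; Canny with two arbitrary threshold pairs is shown here purely to illustrate the trade-off, and the face file name is a placeholder:

import cv2

face = cv2.imread("face_00.jpg", cv2.IMREAD_GRAYSCALE)

# Low thresholds: the mouth shows up, but so do teeth, facial creases, shadows, and hair.
edges_low = cv2.Canny(face, 30, 90)
# High thresholds: most of that noise disappears, but so does most of the mouth.
edges_high = cv2.Canny(face, 120, 240)

# OpenCV 4.x returns (contours, hierarchy).
contours_low, _ = cv2.findContours(edges_low, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
contours_high, _ = cv2.findContours(edges_high, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
print(len(contours_low), "contours at the low thresholds vs.",
      len(contours_high), "at the high ones")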

Results

Two classes

Initial results showed that this method of facial expression recognition is promising. For Smile versus No smile (no smile being all images that were not Happy), using KNN on raw images as feature vectors classified 11/18 of the smiling and 122/130 of the non-smiling images correctly when feeding the cross-validated (remove one for every 10) data back in. SVM was even better, with only one incorrect classification out of 148 tests. These results show that the faces in the CMU database can be classified fairly well.

Though the accuracy was very good, these results were somewhat biased. After examining the training data we got from CMU, we discovered that it contained many repeated faces, so that for any face there was a very similar face of the right classification in the database, making it easier for SVM and KNN to classify the images. External images were almost always classified as non-smiling (more specifically, neutral), which seemed to become the classifier's average face for faces it did not recognize. This made some sense: for any person, the average face is the neutral face, and this expression, or lack thereof, was overrepresented in the database.

To avoid using the repeated faces, we created a new test set with faces that did not have any hint of similarity to the images in our database. We then bumped up the weights for smiling (Happy) to compensate for the overrepresentation of non-smiling faces. We also removed some faces that looked too much like smiling; specifically, we removed the Fear and Angry faces from the non-smiling class. We then retried SVM on our new test data with the modified weights, and the results were promising: it was still able to classify 96/101 correctly. Nearly all of the errors were in the fear images, which should have been labeled as non-smiling; since fear was no longer in the database, many of these moved to smiling. We also retried KNN on our new test data (with no weight adjustment, as KNN does not use weights), and it was able to classify 92/101 correctly. KNN had its errors spread throughout the test cases because we could not weight one classification more heavily than the others to avoid misclassifying particular expressions.

Using eigenface coefficients as feature vectors did not produce better results; they were generally worse. The method does run much more quickly, since we only have 30 feature points per image, but probably for this same reason, 30 feature points were not enough to capture an expression, and most expressions were absorbed into neutral. To quickly summarize, using the cross-validation technique, we classified 17/18 smiling and 126/130 non-smiling images correctly. So, if a face is in our database, the eigenface method is very good at finding that person's expression again. However, if we move to people not represented in the database, nearly all results go to non-smiling; only 2/20 images were correctly classified as smiling on our new data set. This showed that eigenfaces do not capture a general expression very well.

Three classes

To make the system slightly more complex, we added surprise as a third class for raw images. We did not do this for eigenfaces, since their results were not very good with just two classes. Using raw images, SVM continued to do a fairly good job of classifying.

Raw Images with SVM using 3 Expressions (smile vs. surprise vs. non-smile)

Students not in the database; classes weighted; no Fear/Anger; using half of the neutral images
Predicted classes, in column order: Smile / Surprise / Non-Smile

Smile: 95% (19), 5% (1)
Surprise: 94.7% (18), 5.5% (1)
Non-Smile: 4.8% (3), 95.2% (59)

KNN, on the other hand, began to have problems differentiating neutral from surprise.

Raw Images with KNN using 3 Expressions (smile vs. surprise vs. non-smile)

Students not in the database; no Fear/Anger; using half of the neutral images
Predicted classes, in column order: Smile / Surprise / Non-Smile

Smile: 75% (15), 25% (5)
Surprise: 42.1% (8), 57.9% (11)
Non-Smile: 8.1% (5), 91.9% (57)

Seven classes

Finally, we moved to classifying all seven categories, and we experienced a very similar situation. When feeding the CMU database back into the system, KNN was not very good, hovering around 50% for each category and classifying most expressions as neutral:

Raw Images with KNN using 7 Expressions

Students already in the database
Predicted classes, in column order: Anger / Disgust / Fear / Happy / Neutral / Sadness / Surprise

Anger: 42.9% (3), 57.1% (4)
Disgust: 63.6% (7), 36.4% (4)
Fear: 40% (4), 10% (1), 50% (5)
Happy: 61.1% (11), 38.9% (7)
Neutral: 100% (71)
Sadness: 60% (9), 40% (6)
Surprise: 62.5% (10), 37.5% (6)

The problem was that neutral was polluting all the other categories. Perhaps tuning K (currently 10) would produce better results.

For SVM, the neutral faces at first dominated everything, with nearly everything classified as neutral. With a few modifications to the weights of the non-neutral classes, we were able to get nearly 100% accuracy.

Raw Images with SVM using 7 Expressions

Students already in the database; classes weighted
Predicted classes, in column order: Anger / Disgust / Fear / Happy / Neutral / Sadness / Surprise

Anger: 100% (7)
Disgust: 100% (11)
Fear: 90% (9), 10% (1)
Happy: 100% (18)
Neutral: 100% (71)
Sadness: 100% (15)
Surprise: 100% (16)

Keeping the same weight modifications, we then tested the SVM approach with the external faces (those with no relation to our training images). Surprisingly, the results were very promising: we were able to classify the expression correctly about 90.1% of the time, with an average per-expression accuracy of 88.6%!

Raw Images with SVM using 7 Expressions

Students not in the database; classes weighted
Predicted classes, in column order: Anger / Disgust / Fear / Happy / Neutral / Sadness / Surprise

Anger: 100% (3)
Disgust: 71.4% (5), 28.6% (2)
Fear: 90.9% (10), 9.1% (1)
Happy: 5% (1), 90% (18), 5% (1)
Neutral: 3.45% (1), 93.1% (27), 3.45% (1)
Sadness: 16.67% (2), 8.33% (1), 75% (9)
Surprise: 100% (19)

When using the new test data, KNN was unable to make good classifications and produced many errors: only 55.4% overall accuracy and an average per-expression accuracy of only 35%.

Raw Images with KNN using 7 Expressions

Students not in the database
Predicted classes, in column order: Anger / Disgust / Fear / Happy / Neutral / Sadness / Surprise

Anger: 100% (3)
Disgust: 14.3% (1), 14.3% (1), 71.4% (5)
Fear: 9.1% (1), 27.3% (3), 63.6% (7)
Happy: 65% (13), 35% (7)
Neutral: 96.55% (28), 3.45% (1)
Sadness: 8.33% (1), 75% (9), 16.67% (2)
Surprise: 21.1% (4), 26.3% (5), 52.6% (10)

We also tried eigenfaces here and saw the same behavior as with two classes. For test subjects with a similar expression of their own in the database, the method performed fairly well, but on new test subjects nearly everything became neutral.

Eigenfaces with SVM using 7 Expressions

Students not in the database; classes weighted
Predicted classes, in column order: Anger / Disgust / Fear / Happy / Neutral / Sadness / Surprise

Anger: 100% (3)
Disgust: 14.3% (1), 85.7% (6)
Fear: 100% (11)
Happy: 15% (3), 85% (17)
Neutral: 100% (29)
Sadness: 100% (12)
Surprise: 84.2% (16), 5.3% (3)

Though the results were poor, it is possible that with some modifications to the weights, to other SVM parameters, and to the number of eigenfaces, we may be able to achieve an unbiased classifier using eigenfaces.

Experience

Although we were eventually able to obtain promising results with our facial expression classification algorithms, reaching that point was not easy. As students completely new to the field of Computer Vision, it was difficult to come up with a research idea that was grounded in Computer Vision and could feasibly be done in three weeks. Being able to discuss our ideas with the Professor and the Teaching Assistant was very helpful during this portion of the project. That said, the freedom to pursue whatever we wanted had an enjoyable aspect to it; we took the opportunity to let our imaginations mix what we had learned in class with what we wanted to create.

The next portion of the project, coming up with an implementation plan, magnified our Computer Vision inexperience, which made our TA's (Jiun-Hung Chen) help and advice invaluable! We ended up choosing a project with very little published research, so we found ourselves coming up with solutions from scratch. Given our time limitations and lack of experience, the opportunity to go over our ideas and their feasibility with Jiun-Hung was extremely helpful. Jiun-Hung helped us focus our three weeks on a couple of promising paths instead of wasting time on a series of unworkable algorithms.

With some ideas for how to go about the project, we were equipped to face the challenges of implementation, most significantly implementing Viola-Jones and the machine learners and constructing the databases. The hardest part of implementing Viola-Jones and the machine learners was learning how to incorporate and use OpenCV effectively. For example, for the OpenCV machine learners, we had to become competent with the many complicated functions needed for KNN and SVM despite the lack of good examples and documentation; in fact, there were no examples of SVM that we could use. This competency with the OpenCV functions became especially important during the testing phase of our project, where we had to tune our expression recognition system, especially the machine learners. We also had to discover and work around the quirks of OpenCV; for example, a number of the machine learning functions declared in the .h header files were not actually implemented in OpenCV. Despite these challenges, we really appreciated having OpenCV as a tool for implementing complex algorithms like Viola-Jones and SVM, as it allowed us to focus our project on actually trying to solve a problem (classifying facial expressions) instead of just trying to get an algorithm to work (such as Viola-Jones).

Constructing the databases was one of the most time-consuming portions of this project. Perhaps the most time-consuming part was sorting our sample images into the proper expression classifications (i.e., Anger, Fear, Disgust, etc.); in our case, around 1500 images for training and a few hundred more for testing. We then needed to format the images according to what our facial expression recognition system expected: the raw image system used JPG images, whereas the eigenfaces method used TGA images.

After completing the implementation of our system and creating our image databases, the testing portion turned out to be both the most depressing and the most exhilarating part of the project. Since we were embarking on a novel research project, we experienced a very large number of failed attempts before finally arriving at a point where we were successfully classifying expressions! After connecting all of our project components (databases, Viola-Jones, machine learners, and image processors like eigenfaces), we ran a number of test sets, and each time we discovered new ways both to break our system and to tune it. At times, it seemed as if we would never get our system to work correctly on some of our more complicated test sets. However, after many long hours of head scratching and intense labor, finally seeing a working system transformed our former grief into amazing exhilaration!

Overall, this project provided a valuable experience in the field of Computer Vision. We were challenged at every step, from idea formulation, to implementation, to testing. However, with advice from our TA, Jiun-Hung Chen, and many hours spent working through these challenges, we were able to grow in many aspects of Computer Vision. In the end, developing a working artifact has really encouraged us to continue looking into the field of Computer Vision, perhaps even as a research area!

Future Work

Our work showed that it may well be possible to take a still image and determine the expression on the face; however, there is still much work to be done. In the limited time we had, we provided a couple of baseline examples that can be extended fairly easily given some time. Some ideas we considered were contours, weighting parts of the face, and possibly a mixture of these.

Contours, as described above, were briefly examined, but we did not pursue them further at this time. Taking contours naively produces a very noisy result (for example, we get contour lines for individual teeth and for creases on the face), and it is very difficult to ascertain what expression the face is making. Even if we decided to look only at the mouth, it is actually pretty hard to tell where the mouth is located! However, contours do have promise if we are able to identify the contour that represents the mouth area.

Another option that could provide better results is to weight certain parts of the image as more informative than others. For example, much of the expression in these images is carried by what the mouth is doing, and we know that the mouth is in the bottom half of the face (at least for our images). With these two pieces of information, we could enhance the bottom half of each face before feeding it to the classifier, or in the extreme case simply cut off the top half of the image, so that the classifier focuses only on differences around the mouth (a rough sketch of this idea appears below). This would be our next step if we had the time.

Lastly, we only looked at two classification methods, KNN and SVM, and only a few variations of the database and tuning parameters (such as the number of eigenfaces, weights, K, etc.). There may be other classifiers that perform this kind of classification much better than either KNN or SVM. At the same time, SVM has many tunable parts that might further improve its ability to classify the images accurately. Furthermore, our face database might not have been best suited to identifying each expression clearly. By adjusting the images we used to train our classifiers and tuning some weights, we were able to get much better results; this indicates that with the right choice of images and parameters, the system would be much more robust to wide variations in expression.
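A minimal sketch of the cropping idea mentioned above, assuming the cropped face images produced by the detector stage and reusing the raw-pixel feature pipeline (the file name and sizes are placeholders):

import cv2

face = cv2.imread("face_00.jpg", cv2.IMREAD_GRAYSCALE)

# Keep only the bottom half of the face (where the mouth sits for frontal faces)
# before building the raw-pixel feature vector.
lower_half = face[face.shape[0] // 2:, :]
feature = cv2.resize(lower_half, (25, 25)).astype("float32").reshape(-1)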

[Workflow diagram: input image → Viola-Jones Face Detector → Classifier (trained with ~1500 images) → Output: Happy]