2736 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 24, NO. 9, SEPTEMBER 2015
Learning Compact Feature Descriptor and Adaptive Matching Framework for Face Recognition
Zhifeng Li, Senior Member, IEEE, Dihong Gong, Xuelong Li, Fellow, IEEE, and Dacheng Tao, Fellow, IEEE
Abstract— Dense feature extraction is becoming increasingly popular in face recognition tasks. Systems based on this approach have demonstrated impressive performance in a range of challenging scenarios. However, improvements in discriminative power come at a computational cost and with a risk of over-fitting. In this paper, we propose a new approach to dense feature extraction for face recognition, which consists of two steps. First, an encoding scheme is devised that compresses high-dimensional dense features into a compact representation by maximizing the intra-user correlation. Second, we develop an adaptive feature matching algorithm for effective classification. This matching method, in contrast to previous methods, constructs and chooses a small subset of training samples for adaptive matching, resulting in further performance gains. Experiments using several challenging face databases, including the Labeled Faces in the Wild dataset, Morph Album 2, CUHK optical-infrared, and FERET, demonstrate that the proposed approach consistently outperforms the current state of the art.
Index Terms— Face recognition, feature descriptor, LFW.
I. INTRODUCTION
FEATURE representation and feature matching are two key stages in face recognition. The former refers to processes that derive appropriate feature descriptors, such as SIFT [2] and LBP [3], while the latter uses a classification model, usually trained on a separate dataset, to match facial images to particular subjects.
Manuscript received June 29, 2014; revised October 16, 2014 and January 26, 2015; accepted March 22, 2015. Date of publication April 24, 2015; date of current version May 19, 2015. This work was supported in part by the National Natural Science Foundation of China under Grant 61103164 and Grant 61125106, in part by the Natural Science Foundation of Guangdong Province under Grant 2014A030313688, in part by the Key Laboratory of Human-Machine Intelligence-Synergy Systems through the Chinese Academy of Sciences, in part by the Guangdong Innovative Research Team Program under Grant 201001D0104648280, in part by the Key Research Program through the Chinese Academy of Sciences under Grant KGZD-EW-T03, and in part by the Australian Research Council under Project DP-120103730, Project DP-140102164, Project FT-130101457, and Project LP-140100569. The associate editor coordinating the review of this manuscript and approving it for publication was Mr. Pierre-Marc Jodoin.
Z. Li and D. Gong are with the Shenzhen Key Laboratory of Computer Vision and Pattern Recognition, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China (e-mail: [email protected]; [email protected]).
X. Li is with the Center for OPTical IMagery Analysis and Learning (OPTIMAL), School of Computer Science, Northwestern Polytechnical University, Xi’an 710129, Shaanxi, P. R. China (e-mail: [email protected]).
D. Tao is with the Centre for Quantum Computation & Intelligent Systems and the Faculty of Engineering and Information Technology, University of Technology, Sydney, 81 Broadway Street, Ultimo, NSW 2007, Australia (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIP.2015.2426413
In practice, either handcrafted descriptors (such as SIFT [2] and HOG [4]) are used as features, or the features are learned from a training set, such as Fisher Vector Faces [5], LE [6], PS [53], and a common feature descriptor [55]. Dense feature extraction relies on densely sampling features at multiple scales to describe faces. It has recently received increased attention in the face recognition research community due to its expressive power and high discriminative performance compared to traditional methods [1]. However, very high-dimensional facial features (e.g., 100K-dimensional) are computationally expensive to handle and can require substantially larger training datasets to reduce the risk of over-fitting. The primary goal of this work is to develop a new representation that inherits the strong discriminative power of dense feature extraction while maintaining a compact size.
The feature matching models used in practice fall into two categories: discriminative models [7]–[13], [47], [48], [51], [52], [54] and generative models [14]–[16]. In particular, the generative approaches, such as those described in [14]–[16], construct generative models based on a variety of independence assumptions and can be used for classification following Bayesian rules. Discriminative approaches, on the other hand, focus more directly on the classification task and thus can yield superior performance over generative methods; representative methods include [7]–[13], [47], [48], [51], [52], [54]. Discriminative models can be further classified into linear and nonlinear subtypes. Linear methods are often simple and robust but lack the capability to express nonlinear variations, while nonlinear methods overcome this limitation at the cost of more expensive training and a higher risk of over-fitting.
For the purpose of face representation, we propose a compression encoding scheme based on a maximum correlation criterion. This scheme effectively converts high-dimensional dense features into a much more compact representation. Furthermore, we propose a new face matching method, called the ‘Adaptive Matching Framework’, and conduct experiments in four different face recognition scenarios: face recognition in the wild, aging face recognition, matching near-infrared face images against optical face images, and the FERET test. Representative images used in the experiments are shown in Figure 1.
The major contributions of this paper are as follows. (1) An effective compression encoding scheme based on a maximum correlation criterion is proposed. When combined with dense face descriptors, this scheme is able to produce highly discriminant, yet very compact, descriptors. This is supported by thorough experimentation. (2) Based on
1057-7149 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Fig. 1. Sample face images from the (a) LFW dataset, (b) Infrared-Optical Face dataset, and (c) Morph Album 2 dataset.
this feature descriptor, a novel feature matching method, termed the Adaptive Matching Framework, is proposed, which further improves performance. (3) Experiments are conducted to demonstrate that the proposed approach obtains state-of-the-art results in challenging settings. We also demonstrate that the proposed approach generalizes to other large-scale face databases.
II. A BRIEF REVIEW OF PREVIOUS LOCAL FEATURE DESCRIPTORS IN FACE RECOGNITION
Although global appearance-based features have been widely used for face representation, it is now generally agreed that local image descriptors are more effective. Representative local features for face recognition include Gabor wavelets [39], local binary patterns (LBP) [3], and the scale-invariant feature transform (SIFT) [2]. Compared to global appearance features, local features inherently possess spatial locality and orientation selectivity. These properties allow local feature representations to be robust against variations in aging, illumination, and expression. Recently, many improvements and variants of these methods have been proposed to improve face recognition performance. Typical examples include [40]–[45], [55].
III. OUR FEATURE DESCRIPTOR
High-dimensional facial features generally yield high recognition performance [1]. They can be extracted using both dense landmark sampling [17] and multi-scale [18] techniques, as shown in Figure 2. High-dimensional facial features contain much more information than low-dimensional ones, which is important for boosting recognition performance, but this benefit comes at the expense of computational complexity. For example, projecting 100K-dimensional facial features down to 1K dimensions requires 100M floating-point multiplications, even when using a linear projection such as Principal Component Analysis (PCA) [19].
We therefore propose an effective compression encoding method that turns dense features into a compact feature representation while at the same time enhancing its discriminative power. This is achieved by compressing facial features at the local feature level (i.e., local features extracted at sampling landmarks), encoding the long local features into a compact feature vector. In this way, the intra-user correlation is maximized, as illustrated in Figure 3.
A. Learning Compact Feature Descriptor
Suppose we are given a set of local feature pairs $T = \{(t_n^1, t_n^2) \mid n = 1, \ldots, N\}$ from $N$ subjects, extracted at a fiducial landmark. Our aim is then to learn an encoding scheme that turns local features into a compact feature representation whose intra-user correlation is maximized. Mathematically,

$$(w_1^*, w_2^*) = \arg\max_{(w_1, w_2)} \frac{\frac{1}{N}\sum_{n=1}^{N} w_1^T t_n^1 t_n^{2T} w_2}{\sqrt{\frac{1}{N}\sum_{n=1}^{N} w_1^T t_n^1 t_n^{1T} w_1}\,\sqrt{\frac{1}{N}\sum_{n=1}^{N} w_2^T t_n^2 t_n^{2T} w_2}}. \tag{1}$$
Eqn. (1) is a CCA problem that can be solved by fixing the denominators and maximizing the numerator, as follows:

$$\text{Maximize:} \quad \sum_{n=1}^{N} w_1^T t_n^1 t_n^{2T} w_2 \tag{2}$$

subject to: $\sum_{n=1}^{N} w_1^T t_n^1 t_n^{1T} w_1 = 1$ and $\sum_{n=1}^{N} w_2^T t_n^2 t_n^{2T} w_2 = 1$.

Lagrange multipliers are introduced to solve the optimization problem (2), as follows:
$$L(w_1, w_2, \lambda_1, \lambda_2) = w_1^T C_{12} w_2 - \frac{\lambda_1}{2}\left(w_1^T C_{11} w_1 - 1\right) - \frac{\lambda_2}{2}\left(w_2^T C_{22} w_2 - 1\right) \tag{3}$$

where $C_{11} = \sum_{n=1}^{N} t_n^1 t_n^{1T}$, $C_{22} = \sum_{n=1}^{N} t_n^2 t_n^{2T}$, and $C_{12} = \sum_{n=1}^{N} t_n^1 t_n^{2T}$. By taking derivatives w.r.t. $w_1$ and $w_2$, the K.K.T. conditions can be obtained: $C_{12} w_2 = \lambda_1 C_{11} w_1$ and $C_{12}^T w_1 = \lambda_2 C_{22} w_2$.
By multiplying the first condition by $w_1^T$ and the second condition by $w_2^T$, we arrive at $\lambda_1 = \lambda_2 = \lambda$, where

$$\lambda = \frac{\sum_{n=1}^{N} w_1^T t_n^1 t_n^{2T} w_2}{\sqrt{\sum_{n=1}^{N} w_1^T t_n^1 t_n^{1T} w_1}\,\sqrt{\sum_{n=1}^{N} w_2^T t_n^2 t_n^{2T} w_2}}$$

represents the correlation to be maximized in (1). Substituting $\lambda$ for $\lambda_1$ and $\lambda_2$, the K.K.T. conditions can be written compactly as

$$\lambda C_D \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} = C_O \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} \tag{4}$$

where $C_D = \begin{bmatrix} C_{11} & 0 \\ 0 & C_{22} \end{bmatrix}$ and $C_O = \begin{bmatrix} 0 & C_{12} \\ C_{12}^T & 0 \end{bmatrix}$. $(w_1^*, w_2^*)$ can be recovered from the eigenvector of (4) with the largest eigenvalue. The first pair $(w_1^*, w_2^*)$ gives one CCA projection, and more projections can be obtained by taking eigenvectors of (4) in descending order of eigenvalues. In our system, we take the 12 leading eigenvectors, so each local feature is encoded into a 12-dimensional feature vector.
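The encoder learning above can be sketched in a few lines of numpy. This is an illustrative implementation, not the authors' code: the whitening-plus-SVD route is a standard equivalent way of solving the generalized eigenproblem (4), and the regularization term `reg` is an assumption we add for numerical stability.

```python
import numpy as np

def learn_cca_encoder(T1, T2, n_components=12, reg=1e-6):
    """Learn per-landmark CCA projections (W1, W2) maximizing the
    intra-user correlation of paired local features, as in Eq. (1).

    T1, T2: (N, d) arrays of paired local features t_n^1, t_n^2.
    Returns W1, W2 of shape (d, n_components)."""
    N, d = T1.shape
    C11 = T1.T @ T1 + reg * np.eye(d)   # sum_n t_n^1 t_n^1T (regularized)
    C22 = T2.T @ T2 + reg * np.eye(d)
    C12 = T1.T @ T2

    def inv_sqrt(C):
        # Symmetric inverse square root via eigendecomposition.
        vals, vecs = np.linalg.eigh(C)
        return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

    # With M = C11^{-1/2} C12 C22^{-1/2}, the SVD of M gives the
    # canonical directions in the whitened space (equivalent to taking
    # the leading eigenvectors of Eq. (4)).
    C11_is, C22_is = inv_sqrt(C11), inv_sqrt(C22)
    U, s, Vt = np.linalg.svd(C11_is @ C12 @ C22_is)
    W1 = C11_is @ U[:, :n_components]
    W2 = C22_is @ Vt.T[:, :n_components]
    return W1, W2

def encode(t, W):
    """Compress local features (e.g. 128-D SIFT rows) to n_components dims."""
    return t @ W
```

In use, each 128-dimensional SIFT local feature would be compressed to a 12-dimensional vector, and the per-landmark outputs concatenated as in Figure 3.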
B. Multi-Feature Encoding
Face recognition is a complicated task that requires efficient handling of complex variations in facial appearance caused by a range of factors, such as changes in lighting
Fig. 2. An illustration of the multi-scale and dense sampling techniques used for generating high-dimensional facial features. Landmark points (blue) are densely allocated, on which local features (such as HOG or SIFT) are applied. By concatenating the local features from all the landmark points over all the scales, a redundant feature representation for a face image can be obtained.
Fig. 3. (a) The feature compression scheme. At the training stage, a CCA-based encoder is trained for each fiducial landmark by preserving the leading 12 eigenvectors corresponding to the largest eigenvalues. During encoding, the dense features are encoded using the learned projection models, and the resulting outputs are then concatenated to form a compact feature representation. (b) For enhanced feature representation, multiple features (e.g., HOG and LBP) at different scales (e.g., s1, s2, s3) are extracted and compressed, followed by concatenation at the block, scale, and feature levels to create the final representation.
conditions, expression, and aging effects. It is difficult to address all variations using a single descriptor and, therefore, combining multiple features is common in practice. Following this practice, we apply our compression encoding scheme to different face descriptors over multiple scales. Specifically, in our experiments we use three scales with scaling factor √2; i.e., if the cropped image size is 100 × 100, then the three scales correspond to 100 × 100, 71 × 71, and 50 × 50, respectively. The extracted features are then concatenated to form the final feature representation, which remains quite compact and expressive.
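As a quick sanity check, the scale schedule can be computed directly. This is a trivial sketch; only the √2 factor and the three sizes come from the text, the function name and rounding convention are ours.

```python
import math

def pyramid_sizes(base, n_scales=3, factor=math.sqrt(2)):
    """Side lengths of the multi-scale pyramid: each scale shrinks the
    previous one by a factor of sqrt(2), rounded to whole pixels."""
    return [round(base / factor ** i) for i in range(n_scales)]
```

For a 100 × 100 crop this reproduces the three scales listed above (100, 71, and 50 pixels per side).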
C. Discussions
Our compression encoding scheme has advantages over existing schemes in terms of efficiency and efficacy. First, the feature dimension is significantly reduced, and the computational cost falls accordingly. For example, our approach yields a 12-fold reduction in the dimension of SIFT features and a 6-fold reduction for LBP features. Since memory cost is directly proportional to feature length, this implies 12 times less memory for SIFT and 6 times less for LBP. Regarding time complexity, because we compress features at the local feature level (e.g., a 128-dimensional SIFT descriptor) instead of the global feature level (where face features often reach 50K dimensions), our method improves computational efficiency significantly: global dimension reduction methods such as PCA typically have time complexity around O(n³), so reducing n by a factor of 10 makes our method roughly 1,000 times faster than global dimension reduction. In addition, because the dimensionality of our final feature representation is very low, the matching stage is correspondingly fast.
Second, the discriminative information is properly exploited to improve performance. Notably, discriminant information often lies in the mutual information among multiple face images, modeled here by the intra-user correlation; therefore, enhancing the intra-user correlation improves discriminant power. Our experiments show that the compressed facial features provide superior recognition performance compared to their original counterparts.
Finally, our approach is easy to implement and can readily be combined with any existing feature descriptor, leading to a more compact and expressive feature descriptor.
IV. THE ADAPTIVE MATCHING FRAMEWORK
Due to the intrinsically complicated structure of the human face, facial features usually span a complicated, nonlinearly clustered structure that cannot
Fig. 4. The adaptive matching framework. At the training stage (a), a data point (purple) that has not been used for training is randomly sampled. K pairs of data points (yellow) are then incorporated from its nearest neighbors to form a training subset, based on which discriminative subspace analysis is applied. This process is repeated until all the data points have been selected at least once. At the testing stage (b), only a portion of all the sub-classifiers become “active classifiers” (the purple points are at the center of the samples used for training that classifier), selected according to the nearest distances to the testing samples (red point u and green point v represent a verification face pair).
easily be captured using traditional linear projection-based classifiers [10], [20]. A number of approaches have been proposed to address this issue. In [21], the authors reduced the nonlinear effect by slicing long facial features into many shorter local features, on which linear discriminant classifiers are trained. In [10] and [20], kernel-based nonlinear methods were proposed to address this problem, but they are limited by a high risk of over-fitting, especially with small training sets.
In this section, we propose a new method called the Adaptive Matching Framework, which addresses the above problems using a sampling technique. Our approach is based on two observations: 1) difficult samples are better classified using a classification model trained on nearby samples; and 2) samples in a local region of a nonlinear structure are approximately linearly distributed and can therefore be efficiently classified using existing linear classifiers, avoiding the kernel space and its inherently high risk of over-fitting.
The construction of the Adaptive Matching Framework can be divided into two steps. In the first step, a series of training subsets is constructed by repeatedly sampling the training data. In the second step, a series of linear sub-classifiers is trained on these training subsets and combined to form a powerful decision rule.
A. Data Sampling
Here we describe the detailed procedure for constructing training subsets from the entire training dataset. As shown in Figure 4, we require the sampled data points to lie in a local region, such that they have a locally linear property. To achieve this, a sample is first randomly selected from the entire training dataset and then its K-nearest neighbors are incorporated to form a random subset. The crux here is how to identify the neighbors of a specific data point, given only the input-space distances.
Suppose we are given a set of training feature pairs $X = \{(x_i^1, x_i^2) \mid i = 1, \ldots, N\}$, where $(x_i^1, x_i^2)$ is the feature pair from the $i$-th training person. The local structure of the training data points $X$ is first constructed by connecting each data point to its five nearest neighbors measured by Euclidean distance. This can be represented as a weighted graph $G$, with edges added to connect any disjoint components. In the second step, we measure the geodesic distance $d_M(p, q)$ between vertices $p$ and $q$ in $G$ by computing their shortest-path distance, as illustrated in Figure 4.
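The neighbor-graph construction just described can be sketched as follows. This is illustrative code, not the authors'; for brevity it omits the extra edges that join disjoint components, and it works on raw coordinate tuples rather than face features.

```python
import heapq
import math

def build_knn_graph(points, k=5):
    """Connect each point to its k nearest neighbors (Euclidean);
    edges are undirected and weighted by the distance."""
    n = len(points)
    dist = lambda a, b: math.dist(points[a], points[b])
    adj = {i: {} for i in range(n)}
    for i in range(n):
        nbrs = sorted((j for j in range(n) if j != i),
                      key=lambda j: dist(i, j))[:k]
        for j in nbrs:
            w = dist(i, j)
            adj[i][j] = w
            adj[j][i] = w   # keep the graph symmetric
    return adj

def geodesic_distances(adj, src):
    """Shortest-path (geodesic) distances from src via Dijkstra."""
    d = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        du, u = heapq.heappop(pq)
        if du > d.get(u, float("inf")):
            continue   # stale queue entry
        for v, w in adj[u].items():
            if du + w < d.get(v, float("inf")):
                d[v] = du + w
                heapq.heappush(pq, (du + w, v))
    return d
```

On a chain of points, the geodesic distance between the endpoints equals the length along the chain rather than the straight-line distance, which is the property the sampling step relies on.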
During the sampling process, in order to ensure that the entire training dataset is sampled completely and feature pairs from the same subject are sampled at the same time, the random training subsets are constructed as follows:

a) Initialize a sampling indicator $t = [0 \ldots 0] \in R^{1 \times 2N}$.
b) Randomly select an index $k \in [1, 2N]$ such that $t(k) = 0$.
c) Take the M-nearest neighbors of vertex $k$ in $G$ such that the selected faces cover exactly $K$ different subjects.
d) Incorporate data points that are not in the M-nearest neighbors of vertex $k$ but whose corresponding points are (i.e., $x_j^1$ not in the selected neighbor set, but $x_j^2$ in the neighbor set) to form a training subset containing exactly $K$ training feature pairs.
e) Update $t$: $t(j) \leftarrow 1$ for all selected data points $j$. Go to step (b) until $t(k) = 1$ for all $k \in [1, 2N]$.
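A simplified sketch of steps (a)–(e) is given below. It is illustrative only: `geodesic_nn` is a hypothetical helper returning the m geodesically nearest vertices (it would be built from the graph G above), vertices i and i + N are assumed to hold $x_i^1$ and $x_i^2$, and the sketch grows the neighborhood until at least (rather than exactly) K subjects are covered.

```python
import random

def sample_subsets(geodesic_nn, n_pairs, k_subjects, seed=0):
    """Steps (a)-(e): repeatedly pick an unused vertex, grow a geodesic
    neighborhood covering >= k_subjects subjects, and close it under the
    pairing so every subset contains whole (x^1, x^2) feature pairs."""
    rng = random.Random(seed)
    two_n = 2 * n_pairs
    used = [False] * two_n                        # sampling indicator t
    subsets = []
    while not all(used):
        k = rng.choice([i for i in range(two_n) if not used[i]])  # (b)
        members, subjects, m = set(), set(), 1
        while len(subjects) < k_subjects:         # (c) grow neighborhood
            members = set(geodesic_nn(k, m))
            subjects = {v % n_pairs for v in members}
            m += 1
        # (d) add the missing half of any partially covered pair
        members |= {(v + n_pairs) % two_n for v in members}
        for v in members:                         # (e) mark as sampled
            used[v] = True
        subsets.append(sorted(members))
    return subsets
```

Because each subset is closed under the pairing in step (d), every subset contains whole feature pairs, and the outer loop guarantees every training point is sampled at least once.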
B. Learning Local Linear Sub-Classifiers
Using the proposed sampling technique, we obtain an ensemble of training subsets. Denote the $j$-th training subset as $X_j = \{(x_p^1, x_p^2) \mid p = 1, \ldots, K\}$. We then apply the local feature-based discriminant analysis approach [21] on $X_j$. The details of the algorithm are as follows:

a) Divide the long features into several slices of equal length. The number of slices is set to 15 in our system.
b) For the $q$-th sliced features, use the unified subspace analysis method [22] to train a linear model, whose projection matrix and mean vector are denoted
as $P_{jq}$ and $M_{jq}$, respectively. Here the subscript $jq$ indicates the local linear model trained on the $q$-th slice of the $j$-th training subset.
C. Adaptive Face Matching
This paper considers both face identification and verification tasks. For face identification, a gallery of faces and a probe face are given, and the task is to identify the matching face in the gallery. In our system, the gallery face with the highest matching score to the probe is returned as the best match. The matching score between probe face $p$ and gallery face $g$ is computed as follows:
a) Extract the compressed facial representations using the proposed method given in Section III.
b) Adaptively select a set of sub-classifiers (adapted to each probe sample) trained as in Sections IV.A and IV.B, and denote the index set of the selected classifiers as $Y$. Sub-classifiers trained on samples near the probe face tend to be selected, so that the probe face is better classified. Based on this rationale, we select the portion of sub-classifiers whose mean of training samples has the minimum distance to the probe face.
c) Compute the matching score using a simple summation rule:

$$\text{score} = \sum_{j \in Y} \sum_{q=1}^{15} \frac{f_{p_{jq}}^T f_{g_{jq}}}{\left\| f_{p_{jq}} \right\| \left\| f_{g_{jq}} \right\|}$$

where $f_{p_{jq}}$ is the subspace representation of the probe face using the $q$-th slice projection of the $j$-th selected classifier, and similarly for $f_{g_{jq}}$. In the context of face verification, $f_{p_{jq}}$ and $f_{g_{jq}}$ represent the pair of faces to be verified.
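The score computation can be sketched as below. The code is illustrative; the data layout, with projected slice features stored per selected sub-classifier, is our assumption rather than the paper's.

```python
import numpy as np

def matching_score(probe_slices, gallery_slices, active_idx):
    """Sum of cosine similarities over the active sub-classifiers (j in Y)
    and the feature slices q, as in the score formula above.

    probe_slices[j][q] and gallery_slices[j][q] hold the projected slice
    features f_{p,jq} and f_{g,jq} as 1-D numpy arrays."""
    score = 0.0
    for j in active_idx:
        for fp, fg in zip(probe_slices[j], gallery_slices[j]):
            score += float(fp @ fg) / (np.linalg.norm(fp) * np.linalg.norm(fg))
    return score
```

Each term is a cosine similarity, so a probe compared against itself contributes exactly 1.0 per slice per active sub-classifier.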
The objective of face verification is to determine whether a given pair of faces comes from the same person. We first compute the matching score of the given face pair using the method described above (the only difference is that in step (b) we compute the distances from the mean to both faces rather than only the probe face), and then compare the score with a threshold $b$. The face pair is declared a match if the score is greater than $b$; otherwise, it is a non-match. The threshold $b$ is determined by maximizing the verification accuracy on the training dataset.
D. Discussion
The proposed adaptive matching method is a simple yet effective method for general face recognition tasks. Its merits are two-fold. First, by combining locally linear sub-classifiers, our approach can account for nonlinear variations without resorting to kernel tricks, which are computationally expensive for large datasets, difficult to tune, and prone to over-fitting. Second, by selecting a small subset of training samples for adaptive face matching, we can improve accuracy, since samples falling in the vicinity of a given testing sample usually better reflect its statistical characteristics. The experimental results presented in the next section suggest that the proposed method consistently outperforms other state-of-the-art methods in several face recognition scenarios.
V. EXPERIMENTS
We demonstrate the effectiveness of our approach in four popular face recognition scenarios: matching faces in the wild, cross-age face recognition, near-infrared face recognition, and the standard face recognition task. Large-scale experimental validations are conducted on several large and challenging face databases: the Labeled Faces in the Wild (LFW) dataset [23], the Morph Album 2 database (the largest public-domain face aging database) [24], the CUHK Infrared-Optical dataset, and the FERET dataset [49].
A. Datasets
The LFW Dataset [23]: The LFW database contains 13,233 images from 5,749 different subjects, with each face labeled with the name of the person pictured. The number of images per subject varies from 1 to 530. All the images are collected from the Internet and have large intra-personal variability. In our experiments, we followed the image-restricted protocol, with no outside training data used. For evaluation, the dataset was randomly divided into 10 disjoint splits, each containing 300 pairs of matching images and 300 pairs of unmatched images (the matching and unmatched pairs are provided by the LFW dataset for benchmarking). Samples from this dataset are shown in Figure 1(a).
CUHK Infrared-Optical Dataset: The CUHK Infrared-Optical face dataset consists of both optical and infrared photos from 2,800 different people, each having one optical photo and one corresponding infrared photo. In our experiments, we divided the dataset into two halves: 1,400 people for training and 1,400 people for testing. Samples from this dataset are shown in Figure 1(b).
Morph Album 2 Dataset: The Morph Album 2 dataset [24] is the largest face aging dataset available in the public domain. It is composed of about 78,000 face images from 20,000 different subjects captured at different ages, ranging from 16 to 77. Samples from this dataset are shown in Figure 1(c).
FERET Dataset: The FERET dataset is a standard face database [49] with several subsets. The Fa subset (gallery set) has frontal images of 1,196 persons. The Fb subset has 1,195 face images with expression variations. The Fc subset has 194 face images with lighting variations. Dup 1 has 722 face images with an age gap to the gallery set. Dup 2 is a subset of Dup 1, containing 234 face images. There is a standard training set for the FERET database, from which we selected 400 persons, each with two face images, as the training data.
B. Preprocessing and Parameter Settings
Prior to feature representation, the face images were pre-processed as shown in Table I.

Several feature descriptors were used as benchmarks; their configurations are shown in Table II.
The parameters used for the Adaptive Matching Framework are shown in Table III. K denotes the size of the training subsets (K nearest neighbors) described in Section IV.A,
TABLE I
IMAGE PREPROCESSING
TABLE II
FEATURE DESCRIPTOR CONFIGURATIONS
TABLE III
THE PARAMETERS IN THE ADAPTIVE MATCHING FRAMEWORK
and ‘active sub-classifiers’ denotes the percentage of sub-classifiers selected to compute the matching scores described in Section IV.C. Note that the parameters K and b (see Section IV.C) were determined by experimental validation using 30% of the training data. Specifically, we increased the value of K gradually, and for each K we determined the b value that achieved the highest verification accuracy on the validation data; the pair (K, b) giving the best overall accuracy was then selected. In this study, K is set to 600 and b is set to 2.865.
C. Matching Faces in the Wild
In this experiment, we studied the verification performance of our approach on the LFW dataset using the image-restricted protocol (the most restricted protocol). The face images were first pre-processed as described in Section V.B, and the performance of the proposed approach was then measured by 10-fold cross-validation. Each fold was evaluated strictly independently, and the averaged results are reported.
In this experiment, the HOG and SIFT features were adopted (see configurations in Section V.B), which were
Fig. 5. Verification performance comparison on the LFW dataset under the restricted setting.
TABLE IV
VERIFICATION ACCURACY COMPARISON ON THE LFW DATASET UNDER THE RESTRICTED SETTING
encoded into a compact feature representation using the encoding scheme described in Section III.
In Figure 5, we plot the ROC curves of our method and the state-of-the-art methods. Our method obtains a state-of-the-art result on the LFW database.
Note that a very recent work [50] has achieved the best verification accuracy to date on the LFW database under the restricted setting. We also compare the verification accuracy of our approach against the state-of-the-art methods (including the most recent work [50]). The verification accuracy is the averaged accuracy over 10-fold cross-validation, each fold with 300 positive pairs and 300 negative pairs for testing. The comparative results are reported in Table IV. Our approach outperforms most of the existing methods. Notably, the proposed method does not rely on very high-dimensional facial features, which is important for computational efficiency: the dimensionality of our features is approximately 20K, compared to the facial features presented in [5], which have approximately 332,800 dimensions per face image (67,584 after PCA dimension reduction). In addition, our approach is far more computationally efficient than the top-performing method (MRF-Fusion-CSKDA [50]).
TABLE V
COMPARISON OF THE TESTING TIME OF OUR APPROACH WITH MRF-FUSION-CSKDA [50] ON THE SAME COMPUTER
To verify this point, we conducted an experiment to compare the testing time of our approach against the MRF-Fusion-CSKDA method [50] on the same computer; the results are shown in Table V. It is encouraging to see that our approach performs much more efficiently than [50].
D. Cross-Age Face Recognition
The Morph Album 2 dataset is the largest publicly available face aging dataset, and large-scale experimental validation is necessary when evaluating our approach on it. Following the training/testing split presented in [33], all 20,000 persons in the dataset were used in this experiment. We partitioned the Morph Album 2 dataset into a training set and an independent test set: the training data consisted of 20,000 face images from 10,000 subjects, with each subject having the two images representing the largest age gap, and the test data were composed of gallery and probe sets from the remaining 10,000 subjects. The gallery set was composed of 10,000 face images corresponding to the youngest age of these 10,000 subjects, while the probe set was composed of 10,000 face images corresponding to the oldest age of these subjects.
For this experiment, the HOG and SIFT features were adopted and then compressed into a compact feature representation using the encoding scheme described in Section III.
We compare our approach to several state-of-the-art methods for age-invariant face recognition, including: (i) FaceVACS, a leading commercial face recognition engine [26]; (ii) several newly developed generative methods [15], [16] for face aging; (iii) several newly developed discriminative methods [33]–[35] for direct age-invariant face recognition; and (iv) a deep learning model [46]. Comparative results are reported in Table VI. All the methods presented in Table VI were tuned to their best settings according to their respective papers. For the deep learning model [46] (also named DeepID2), we use the same training data (the CelebFace+ database) as [46], because the MORPH training dataset (in which each person has only two face images) is not suitable for learning the deep model; this explains why the current recognition performance of the deep learning model [46] on the MORPH test set is not very good.
It is encouraging to see that the proposed approach outperforms the other methods in Table VI, demonstrating the good generalization ability of our approach.
E. Near-Infrared Face Recognition
The objective of infrared-based face recognition systems is to match probe face images taken with infrared devices against a gallery of face images taken with optical devices. This is an important
TABLE VI
RANK-1 IDENTIFICATION COMPARISON ON THE MORPH ALBUM 2
TABLE VII
RANK-1 IDENTIFICATION COMPARISON ON THE CUHK INFRARED-OPTICAL FACE DATASET
heterogeneous face recognition application (also known as cross-modality face recognition), the most challenging aspect of which is that face images (of the same person) taken using different devices may be mismatched due to large discrepancies between the modalities (optical and infrared); this is referred to as the modality gap. As shown in Figure 1, the infrared photos are usually blurred, of low contrast, and have significantly different gray-level distributions compared to the optical photos.
In this experiment, MLBP and SIFT features were adopted and then compressed into a compact feature representation using our compression encoding scheme.
We compare our approach to the following state-of-the-art algorithms for heterogeneous face recognition: Coupled-Information Tree Encoding (CITE) [36], Randomized LDA (RLDA) [37], Local Feature-based Discriminant Analysis (LFDA) [21], Coupled Discriminant Analysis (CDA) [38], and the method in [44]. These algorithms were tuned to their optimal settings according to their respective papers. The comparison results are shown in Table VII.
The proposed approach outperforms the other methods in Table VII. It should be noted that most of the other methods presented in Table VII are designed specifically for heterogeneous face recognition, and as a consequence are carefully tuned for the heterogeneous face recognition problem. In contrast, our approach is a general face recognition framework. In spite of this, our approach delivers impressive results, confirming its effectiveness and generalizability.
F. FERET Experiment
In this experiment, we compare our approach against state-of-the-art feature-based methods on the FERET database.
TABLE VIII
RANK-1 IDENTIFICATION COMPARISON ON THE FERET DATASET (Fb)
TABLE IX
RANK-1 IDENTIFICATION COMPARISON ON THE FERET
DATASET (Fc, DUP1, AND DUP 2)
TABLE X
COMPARISON OF OUR COMPACT FEATURE DESCRIPTOR
WITH THE STATE-OF-THE-ART
In this study, we use the Fa subset as the gallery set and the largest probe subset (Fb) for performance evaluation. The comparative results are reported in Table VIII. From the results, we can see that our approach achieves a result comparable to the top-performing method [40] on the largest probe subset (Fb).
To better evaluate the performance of our approach on FERET, we further compare our approach against the top-performing method [40] on the other subsets (Fc, Dup1, and Dup2). The comparative results are reported in Table IX. Our approach slightly outperforms the method in [40].
G. Additional Experiment
Note that our approach has two key components: the compact feature descriptor and the adaptive matching framework. It is therefore desirable to explore the effectiveness of each component. We compared our compact feature representation (compressing the original features into a compact and discriminant feature vector) to the original features obtained by the state-of-the-art feature descriptors, and the results are presented in Table X. To ensure a fair comparison, we used the same matching method (the proposed Adaptive Matching Framework; see Section IV.C). Our compact feature representation achieves superior performance, demonstrating the advantages of our compact feature descriptor.
We then compared our adaptive matching framework with its non-adaptive version in Table XI. Unlike the adaptive
TABLE XI
PERFORMANCE EVALUATION OF THE PROPOSED
ADAPTIVE MATCHING FRAMEWORK
matching framework, its non-adaptive version uses all the training samples to learn the subspace model for classification. Our adaptive matching framework consistently outperforms its non-adaptive version across the three datasets, demonstrating the effectiveness of the proposed adaptive scheme.
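The adaptive idea can be sketched in miniature as follows: instead of learning the subspace model from all training samples (the non-adaptive version), fit it only on a small subset of samples nearest the probe. In this hypothetical sketch, PCA stands in for the paper's actual discriminative subspace model, and all descriptors are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy compact descriptors: a training pool, one probe, and a small gallery.
train = rng.standard_normal((300, 64))
probe = rng.standard_normal(64)
gallery = rng.standard_normal((10, 64))

def adaptive_match(probe, gallery, train, k=50, out_dim=16):
    """Adaptive matching in miniature: fit the subspace only on the k
    training samples nearest the probe (PCA stands in for the paper's
    discriminative subspace model)."""
    # 1. Select the k training samples closest to the probe.
    dists = np.linalg.norm(train - probe, axis=1)
    subset = train[np.argsort(dists)[:k]]
    # 2. Learn a local subspace from that subset only.
    mean = subset.mean(axis=0)
    _, _, Vt = np.linalg.svd(subset - mean, full_matrices=False)
    W = Vt[:out_dim].T
    # 3. Project probe and gallery, return the nearest gallery index.
    p = (probe - mean) @ W
    G = (gallery - mean) @ W
    return int(np.argmin(np.linalg.norm(G - p, axis=1)))

best = adaptive_match(probe, gallery, train)
print(best)  # index of the best-matching gallery identity
```

The non-adaptive baseline corresponds to setting `k = len(train)`, i.e. one global subspace shared by all probes.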
VI. CONCLUSION
In this paper, we proposed a highly compact and discriminant feature descriptor and the Adaptive Matching Framework for enhanced face recognition. The merits of the proposed new approach are: (i) it is easy to implement; (ii) it significantly reduces the dimension of the face feature representation, improving computational efficiency without sacrificing recognition performance; (iii) it achieves superior performance to current state-of-the-art methods across different face recognition scenarios; and (iv) it is a general approach that can be easily incorporated into many existing methods to further boost performance.
REFERENCES
[1] D. Chen, X. Cao, F. Wen, and J. Sun, “Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 3025–3032.
[2] D. G. Lowe, “Object recognition from local scale-invariant features,” in Proc. IEEE Int. Conf. Comput. Vis., Sep. 1999, pp. 1150–1157.
[3] T. Ahonen, A. Hadid, and M. Pietikäinen, “Face recognition with local binary patterns,” in Proc. ECCV, 2004, pp. 469–481.
[4] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2005, pp. 886–893.
[5] K. Simonyan, O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Fisher vector faces in the wild,” in Proc. BMVC, 2013, pp. 8.1–8.12.
[6] Z. Cao, Q. Yin, X. Tang, and J. Sun, “Face recognition with learning-based descriptor,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 2707–2714.
[7] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, “Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 7, pp. 711–720, Jul. 1997.
[8] B.-K. Bao, G. Liu, R. Hong, S. Yan, and C. Xu, “General subspace learning with corrupted training data via graph embedding,” IEEE Trans. Image Process., vol. 22, no. 11, pp. 4380–4393, Nov. 2013.
[9] Z. Lai, Y. Xu, J. Yang, J. Tang, and D. Zhang, “Sparse tensor discriminant analysis,” IEEE Trans. Image Process., vol. 22, no. 10, pp. 3904–3915, Oct. 2013.
[10] J. Lu, K. N. Plataniotis, and A. N. Venetsanopoulos, “Face recognition using kernel direct discriminant analysis algorithms,” IEEE Trans. Neural Netw., vol. 14, no. 1, pp. 117–126, Jan. 2003.
[11] M.-H. Yang, “Kernel eigenfaces vs. kernel Fisherfaces: Face recognition using kernel methods,” in Proc. 5th Int. Conf. Face Gesture Recognit., May 2002, pp. 215–220.
[12] P. Li, Y. Fu, U. Mohammed, J. H. Elder, and S. J. D. Prince, “Probabilistic models for inference about identity,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 1, pp. 144–157, Jan. 2012.
[13] R. Singh, M. Vatsa, A. Ross, and A. Noore, “Online learning in biometrics: A case study in face classifier update,” in Proc. BTAS, Sep. 2009, pp. 1–6.
[14] F. Cardinaux, C. Sanderson, and S. Bengio, “Face verification using adapted generative models,” in Proc. IEEE Int. Conf. Autom. Face Gesture Recognit., May 2004, pp. 825–830.
[15] U. Park, Y. Tong, and A. K. Jain, “Age-invariant face recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 5, pp. 947–954, May 2010.
[16] J.-X. Du, C.-M. Zhai, and Y.-Q. Ye, “Face aging simulation based on NMF algorithm with sparseness constraints,” Neurocomputing, vol. 116, no. 20, pp. 250–259, 2012.
[17] C. Ma, X. Yang, C. Zhang, X. Ruan, and M.-H. Yang, “Sketch retrieval via dense stroke features,” in Proc. BMVC, 2013, pp. 65.1–65.11.
[18] T. Mäenpää and M. Pietikäinen, “Multi-scale binary patterns for texture analysis,” in Proc. SCIA, 2003, pp. 885–892.
[19] M. A. Turk and A. P. Pentland, “Face recognition using eigenfaces,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 1991, pp. 586–591.
[20] C. Liu, “Gabor-based kernel PCA with fractional power polynomial models for face recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 5, pp. 572–581, May 2004.
[21] B. F. Klare, Z. Li, and A. K. Jain, “Matching forensic sketches to mug shot photos,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 3, pp. 639–646, Mar. 2011.
[22] X. Wang and X. Tang, “A unified framework for subspace face recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 9, pp. 1222–1228, Sep. 2004.
[23] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, “Labeled faces in the wild: A database for studying face recognition in unconstrained environments,” Univ. Massachusetts Amherst, Amherst, MA, USA, Tech. Rep. 07-49, Oct. 2007.
[24] K. Ricanek and T. Tesafaye, “MORPH: A longitudinal image database of normal adult age-progression,” in Proc. 7th Int. Conf. Autom. Face Gesture Recognit., Apr. 2006, pp. 341–345.
[25] G. Huang, M. Mattar, H. Lee, and E. G. Learned-Miller, “Learning to align from scratch,” in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 773–781.
[26] FaceVACS Software Developer Kit, Cognitec Systems GmbH. [Online]. Available: http://www.cognitec.com/
[27] J.-G. Wang, J. Li, W.-Y. Yau, and E. Sung, “Boosting dense SIFT descriptors and shape contexts of face images for gender recognition,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2010, pp. 96–102.
[28] E. Nowak and F. Jurie, “Learning visual similarity measures for comparing never seen objects,” in Proc. Comput. Vis. Pattern Recognit., Jun. 2007, pp. 1–8.
[29] L. Wolf, T. Hassner, and Y. Taigman, “Descriptor based methods in the wild,” in Proc. ECCV Workshop Faces ‘Real-Life’ Images, Detect., Alignment, Recognit., 2008.
[30] N. Pinto, J. J. DiCarlo, and D. D. Cox, “How far can you get with a modern face recognition test set using only simple features?” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 2591–2598.
[31] H. Li, G. Hua, Z. Lin, J. Brandt, and J. Yang, “Probabilistic elastic matching for pose variant face verification,” in Proc. CVPR, Jun. 2013, pp. 3499–3506.
[32] S. R. Arashloo and J. Kittler, “Efficient processing of MRFs for unconstrained-pose face recognition,” in Proc. BTAS, Sep./Oct. 2013, pp. 1–8.
[33] Z. Li, U. Park, and A. K. Jain, “A discriminative model for age invariant face recognition,” IEEE Trans. Inf. Forensics Security, vol. 6, no. 3, pp. 1028–1037, Sep. 2011.
[34] B. Klare and A. K. Jain, “Face recognition across time lapse: On learning feature subspaces,” in Proc. Int. Joint Conf. Biometrics (IJCB), Oct. 2011, pp. 1–8.
[35] C. Otto, H. Han, and A. Jain, “How does aging affect facial components?” in Proc. ECCV Workshop, 2012, pp. 189–198.
[36] W. Zhang, X. Wang, and X. Tang, “Coupled information-theoretic encoding for face photo-sketch recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2011, pp. 513–520.
[37] B. Klare and A. K. Jain, “Heterogeneous face recognition: Matching NIR to visible light images,” in Proc. 20th Int. Conf. Pattern Recognit. (ICPR), Aug. 2010, pp. 1513–1516.
[38] Z. Lei, S. Liao, A. K. Jain, and S. Z. Li, “Coupled discriminant analysis for heterogeneous face recognition,” IEEE Trans. Inf. Forensics Security, vol. 7, no. 6, pp. 1707–1716, Dec. 2012.
[39] Á. Serrano, I. M. de Diego, C. Conde, and E. Cabello, “Recent advances in face biometrics with Gabor wavelets: A review,” Pattern Recognit. Lett., vol. 31, no. 5, pp. 372–381, Apr. 2010.
[40] Z. Chai, Z. Sun, H. Méndez-Vázquez, R. He, and T. Tan, “Gabor ordinal measures for face recognition,” IEEE Trans. Inf. Forensics Security, vol. 9, no. 1, pp. 14–26, Jan. 2014.
[41] Z. Chai, H. Méndez-Vázquez, R. He, Z. Sun, and T. Tan, “Semantic pixel sets based local binary patterns for face recognition,” in Proc. ACCV, 2013, pp. 639–651.
[42] Z. Lei, R. Chu, R. He, S. Liao, and S. Z. Li, “Face recognition by discriminant analysis with Gabor tensor representation,” in Proc. ICB, 2007, pp. 87–95.
[43] S. Liao, Z. Lei, D. Yi, and S. Z. Li, “A benchmark study of large-scale unconstrained face recognition,” in Proc. IJCB, Sep./Oct. 2014, pp. 1–8.
[44] Z. Lei, M. Pietikäinen, and S. Z. Li, “Learning discriminant face descriptor,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 2, pp. 289–302, Feb. 2014.
[45] B. Fan, Q. Kong, T. Trzcinski, Z. Wang, C. Pan, and P. Fua, “Receptive fields selection for binary feature description,” IEEE Trans. Image Process., vol. 23, no. 6, pp. 2583–2595, Jun. 2014.
[46] Y. Sun, X. Wang, and X. Tang, “Deep learning face representation by joint identification-verification,” in Proc. NIPS, 2014, pp. 1988–1996.
[47] T.-K. Kim, J. Kittler, and R. Cipolla, “Discriminative learning and recognition of image set classes using canonical correlations,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 6, pp. 1005–1018, Jun. 2007.
[48] N. Guan, X. Zhang, Z. Luo, and L. Lan, “Sparse representation based discriminative canonical correlation analysis for face recognition,” in Proc. 11th Int. Conf. Mach. Learn. Appl. (ICMLA), vol. 1, Dec. 2012, pp. 51–56.
[49] P. J. Phillips, H. Moon, S. A. Rizvi, and P. J. Rauss, “The FERET evaluation methodology for face-recognition algorithms,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 10, pp. 1090–1104, Oct. 2000.
[50] S. R. Arashloo and J. Kittler, “Class-specific kernel fusion of multiple descriptors for face verification using multiscale binarised statistical image features,” IEEE Trans. Inf. Forensics Security, vol. 9, no. 12, pp. 2100–2109, Dec. 2014.
[51] Z. Li, D. Lin, and X. Tang, “Nonparametric discriminant analysis for face recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 4, pp. 755–761, Apr. 2009.
[52] X. Li, S. Lin, S. Yan, and D. Xu, “Discriminant locally linear embedding with high-order tensor data,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 2, pp. 342–352, Apr. 2008.
[53] Y. Qiao, W. Wang, N. Minematsu, J. Liu, X. Tang, and M. Takeda, “A theory of phase singularities for image representation and its applications to object tracking and image matching,” IEEE Trans. Image Process., vol. 18, no. 10, pp. 2153–2166, Oct. 2009.
[54] B. Zhang and Y. Qiao, “Face recognition based on gradient Gabor feature and efficient kernel Fisher analysis,” Neural Comput. Appl., vol. 10, no. 4, pp. 617–623, 2010.
[55] Z. Li, D. Gong, Y. Qiao, and D. Tao, “Common feature discriminant analysis for matching infrared face images to optical face images,” IEEE Trans. Image Process., vol. 23, no. 6, pp. 2436–2445, Jun. 2014.
Zhifeng Li (M’06–SM’11) received the Ph.D. degree from the Chinese University of Hong Kong in 2006. After that, he was a Post-Doctoral Fellow with the Chinese University of Hong Kong and Michigan State University for several years. He is currently an Associate Professor with the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences. His research interests include computer vision, pattern recognition, and multimodal biometrics. He serves as a reviewer for a number of major journals (e.g., the IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, the International Journal of Computer Vision, the IEEE TRANSACTIONS ON IMAGE PROCESSING, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, and the IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY).
Dihong Gong received the B.E. degree in electrical engineering from the University of Science and Technology of China in 2011. He is currently an intern student with the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences. His current research interests include machine learning methods for face recognition, image retrieval, and image-based human age estimation.
Xuelong Li (M’02–SM’07–F’12) is a Full Professor with the Center for OPTical IMagery Analysis and Learning (OPTIMAL), School of Computer Science, Northwestern Polytechnical University, Xi’an 710129, Shaanxi, P. R. China.
Dacheng Tao (F’15) is a Professor of Computer Science with the Centre for Quantum Computation & Intelligent Systems and the Faculty of Engineering and Information Technology at the University of Technology, Sydney. He mainly applies statistics and mathematics to data analytics, and his research interests spread across computer vision, data science, image processing, machine learning, neural networks, and video surveillance. His research results are expounded in one monograph and 100+ publications in prestigious journals and at prominent conferences, such as IEEE T-PAMI, T-NNLS, T-IP, JMLR, IJCV, NIPS, ICML, CVPR, ICCV, ECCV, AISTATS, ICDM, and ACM SIGKDD, with several best paper awards, such as the Best Theory/Algorithm Paper Runner-Up Award at IEEE ICDM’07, the Best Student Paper Award at IEEE ICDM’13, and the 2014 ICDM 10-Year Highest-Impact Paper Award.