Classifier combination applied to real time face detection …jordi/AVR04.pdffor face detection and...



Classifier combination applied to real time face detection and classification

David Masip¹, Marco Bressan², Jordi Vitrià¹

¹Centre de Visió per Computador (CVC), Dept. Informàtica, Universitat Autònoma de Barcelona, Bellaterra, Spain, e-mail: (davidm, jordi)@cvc.uab.es

²Visual Century Research SL, Barcelona, Spain, e-mail: [email protected]

Abstract—In this paper we propose a complete scheme for face detection and classification. We use a Bayesian classifier for face detection and a nearest neighbor approach for face classification. To improve the performance of the classifier, a feature extraction algorithm based on a modified nonparametric discriminant analysis has also been implemented. The complete scheme has been tested in a real-time environment with encouraging results. We also present a new boosting scheme based on adapting the features to the misclassified examples, which also achieves interesting results.

Index Terms—face detection, face recognition, boosting, feature extraction

I. INTRODUCTION

As computers become faster, new applications dealing with human faces become possible. Examples of these applications are face recognition applied to surveillance systems, gesture analysis applied to user-friendly interfaces, and gender recognition applied to reactive marketing. We propose here a global face recognition framework which has achieved good results in an uncontrolled environment. Working under uncontrolled conditions is usually one of the hardest problems in computer vision, for example in applications where the illumination changes strongly or where we have to deal with objects under unpredictable movements. We have tested the face recognition scheme proposed in this paper in a real environment, with no restrictions on pose and illumination, achieving satisfactory real-time performance.

Our system is divided into two steps: the face detection block and the face recognition block, each trained separately. Recent publications dealing with face detection have shown that one of the best solutions in terms of accuracy and performance is the boosting scheme presented by Viola and Jones [1]. Our solution uses a similar boosting scheme, but we propose naive Bayes as the classifier.

Our recognition scheme, on the other hand, uses the nearest neighbor classifier combined with a suitable feature extraction. To perform this feature extraction, different dimensionality reduction algorithms have been analysed.

Dimensionality reduction techniques applied to feature extraction bring important advantages in some pattern recognition tasks. Usually we achieve a compression of the input data, reducing the storage needs. In other cases there is also an improvement of the classification results due to the reduction of the noise present in most natural images. Principal Component Analysis (PCA) is perhaps one of the most widespread dimensionality reduction techniques ([2] and [3]). The goal of PCA is to find a linear projection that preserves the maximum amount of input data variance. Another approach is to take into account the labels of the input data points and to look for the linear projection that best discriminates between the different classes. Fisher Linear Discriminant analysis (FLD) is an example of this kind of technique [4]. Nevertheless, the algorithm has some limitations: the dimensionality of the resulting space after the projection is upper bounded by the number of classes, and there is also a Gaussian assumption on the input data. In this paper we introduce the nonparametric discriminant analysis (NDA) algorithm [5], which overcomes these drawbacks, together with a modification of the original NDA algorithm.

In the next section we introduce the classifier combination techniques and the AdaBoost algorithm. We also suggest a modification of the algorithm that introduces the feature extraction into the boosting scheme. Then we focus on the classification engine and present the discriminant analysis used for feature extraction. We conclude the paper with a complete description of the experiments performed and the final conclusions.

II. CLASSIFIER COMBINATION

Sometimes it is possible to improve the performance of a poor classifier by combining multiple instances of it in a more powerful decision rule. In the literature we can find three different approaches: boosting [6], bagging [7], and random subspace methods (RSM) [8]. The difference between these algorithms is the way they combine the classifiers.

The main idea of random subspace methods is to use only a subset of the features of each vector. Different classifiers are created using random subspaces obtained by sampling the original feature space. The final decision rule is usually a majority vote among the different classifiers. RSM is especially useful when dealing with high dimensional data and small training sets.

Bagging also combines the results of different classifiers, but it samples the training set instead of the feature space. A new classifier is trained using each subset of the training vectors; every test vector is then classified using all the classifiers, and a final decision rule is applied (weighted majority voting, etc.). Bagging is often useful when there are outliers in the training set, because they are isolated in some subsets and do not alter the global results.

The idea behind boosting is similar to bagging, but the classifiers are trained sequentially. At each training step a new classifier is created. The training samples are classified using it, and a set of weights is modified according to the classification results (the weights of the misclassified examples are increased). Then a new classifier is trained taking these weights into account. The algorithm is iterated for a certain number of steps, and the final decision rule is a weighted combination of the intermediate classifiers.
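The difference between sampling the training set (bagging) and sampling the feature space (RSM) can be illustrated with a minimal sketch. The function names, the bootstrap size (as many rows as the training set, drawn with replacement) and the tie-breaking rule of the vote are assumptions made for this illustration, not details given in the paper.

```python
import numpy as np

def bagging_indices(n_samples, n_classifiers, rng):
    """Bagging: each classifier is trained on rows drawn with replacement."""
    return [rng.integers(0, n_samples, size=n_samples)
            for _ in range(n_classifiers)]

def random_subspace_indices(n_features, subspace_dim, n_classifiers, rng):
    """RSM: each classifier is trained on a random subset of feature columns."""
    return [rng.choice(n_features, size=subspace_dim, replace=False)
            for _ in range(n_classifiers)]

def majority_vote(predictions):
    """Combine labels in {-1, +1}.

    predictions has one row per classifier and one column per test vector.
    Ties are resolved as +1 (an arbitrary choice for this sketch).
    """
    return np.where(np.sum(predictions, axis=0) >= 0, 1, -1)
```

A classifier trained with `bagging_indices` sees all features of a subset of samples, whereas one trained with `random_subspace_indices` sees all samples restricted to a subset of features; in both cases the individual outputs are merged by the vote.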

A. AdaBoost

In this work we have chosen a boosting scheme, a variant of the AdaBoost algorithm [6] (see Table I). Several variants of the original algorithm can be found in the literature, depending on how the weights are used in the learning process. One possible approach is to use a classifier that accepts weights as inputs, or simply to multiply the training vectors by their weights. Another approach is to use the weights to resample the training set, in such a way that misclassified examples have a higher probability of being chosen for the next step. We have used this resampling scheme in our experiments.
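The resampling variant summarized in Table I can be sketched as follows. This is a minimal illustration under stated assumptions: the weak learner `train_stump` (a single-feature threshold classifier) is a hypothetical stand-in for the classifiers used in the paper, the "restart" step is implemented by resetting the weights and discarding the classifiers found so far, and `max_rounds` is a safety bound added for the sketch.

```python
import numpy as np

def train_stump(X, y):
    """Hypothetical weak learner: best single-feature threshold classifier."""
    best = (0, 0.0, 1, np.inf)  # (feature, threshold, polarity, error)
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            for pol in (1, -1):
                pred = np.where(pol * (X[:, f] - t) > 0, 1, -1)
                err = np.mean(pred != y)
                if err < best[3]:
                    best = (f, t, pol, err)
    f, t, pol, _ = best
    return lambda Z: np.where(pol * (Z[:, f] - t) > 0, 1, -1)

def adaboost_resample(X, y, steps=10, seed=0, max_rounds=1000):
    """AdaBoost with weight-driven resampling; labels y are in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n = len(y)
    w = np.ones(n)                                   # step 1: unit weights
    classifiers, zetas = [], []
    for _ in range(max_rounds):
        if len(classifiers) == steps:
            break
        idx = rng.choice(n, size=n, p=w / w.sum())   # resample by weight
        clf = train_stump(X[idx], y[idx])
        xi = (clf(X) != y).astype(float)             # 1 if misclassified
        eps = np.sum(w * xi) / n                     # weighted error, eq. (1)
        if eps >= 0.5:                               # "restart the algorithm"
            w = np.ones(n)
            classifiers, zetas = [], []
            continue
        eps = max(eps, 1e-10)                        # avoid log of infinity
        zeta = 0.5 * np.log((1 - eps) / eps)         # eq. (2)
        w = w * np.exp(zeta * xi)                    # raise weights of errors
        w *= n / w.sum()                             # weights sum to n
        classifiers.append(clf)
        zetas.append(zeta)
    return classifiers, np.array(zetas)

def predict(classifiers, zetas, X):
    """Weighted majority vote of eq. (3)."""
    votes = sum(z * c(X) for z, c in zip(zetas, classifiers))
    return np.where(votes > 0, 1, -1)
```

Because the weights only enter through the sampling probabilities, any classifier can be plugged in without supporting sample weights itself, which is the practical appeal of the resampling variant.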

B. Boosting using adaptable features

Usually the boosting algorithm uses a fixed set of features extracted from the input data $X$. In the simplest case these features can be the pixel values of the image themselves, and in other cases the result of some kind of filtering (see [1]) or feature extraction.

We have also considered a completely different approach: the idea is to introduce the feature extraction into the boosting process, in such a way that the features of each image are recomputed at a fixed rate of boosting steps. To compute the new features, the weights of the training vectors are used, so each time the features focus more on the most difficult examples.

TABLE I
General AdaBoost algorithm.

Given the training samples $X_1, X_2, \dots, X_N$:

1. Initialize the weights $W_i^1 = 1$ for all $i = 1, \dots, N$.
2. For $s = 1, 2, \dots, S$ do:
   • Resample the training samples $X$ according to their current weights, obtaining the samples $H$.
   • Train a classifier $C_s$ on $H$.
   • Classify the samples $X$ using $C_s$.
   • Compute the probability of classification error, taking the weights into account, as follows:

     $$\varepsilon_s = \frac{1}{N} \sum_{i=1}^{N} W_i^s \xi_i^s \tag{1}$$

     where $\xi_i^s = 1$ if $X_i$ was wrongly classified in step $s$, and $\xi_i^s = 0$ otherwise.
   • Compute $\zeta_s$ as:

     $$\zeta_s = \frac{1}{2} \log\left(\frac{1 - \varepsilon_s}{\varepsilon_s}\right) \tag{2}$$

   • If $\varepsilon_s < 0.5$, set $W_i^{s+1} = W_i^s \exp(\zeta_s \xi_i^s)$ for each training vector $i$, and then normalize the weights so that $\sum_{i=1}^{N} W_i^{s+1} = N$. Otherwise restart the algorithm.
3. Finally, the classifiers obtained are combined by weighted majority voting, using the coefficients $\zeta_s$, which encode a measure of the error in each step. The final decision rule for each new test vector is:

   $$O(x) = \sum_{s=1}^{S} \zeta_s L_s > 0 \tag{3}$$

   where $L_s$ is the label assigned by classifier $s$, $L_s \in \{-1, 1\}$.

In our experiments, we have used a variant of the Non-negative Matrix Factorization (NMF) algorithm (see [9] and [10] for more details) to recompute the features, and our classifier was a single-layer perceptron.

The basic NMF technique introduced by Lee and Seung tries to find parts in the training objects, providing a part-based set of bases. Like PCA, NMF minimizes the mean square error criterion, but the key point of the NMF algorithm is the use of non-negativity constraints applied to both the set of bases and the coefficients that represent each vector in the reduced space. These non-negativity constraints always yield sparse bases because the weighted combination of the bases is always additive: bases cannot have large active regions, because once a large portion of an image is used as a base it cannot be removed (due to the additivity constraint).

Given a set of $n$ $m$-dimensional data points $X$ (representing positive-valued pixel images), the general formulation of the NMF algorithm consists of finding the non-negative bases $W$ and coefficients $H$ that satisfy:

$$X_{ij} \approx (WH)_{ij} = \sum_{d=1}^{D} W_{id} H_{dj} \tag{4}$$

Usually we will use a number of bases $r < n$, to obtain a significant dimensionality reduction and a selection of the most useful features. In the case of face images it can be shown that the non-negativity constraints yield bases which represent specific parts of faces such as eyes, mouth, etc. To find this set of bases and coefficients, the following update rules should be iterated:

$$W_{ij} \leftarrow W_{ij} \sum_{d} \frac{X_{id}}{(WH)_{id}} H_{jd} \tag{5}$$

$$W_{ij} \leftarrow \frac{W_{ij}}{\sum_{k} W_{kj}} \tag{6}$$

$$H_{jd} \leftarrow H_{jd} \sum_{i} W_{ij} \frac{X_{id}}{(WH)_{id}} \tag{7}$$

Note that the matrices $W$ and $H$ must be randomly initialized with positive values.

The advantage of this adaptive boosting approach is that every boosting step is more specialized in the misclassified examples, and the features being considered are learned giving more importance to the most difficult examples. This should lead the algorithm to faster convergence, achieving the same or even better results than classic boosting in fewer steps.
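The update rules (5) to (7) translate almost directly into matrix operations. The sketch below implements only the basic, unweighted rules; the weighted variant used inside the boosting loop is not specified in the text and is not reproduced here. The function name, the iteration count and the small constant `eps` that guards against division by zero are assumptions of this illustration.

```python
import numpy as np

def nmf(X, r, iters=200, seed=0, eps=1e-9):
    """Basic NMF by multiplicative updates.

    X: non-negative array of shape (m, n), one m-dimensional sample per column.
    Returns bases W of shape (m, r) and coefficients H of shape (r, n).
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, r)) + eps      # random positive initialization
    H = rng.random((r, n)) + eps
    for _ in range(iters):
        R = X / (W @ H + eps)                   # ratios X_id / (WH)_id
        W = W * (R @ H.T)                       # rule (5)
        W = W / W.sum(axis=0, keepdims=True)    # rule (6): columns sum to one
        R = X / (W @ H + eps)
        H = H * (W.T @ R)                       # rule (7)
    return W, H
```

Since every update multiplies a non-negative entry by a non-negative factor, `W` and `H` stay non-negative throughout, which is what produces the additive, part-based bases described above.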

III. NONPARAMETRIC DISCRIMINANT ANALYSIS

Once the faces are detected, they must be normalized and classified in order to verify the identity of the subject. In our scheme we use discriminant analysis to project the data into a reduced space, followed by a nearest neighbor (NN) classification.

As we use NN as the classification rule, a suitable feature extraction algorithm is needed, and discriminant analysis can be very useful for this task. In this section we introduce classic Fisher Discriminant Analysis, and then show how its nonparametric extension can solve the main drawbacks of FLD: the Gaussian distribution assumption and the reduced dimensionality of the generated subspaces. We also present our modification of classic NDA [5], which is expected to improve the NN classification.

A. Fisher Discriminant Analysis

The goal of discriminant analysis is to find the features that best separate the different classes. One of the most widely used criteria $J$ for this purpose is to maximize

$$J = \operatorname{tr}(S_I^{-1} S_E) \tag{8}$$

where the matrices $S_E$ and $S_I$ generally represent the scatter of sample vectors between different classes and within a class, respectively. It has been shown (see [11], [12]) that the $M \times D$ linear transform that satisfies

$$W = \arg\max_{W^T S_I W = I} \operatorname{tr}(W^T S_E W) \tag{9}$$

optimizes the separability measure $J$. This problem has an analytical solution based on the eigenvectors of the scatter matrices. The algorithm presented in Table II obtains this solution [12].

TABLE II
General algorithm for solving the discriminability optimization problem stated in equation (9).

1. Given $X$, the matrix containing the data samples placed as $N$ $D$-dimensional columns, $S_I$, the within-class scatter matrix, and $M$, the maximum dimension of the discriminant space,
2. Compute the eigenvectors and eigenvalues of $S_I$. Let $\Phi$ be the matrix with the eigenvectors placed as columns and $\Lambda$ the diagonal matrix with only the nonzero eigenvalues on the diagonal. $M_I$ is the number of nonzero eigenvalues.
3. Whiten the data with respect to $S_I$ to obtain $M_I$-dimensional whitened data, $Z = \Lambda^{-1/2} \Phi^T X$.
4. Compute $S_E$ on the whitened data.
5. Compute the eigenvectors and eigenvalues of $S_E$ and let $\Psi$ be the matrix with the eigenvectors placed as columns, sorted by decreasing eigenvalue.
6. Preserve only the first $M_E = \min\{M_I, M, \operatorname{rank}(S_E)\}$ columns, $\Psi_M = \{\psi_1, \dots, \psi_{M_E}\}$ (those corresponding to the $M_E$ largest eigenvalues).
7. The resulting optimal transformation is $W = \Psi_M^T \Lambda^{-1/2} \Phi^T$, and the projected data is $Y = WX = \Psi_M^T Z$.

The most widespread approach for defining the within-class and between-class scatter matrices is the one that makes use of only up to second-order statistics of the data. This was done in a classic paper by Fisher [4], and the technique is referred to as Fisher Discriminant Analysis (FLD). In FLD the within-class scatter matrix is usually computed as a weighted sum of the class-conditional sample covariance matrices. If equiprobable priors are assumed for the classes $C_k$, $k = 1, \dots, K$, then

$$S_I = \frac{1}{K} \sum_{k=1}^{K} \Sigma_k \tag{10}$$

where $\Sigma_k$ is the class-conditional covariance matrix, estimated from the sample set. The between-class scatter matrix is defined as

$$S_E = \frac{1}{K} \sum_{k=1}^{K} (\mu_k - \mu_0)(\mu_k - \mu_0)^T \tag{11}$$

where $\mu_k$ is the class-conditional sample mean and $\mu_0$ is the unconditional (global) sample mean.

Notice that the rank of $S_E$ is $K - 1$, so the number of extracted features is, at most, one less than the number of classes. Also notice the parametric nature of the scatter matrix. The solution provided by FLD is blind beyond second-order statistics, so we cannot expect the method to accurately indicate which features should be extracted to preserve any complex classification structure.
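The steps of Table II, combined with the FLD scatter matrices of equations (10) and (11), can be sketched as below. The function names, the relative tolerance used to decide which eigenvalues count as nonzero, and the callback `between_scatter` (which recomputes $S_E$ on the whitened data, step 4) are assumptions of this illustration. The code stores samples as columns, so the returned $W$ has shape $M_E \times D$ and the constraint of equation (9) reads `W @ S_I @ W.T == I` in this layout.

```python
import numpy as np

def fld_scatter(X, labels):
    """Within-class (eq. 10) and between-class (eq. 11) scatter, equal priors.

    X has shape (D, N): one D-dimensional sample per column.
    """
    classes = np.unique(labels)
    D = X.shape[0]
    mu0 = X.mean(axis=1, keepdims=True)          # global sample mean
    S_I = np.zeros((D, D))
    S_E = np.zeros((D, D))
    for k in classes:
        Xk = X[:, labels == k]
        muk = Xk.mean(axis=1, keepdims=True)     # class-conditional mean
        S_I += np.cov(Xk, bias=True)             # class-conditional covariance
        S_E += (muk - mu0) @ (muk - mu0).T
    return S_I / len(classes), S_E / len(classes)

def solve_table2(X, S_I, between_scatter, M, tol=1e-10):
    """Whitening followed by an eigen-decomposition, as in Table II."""
    lam, Phi = np.linalg.eigh(S_I)                   # step 2
    keep = lam > tol * lam.max()                     # nonzero eigenvalues
    T = (Phi[:, keep] / np.sqrt(lam[keep])).T        # Lambda^{-1/2} Phi^T
    Z = T @ X                                        # step 3: whitened data
    S_E = between_scatter(Z)                         # step 4
    mu, Psi = np.linalg.eigh(S_E)                    # step 5
    order = np.argsort(mu)[::-1]                     # decreasing eigenvalue
    M_E = min(T.shape[0], M, np.linalg.matrix_rank(S_E))   # step 6
    Psi_M = Psi[:, order[:M_E]]
    return Psi_M.T @ T                               # step 7: W
```

For FLD the callback is simply `lambda Z: fld_scatter(Z, labels)[1]`; a nonparametric between-class scatter can be substituted without changing the solver.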

B. Nonparametric Discriminant Analysis

In [5] Fukunaga and Mantock present a nonparametric method for discriminant analysis in an attempt to overcome the limitations present in FLD. In nonparametric discriminant analysis the between-class scatter $S_E$ is of nonparametric nature. This scatter matrix is generally full rank, thus loosening the bound on the extracted feature dimensionality. Also, the nonparametric structure of this matrix inherently leads to extracted features that preserve relevant structures for classification. We briefly present this technique, which is extensively detailed in [12].


Fig. 1. First directions of NDA (solid line) and FLD (dashed line) projections, for two artificial datasets. Observe the results in the right-hand figure, where the FLD assumptions are not met.

In NDA, the between-class scatter matrix is obtained from vectors locally pointing to another class. This is done as follows. The extra-class nearest neighbor for a sample $x \in C_k$ is defined as $x^E = \{x' \notin C_k : \|x' - x\| \le \|z - x\|, \forall z \notin C_k\}$. In the same fashion we can define the set of intra-class nearest neighbors as $x^I = \{x' \in C_k, x' \ne x : \|x' - x\| \le \|z - x\|, \forall z \in C_k, z \ne x\}$.

From these neighbors, the extra-class differences are defined as $\Delta^E = x - x^E$ and the intra-class differences as $\Delta^I = x - x^I$. Notice that $\Delta^E$ points locally to the nearest class (or classes) that does not contain the sample. The nonparametric between-class scatter matrix is defined as (assuming uniform priors)

$$S_E = \frac{1}{N} \sum_{n=1}^{N} \Delta_n^E (\Delta_n^E)^T \tag{12}$$

where $\Delta_n^E$ is the extra-class difference for sample $x_n$.

A parametric form is chosen for the within-class scatter matrix $S_I$, defined as in (10). Figure 1 illustrates the differences between NDA and FLD on two artificial datasets, one with Gaussian classes, where the results are similar, and one where the FLD assumptions are not met. In the second case, the bimodality of one of the classes displaces the class mean, introducing errors in the estimate of the parametric version of $S_E$. The nonparametric version is not affected by this situation. We now make use of the notation introduced above to examine the relationship between NN and NDA. This results in a modification of the within-class covariance matrix, which we also introduce.

Given a training sample $x$, the accuracy of the 1-NN rule can be directly computed by examining the ratio $\|\Delta^E\| / \|\Delta^I\|$. If this ratio is greater than one, $x$ will be correctly classified. Given the $M \times D$ linear transform $W$, the projected distances are defined as $\Delta_W^{E,I} = W \Delta^{E,I}$. Notice that this definition does not exactly agree with the extra-class and intra-class distances in the projection space since, except for the orthonormal transformation case, we have no guarantee of distance preservation. Equivalence of both definitions is asymptotically true. By the above remarks, it is expected that optimization of the following objective function should improve, or at least not degrade, NN performance:

$$W = \arg\max_{E\{\|\Delta_W^I\|^2\} = 1} E\{\|\Delta_W^E\|^2\} \tag{13}$$

This optimization problem can be interpreted as: find the linear transform that maximizes the distance between classes while preserving the expected distance among the members of a single class. Consider that

$$E\{\|\Delta_W\|^2\} = E\{(W\Delta)^T (W\Delta)\} = \operatorname{tr}(W^T \Delta \Delta^T W) \tag{14}$$

where $\Delta$ can be $\Delta^I$ or $\Delta^E$. Replacing (14) in (13), we have that this last equation is a particular case of eq. (9). Additionally, the formulas for the within-class and between-class scatter matrices are directly extracted from this equation. In this case, the between-class scatter matrix agrees with (12), but the within-class scatter matrix is now defined in a nonparametric fashion:

$$S_w = \frac{1}{N} \sum_{n=1}^{N} \Delta_n^I (\Delta_n^I)^T \tag{15}$$

Given that we have an optimization problem of the form given in (9), the algorithm presented in Table II can also be applied to the optimization of our proposed objective function (13).
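The two nonparametric scatter matrices of equations (12) and (15), together with the ratio used to judge the 1-NN rule, can be computed from pairwise distances. The sketch below is a brute-force illustration: the function name is hypothetical, all pairwise distances are held in memory, and every class is assumed to contain at least two samples.

```python
import numpy as np

def nda_scatter(X, labels):
    """Nonparametric scatter matrices from nearest-neighbor differences.

    X has shape (D, N): one sample per column. Returns S_E (eq. 12),
    S_w (eq. 15) and the per-sample ratio ||Delta^E|| / ||Delta^I||.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    N = X.shape[1]
    diff = X[:, :, None] - X[:, None, :]
    d2 = np.sum(diff ** 2, axis=0)                # (N, N) squared distances
    np.fill_diagonal(d2, np.inf)                  # a sample is not its own neighbor
    same = labels[:, None] == labels[None, :]
    idx_I = np.argmin(np.where(same, d2, np.inf), axis=1)   # intra-class NN
    idx_E = np.argmin(np.where(same, np.inf, d2), axis=1)   # extra-class NN
    dI = X - X[:, idx_I]                          # Delta^I, one column per sample
    dE = X - X[:, idx_E]                          # Delta^E, one column per sample
    S_E = dE @ dE.T / N
    S_w = dI @ dI.T / N
    ratio = np.linalg.norm(dE, axis=0) / np.linalg.norm(dI, axis=0)
    return S_E, S_w, ratio
```

Samples with `ratio > 1` are those that a leave-one-out 1-NN rule classifies correctly, which is the quantity the objective function (13) tries to increase after projection.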

IV. EXPERIMENTS

We have designed a face verification application composed of two parts: the detection of faces in natural images, and the verification of the identity of the face. For the face detector we have opted for a classic boosting scheme instead of adaptable features. Although the accuracies obtained with the adaptive scheme are higher, the main goal of this experiment was to design a real-time system, and the speed of the adaptive scheme was not yet sufficient to satisfy this real-time restriction.

This real-time requirement has also justified the choice of a feature extraction algorithm (to reduce the amount of data storage) in the verification step.

A. Face detection

The detection process follows a boosting scheme in which the chosen classifier is the naive Bayes classifier. In particular, we have used naive Bayes assuming a Bernoulli distribution on the data [13]. This assumption is justified because we have used binary data instead of the original images.

For each input image we have extracted a set of fixed features, using a detector of ridges and valleys (see [14] for more details) because of its robustness against changes in illumination. The final features have been obtained by thresholding the response of the filters, assigning −1 to a pixel when it lies in a valley, 1 when it lies on a ridge, and 0 otherwise (see fig. 3). To convert this ternary representation into a binary one, we have separated the filtered image into two representations, one for ridges and one for valleys, in which a pixel is set to 1 where there is a ridge (respectively, a valley) and to 0 otherwise. Both representations have been vectorized and concatenated, so in the end we have a binary vector with double the dimensionality.

As we use a Bayesian classifier, for each target image $X_i$ we decide that $X_i$ is a face if $p(X_i|C_{Faces}) > p(X_i|C_{NonFaces})$ (see the MAP rule [15]). The conditional probabilities for the faces are estimated as

$$p(X|C_{Faces}) = \prod_{d=1}^{D} p_d^{X_d} q_d^{1 - X_d} \tag{16}$$

where $X_d$ is the $d$-th pixel of the image, $p_d$ is the probability of finding a 1 in pixel $d$, and $q_d$ is the probability of finding a 0 in pixel $d$ ($p_d = 1 - q_d$). These probabilities $p_d$ and $q_d$ can be estimated directly from the training samples by computing the frequencies of the ones and zeros in the face vectors. In a similar way, the conditional probabilities for the non-faces are:

$$p(X|C_{NonFaces}) = \prod_{d=1}^{D} \bar{p}_d^{X_d} \bar{q}_d^{1 - X_d} \tag{17}$$

where $\bar{p}_d$ and $\bar{q}_d$ are obtained in the same way as $p_d$ and $q_d$, but using the non-face samples instead of the face samples.

This Bayesian classifier has been used within a boosting scheme, where we have selected one feature at each step. We have used 6500 training images, with 1500 face and 5000 non-face images. The global performance of the final detector is a 2.45% false detection rate and only a 0.83% false positive rate.

Figure 4 shows some examples of images wrongly classified by the detector. Some twisted and blurry faces are missed, and some natural images with a structure very similar to a face are also misclassified.
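The binarization of the ridge and valley map and the Bernoulli likelihoods of equations (16) and (17) can be sketched as follows. Several details are assumptions of this illustration and are not stated in the paper: the function names, the order of the concatenation (ridge mask first), the Laplace smoothing term `alpha` that keeps the estimated probabilities away from 0 and 1, and the use of log-likelihoods to avoid numerical underflow. The boosting wrapper that selects one feature per step is not reproduced.

```python
import numpy as np

def binarize_ternary(T):
    """Turn a ternary map (-1 valley, 0 flat, 1 ridge) into a binary vector.

    The ridge mask and the valley mask are vectorized and concatenated,
    doubling the dimensionality.
    """
    T = np.asarray(T).ravel()
    return np.concatenate([T == 1, T == -1]).astype(np.uint8)

def fit_bernoulli(X, alpha=1.0):
    """Estimate p_d, the probability of a 1 at each position.

    X has shape (N, D) with binary entries; alpha is a smoothing assumption.
    """
    return (X.sum(axis=0) + alpha) / (X.shape[0] + 2 * alpha)

def log_likelihood(X, p):
    """Logarithm of the product in eq. (16) or (17), one value per row of X."""
    return X @ np.log(p) + (1 - X) @ np.log(1 - p)

def is_face(X, p_face, p_nonface):
    """Decide 'face' when the face likelihood exceeds the non-face one."""
    return log_likelihood(X, p_face) > log_likelihood(X, p_nonface)
```

A typical use would call `fit_bernoulli` once on the binarized face vectors and once on the non-face vectors, and then apply `is_face` to the binarized candidate windows.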

B. Face detection using adaptable features

Although we have discarded its use in the real-time application because of the computational constraints, we also show the results obtained with the adaptable boosting scheme, in order to compare the efficiency of both algorithms. The introduction of the feature extraction into the boosting scheme is still an interesting line of research and can be optimized further.

We have trained the boosting scheme using 1500 face images extracted from different face databases, and 5000 non-face images extracted from random natural images. To test the performance of the algorithm, a large set of 28000 images has been used (26000 non-faces and 2000 faces). We have achieved a global accuracy of 99.73%, with only 1.07% false rejections and a 0.2% false positive ratio.

Figure 2 shows how the results can be improved significantly using the adaptable feature approach, as we detect 1.4% more faces. In a similar way, we are able to improve the rejection of non-face images.
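As a rough consistency check of the reported figures, assume that the false rejection rate is measured over the 2000 test faces and the false positive ratio over the 26000 test non-faces (the text does not state the denominators explicitly). Then:

$$0.0107 \times 2000 + 0.002 \times 26000 = 21.4 + 52 = 73.4 \ \text{errors}, \qquad 1 - \frac{73.4}{28000} \approx 0.9974$$

This is about 99.74%, which agrees with the reported global accuracy of 99.73% up to the rounding of the individual rates.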

C. Face Recognition in an Uncontrolled Environment

A second experiment was performed using the modified NDA representation for real-time recognition in an uncontrolled environment. We now detail the design of this experiment, which attempts to simulate a surveillance situation. Like most face processing applications, ours consisted of three engines: detection, normalization and recognition. We now focus on the recognition stage; the detection scheme was described in the first part of this section.

Given a single frame of 576 × 768 pixels, the image was resized to 288 × 384 to avoid the effect of interlacing. Detection was performed using the above scheme, but additional layers were added at the beginning and end of the cascade. The new first layer performs a very fast elimination of the flat regions of the image. Flat regions are frequent in indoor visual surveillance (walls, doors, ceilings, floors, etc.). The last layer is an expensive stage that takes care of additional false positives, including difficult faces. In our case, since the final objective is recognition, these face images are also discarded, since they will surely not provide any kind of certainty in the recognition. The computational cost of this layer is minimal, since it deals only with the subimages that successfully went through the whole cascade. The size of the detected faces was chosen as 32 × 32, such that the size of the face in the original (interlaced) frame was 64 × 64 pixels.

The normalization engine has two parts. One takes care of illumination and the second takes care of geometric normalization. Given a candidate face, illumination was normalized based on the local mean and variance. Geometric normalization was quite simplistic. Tilt was not taken into account. Yaw and normal face rotations were normalized taking advantage of the almost bilateral symmetry of faces. Gray-value frequencies were obtained for different unidimensional projections, and this information was used to obtain the position of the eyes and the face pose.

Fig. 2. Accuracies obtained on the face and non-face images as a function of the boosting steps. As can be seen, the adaptive feature learning achieves better results.

Fig. 3. Example of ridges and valleys detection for a subset of face and non-face images.

Fig. 4. Examples of false negative and false positive images obtained with the boosting scheme. As can be seen, the false positives are images with a structure similar to a face (eyes, nose and mouth).

Since this scheme results in a very poor approach an ad-ditional threshold was introduced in order to decide thegoodness of the normalization. Actually, this thresholdacts as an additional layer of detection. Face candidatesyielding values below this threshold were disregarded.Finally, for the recognition engine we considered ascheme based on a linear projection using the NDA al-gorithm followed by a nearest neighbor classification.Recognition was performed on the64 × 64 faces in the

Page 7: Classifier combination applied to real time face detection …jordi/AVR04.pdffor face detection and classification. We have used a Bayesian classifier for face detection and a nearest

original (interlaced) frame. These face images were trans-formed into two32×64 face images with alternating rows.This avoids the effect of interlace without the need for adimensionality reduction. Additionally, for each detectedface we count with two samples of the subject, each sam-ple taken within a very short lapse of time (a fraction of asecond). The subject label is chosen by combining the re-sults obtained for each of the two samples. So our initialmeasurements consist in32 × 64 face images (2048 di-mensional vectors). This situation is illustrated in fig. (5).In fig. (5.a) some detected faces from a video sequenceare shown. Fig. (5.b) shows these same faces after thenormalization. Only one of the two samples is displayed.Given these measurement pairs, we now detail the featureextraction and classification stages. Our face detector andnormalizer was connected to a television in order to learnthe PCA transform from the2048 dimensional measure-ments. Images of benchmark datasets were added. The fi-nal PCA projection matrix was learnt using approximately30000 samples. We chose to preserve224 principal com-ponents. This choice preserves approximately97% of thevariance. Given a training set with the labelled subjects,we project them in the PCA space prior to learning theNDA representation. In this case, NDA cannot be straight-forwardly applied. The problem is that, as we mentioned,each subject yields two samples and, generally these sam-ples are nearest neighbors. So instead of using the nearestneighbor we chose to train NDA using the second near-est neighbor.128 NDA components were preserved. Soeach detected face turned out to be represented by2 128-dimensional vectors, one per deinterlaced sample. 
Classification was performed using the 5 nearest neighbors, with average voting to combine the results of the two samples. All the parameters of this scheme (PCA and NDA dimensionality, number of nearest neighbors and classifier combination policy) were set by cross-validation on the training set. We now detail how this set was built and the experimental results.
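
One possible reading of the combination rule is sketched below: each of the two deinterlaced samples casts a 5-nearest-neighbor vote, the two vote distributions are averaged, and the class with the highest average wins. The exact voting policy is not spelled out in the text, so this interpretation, the Euclidean distance, and the names `knn_vote_fractions` and `classify_pair` are assumptions made for illustration.

```python
import numpy as np


def knn_vote_fractions(query, gallery, labels, n_classes, k=5):
    """Fraction of the k nearest gallery vectors belonging to each class."""
    distances = np.linalg.norm(gallery - query, axis=1)
    nearest = np.argsort(distances)[:k]
    counts = np.bincount(labels[nearest], minlength=n_classes)
    return counts / k


def classify_pair(sample_a, sample_b, gallery, labels, n_classes, k=5):
    """Average the k-NN votes of the two samples and pick the top class."""
    votes_a = knn_vote_fractions(sample_a, gallery, labels, n_classes, k)
    votes_b = knn_vote_fractions(sample_b, gallery, labels, n_classes, k)
    averaged = (votes_a + votes_b) / 2.0
    return int(np.argmax(averaged)), averaged


# Toy gallery (hypothetical): two subjects in a 2-D feature space.
gallery = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5],
                    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0],
                    [10.5, 10.5]])
labels = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

label, votes = classify_pair(np.array([0.2, 0.2]), np.array([0.3, 0.1]),
                             gallery, labels, n_classes=2)
print(label, votes)
```

In the paper's setting the gallery vectors would be the 128-dimensional NDA projections rather than 2-D points.
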

A Sony EVI D-31 camera was placed at the bottom of a staircase in the Computer Vision Center. The camera was connected to a VCR, and four hours of recordings at peak hours were gathered each day every other week, over a total period of 6 weeks. The face detector was applied to these videos, and the detected images were saved and manually labeled. Of the approximately 80 different people detected in all tapes, only the 47 with more than 30 detected faces were included in the gallery. The total number of faces for these 47 subjects was 4176 (approximately 88 faces per subject), resulting in 8352 samples after deinterlacing. 10-fold cross-validation on the faces was used to evaluate the performance of our classifier. For this experiment, classification accuracy was 96.83%. We also ran a supervised cross-validation in which no samples from the same day appeared in both the training and test sets. Results were very similar, yielding 95.9% accuracy.

Fig. 5. Examples of faces used in the face recognition experiment, (a) before and (b) after normalization. Note the poor quality of the face images.

Recognition was also evaluated online, recording the recognition results and video so that the online classification accuracy could be evaluated manually. The recognition rate for approximately 2000 test images belonging to the 47 subjects in the gallery was 92.2%. In this experiment, we also observed that the classifier would greatly benefit from temporal integration, which has not yet been implemented. The frame rate of the application with all three engines working and this gallery of 47 subjects was approximately 6 fps. Fig. 6 shows two frames taken directly from the working application. The first frame illustrates the environment in which the experiment took place, and the second illustrates the recognizer at work.
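
The supervised cross-validation, in which samples from one recording day never appear in both the training and test sets, can be sketched as a split over days rather than over individual faces. The text does not say how the folds were formed, so the helper `day_grouped_folds`, the number of folds, and the synthetic day labels below are hypothetical.

```python
import numpy as np


def day_grouped_folds(days, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs so that no day spans both sets.

    days: one recording-day identifier per sample.
    """
    days = np.asarray(days)
    unique_days = np.unique(days)
    if n_folds > len(unique_days):
        raise ValueError("n_folds cannot exceed the number of distinct days")
    rng = np.random.default_rng(seed)
    shuffled_days = rng.permutation(unique_days)
    # Whole days, not individual samples, are distributed over the folds.
    for fold_days in np.array_split(shuffled_days, n_folds):
        test_mask = np.isin(days, fold_days)
        yield np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


# Synthetic example (hypothetical): 6 recording days, 4 samples each.
days = np.repeat(np.arange(6), 4)
for train_idx, test_idx in day_grouped_folds(days, n_folds=3):
    print(sorted(set(days[test_idx].tolist())), len(train_idx), len(test_idx))
```

Splitting by day gives a more conservative estimate than a per-face split, since lighting and clothing are shared within a day; this is consistent with the slightly lower accuracy reported for the supervised variant.
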

V. CONCLUSIONS

In this paper a real-time face recognition framework has been presented. Two different problems have been addressed. First, face detection in uncontrolled environments has been solved using a boosting scheme based on the naive Bayes classifier. As has been shown, this scheme achieves very low false positive rates. Second, a face classification scheme based on NDA feature extraction followed by nearest neighbor classification has been implemented. The results show good accuracy in tests made in a real environment (a 92.2% online recognition rate for a gallery of 47 subjects), and the complete system runs at approximately 6 fps.

Fig. 6. Frames (a) and (b) extracted from the human ID at a distance application in the Computer Vision Center.

We also show a new approach to boosting. We have introduced feature learning into the boosting scheme by adapting the features used to the most difficult examples. The algorithm achieves excellent accuracy, but further work is needed to speed it up before it can be used in a real-time environment. Another open question in the boosting scheme is whether we can benefit from asymmetric classifiers in the boosting process. One desirable property in face detection is not to miss faces during detection, and this property should be built into the internal boosting classifier.

ACKNOWLEDGMENT

This work is supported by MCYT grant TIC2003-00654, Ministerio de Ciencia y Tecnología, Spain.

REFERENCES

[1] P. Viola and M. Jones, “Robust real-time object detection,”Inter-national Journal of Computer Vision - to appear, 2002.

[2] M. Kirby and L. Sirovich, “Application of the karhunen-loeve pro-cedure for the characterization of human faces,”IEEE Trans. Pat-tern Anal. Machine Intell., vol. 12, no. 1, pp. 103–108, Jan 1990.

[3] M. Turk and A. Pentland, “Eigenfaces for recognition,”Journal ofCognitive Neuroscience, vol. 3, no. 1, pp. 71–86, 1991.

[4] R. Fisher, “The use of multiple measurements in taxonomic prob-lems,”Ann. Eugenics, vol. 7, pp. 179–188, 1936.

[5] K. Fukunaga and J. Mantock, “Nonparametric discriminant analy-sis,” IEEE Trans. Pattern Anal. Machine Intell., vol. 5, no. 6, pp.671–678, nov 1983.

[6] Y. Freund and R. E. Schapire, “Experiments with a newboosting algorithm,” inInternational Conference on MachineLearning, 1996, pp. 148–156. [Online]. Available: cite-seer.nj.nec.com/freund96experiments.html

[7] M. Skurichina and R. P. Duin, “Bagging for linear classifiers,”Delft University of Technology, Tech. Rep., 2002.

[8] M. Skurichina and R. P.W.Duin, “Bagging, boosting and the ran-dom subspace method for linear classifiers,”Pattern Analysis andApplications, vol. 5, pp. 121–135, 2002.

[9] D. D. Lee and H. S. Seung, “Learning the parts of objects withnonnegative matrix factorization,”Nature, vol. 401, pp. 788–791,July 1999.

[10] M. D.Guillamet and J.Vitria, “Weighted non-negative matrix fac-torization for local representations,” inCVPR, Kauai, Hawaii,2001, pp. 942–947.

[11] P. Devijver and J. Kittler,Pattern Recognition: A Statistical Ap-proach. London, UK: Prentice Hall, 1982.

[12] K. Fukunaga, Introduction to Statistical Pattern Recognition,2nd ed. Boston, MA: Academic Press, 1990.

[13] M. Bressan, “Statistical independence for classification of highdimensional data,” Ph.D. dissertation, Universitat Autonoma deBarcelona, 2003.

[14] J. S. J. V. A. Lopez, F. Lumbreras, “Evaluation of methods forridge and valley detection,”IEEE ransactions on pattern analysisand Machine Intelligence, vol. 21, p. 327:335, 1999.

[15] R. D. P.Hart and D. Stork,Pattern Classification, 2nd ed. NewYork: Jon Wiley and Sons, Inc, 2001.