
Pattern Recognition Letters 24 (2003) 2815–2821

www.elsevier.com/locate/patrec

Face recognition using LDA mixture model

Hyun-Chul Kim, Daijin Kim *, Sung Yang Bang

Department of Computer Science and Engineering, POSTECH, San 31, Hyoja-Dong, Nam-Gu, Pohang 790-784, South Korea

Received 22 February 2002; received in revised form 8 May 2003

* Corresponding author. Tel.: +82-562-279-2249; fax: +82-562-279-2299. E-mail address: [email protected] (D. Kim).

doi:10.1016/S0167-8655(03)00126-0

Abstract

Linear discriminant analysis (LDA) provides a projection that discriminates the data well and shows good performance for face recognition. However, since LDA provides only one transformation matrix over the whole data, it is not sufficient to discriminate complex data consisting of many classes with high variations, such as human faces. To overcome this weakness, we propose a new face recognition method based on the LDA mixture model, in which the set of all classes is partitioned into several clusters and a transformation matrix is obtained for each cluster. This more accurate and detailed representation improves classification performance. Simulation results for face recognition show that the LDA mixture model outperforms PCA, LDA, and the PCA mixture model in terms of classification performance.

© 2003 Elsevier B.V. All rights reserved.

Keywords: Linear discriminant analysis; LDA mixture model; PCA mixture model; Face recognition

1. Introduction

Face recognition is an active research area spanning several research fields such as image processing, pattern recognition, computer vision, and neural networks (Chellappa et al., 1995). Face recognition has many applications, such as biometric systems, surveillance, and content-based video processing systems. However, there are still many open problems, such as recognition under high illumination variations.

There are two main approaches to face recognition (Chellappa et al., 1995; Brunelli and Poggio, 1993). The first is the feature-based matching approach, which uses the relationships between facial features such as the eyes, mouth, and nose (Brunelli and Poggio, 1993; Wiskott et al., 1997). The second is the template matching approach, which uses the holistic features of the face image (Brunelli and Poggio, 1993; Turk and Pentland, 1991; Belhumeur et al., 1997).

The eigenface method is a well-known template matching method (Sirovich and Kirby, 1987; Turk and Pentland, 1991). Face recognition using the eigenface method is performed on feature values transformed by PCA. Linear discriminant analysis (LDA) has also been applied to face recognition; the face recognition method using LDA is called the fisherface method. It can be applied directly to the gray image (Belhumeur et al., 1997; Etemad and Chellappa, 1997) or to feature vectors of the gray image extracted on a sparse grid (Duc et al., 1999). In both cases the classification performance is significantly better than that of the PCA counterparts.

LDA is a well-known classical statistical technique that uses the projection maximizing the ratio of the scatter among the data of different classes to the scatter within the data of the same class (Duda et al., 2001). Features obtained by LDA are useful for pattern classification since they make the data of the same class closer to each other and the data of different classes further away from each other.
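To make the criterion concrete, the following is a minimal sketch (ours, not the paper's code) of the standard LDA scatter construction and projection with NumPy; X, y, and m are illustrative names for the data matrix, the label vector, and the target dimension, and the pseudo-inverse is a robustness choice, not prescribed by the paper.

```python
import numpy as np

def lda_projection(X, y, m):
    """Return m discriminant directions maximizing |W^t S_B W| / |W^t S_W W|."""
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    n = X.shape[1]
    S_W = np.zeros((n, n))  # within-class scatter
    S_B = np.zeros((n, n))  # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        S_W += (Xc - mc).T @ (Xc - mc)
        d = (mc - overall_mean)[:, None]
        S_B += len(Xc) * (d @ d.T)
    # Solve S_B w = lambda S_W w via pinv(S_W) @ S_B; keep the top-m directions.
    evals, evecs = np.linalg.eig(np.linalg.pinv(S_W) @ S_B)
    top = np.argsort(evals.real)[::-1][:m]
    return evecs[:, top].real  # columns are the discriminant directions
```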

Typically, LDA is compared to PCA because both methods are multivariate statistical techniques for projection. PCA attempts to locate the projection that reduces the dimensionality of a data set while retaining as much of the variation in the data set as possible (Jolliffe, 1986). Since PCA does not use class information, LDA usually outperforms PCA for pattern classification.

Although LDA usually gives good discrimination performance, it has the drawback of having only one transformation matrix. For a problem with many classes with high variations, one transformation matrix is not sufficient for good discrimination. To overcome this drawback, we propose the LDA mixture model, which uses several transformation matrices. We partition the set of all classes into several clusters and apply the LDA technique to each cluster. In this way, we obtain several transformation matrices, which can then be used for recognition cluster-wise. For face image data, each cluster may correspond to a race or a group of similar faces. By using more than one transformation matrix, we can obtain better discrimination performance.

This paper is organized as follows. Section 2 describes the PCA mixture model and the LDA mixture model. Section 3 shows and discusses the simulation results. Finally, we present our conclusions.

2. LDA mixture model

As mentioned previously, the LDA mixture model requires partitioning the set of all classes into several clusters. In this work, we use the PCA mixture model for data partitioning because it can accurately divide the entire data into its constituent components, which makes the formulation of the LDA mixture model simpler. Therefore, we explain the PCA mixture model first, followed by the LDA mixture model.

2.1. PCA mixture model

In a mixture model (Jacobs et al., 1991; Jordan and Jacobs, 1994), a class is partitioned into a number of clusters, and the density function of the n-dimensional observed data x = (x_1, ..., x_n) is represented by a linear combination of the component densities of the partitioned clusters as

P(x) = \sum_{k=1}^{K} P(x \mid k, \theta_k) P(k),    (1)

where P(x | k, θ_k) and P(k) represent the conditional density and the prior probability of the kth cluster, respectively, and θ_k is the unknown model parameter of the kth cluster. The prior probabilities are chosen to satisfy the condition \sum_{k=1}^{K} P(k) = 1. The conditional density function P(x | k, θ_k) is often modelled by a Gaussian function as

P(x \mid k, \theta_k) = \frac{1}{(2\pi)^{n/2} |\Sigma_k|^{1/2}} \exp\left(-\tfrac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right),    (2)

where μ_k and Σ_k are the sample mean and covariance of the kth cluster, respectively.

The number of data points required to estimate the parameters of density functions defined on high-dimensional feature spaces increases at least proportionally to the square of the dimensionality (the curse of dimensionality). Consequently, in practical applications such as face classification, the dimensionality of the feature space needs to be reduced. To accomplish this, we use the PCA technique in this work. Let E[x] be the expectation of the stochastic vector x. In PCA (Hotelling, 1933), a set of observed n-dimensional data vectors X = {x_p}, p ∈ {1, ..., N}, is reduced to a set of m-dimensional feature vectors S = {s_p}, p ∈ {1, ..., N}, by a transformation matrix T as

s_p = T^t (x_p - E[x]),    (3)

where m ≤ n, T = (w_1, ..., w_m), and the vector w_j is the eigenvector corresponding to the jth largest eigenvalue of the sample covariance matrix C = (1/N) \sum_{p=1}^{N} (x_p - E[x])(x_p - E[x])^T, such that C w_k = \lambda_k w_k. The m principal axes in T are orthonormal axes onto which the retained variance under projection is maximal. One property of PCA is that projection onto the principal subspace minimizes the squared reconstruction error \sum_{p=1}^{N} \|x_p - \hat{x}_p\|^2. The optimal linear reconstruction is given by \hat{x}_p = T s_p + E[x], where s_p = T^t (x_p - E[x]), and the orthogonal columns of T span the space of the m principal eigenvectors of C.

We consider a PCA mixture model which

combines the above two models (Eqs. (1) and (3)) in such a way that the component densities of the mixture model can be estimated in the PCA-transformed space as

P(x) = \sum_{k=1}^{K} P(x \mid k, \theta_k) P(k),    (4)

P(x \mid k, \theta_k) = P(s_k \mid k, \theta_k),    (5)

where s_k = W_k^T (x - μ_k). The PCA feature vectors s_k are decorrelated due to the orthogonality of the transformation matrix W_k. Thus, their covariance matrix Σ_k^S = E[s_k s_k^T] is a diagonal matrix whose diagonal elements correspond to the principal eigenvalues. The conditional density function P(s_k | k, θ_k) of the PCA feature vectors in the kth cluster can then be simplified as

P(s_k \mid k, \theta_k) = \frac{1}{(2\pi)^{m/2} |\Sigma_k^S|^{1/2}} \exp\left(-\tfrac{1}{2} s_k^T (\Sigma_k^S)^{-1} s_k\right) = \prod_{j=1}^{m} \frac{1}{(2\pi)^{1/2} \lambda_{k,j}^{1/2}} \exp\left(-\frac{s_j^2}{2\lambda_{k,j}}\right),    (6)

where s = (s_1, ..., s_m) and {λ_{k,1}, ..., λ_{k,m}} are the m dominant eigenvalues of the feature covariance matrix Σ_k^S in the kth cluster. The proposed model, which has no Gaussian error term, can be considered a simplified form of the Tipping and Bishop model (Tipping and Bishop, 1999).
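As an illustration, here is a small sketch (with our own names, not the paper's code) of evaluating Eqs. (4)-(6): project x onto a cluster's retained eigenvectors and evaluate a diagonal Gaussian whose variances are the leading eigenvalues. W_k, mu_k, and lambdas_k are assumed per-cluster parameters (the m eigenvectors, the cluster mean, and the m leading eigenvalues).

```python
import numpy as np

def component_density(x, W_k, mu_k, lambdas_k):
    """Simplified component density of Eq. (6) for one cluster."""
    s = W_k.T @ (x - mu_k)                    # PCA feature vector s_k, Eq. (5)
    log_p = -0.5 * np.sum(s**2 / lambdas_k)   # exponent of Eq. (6)
    log_p -= 0.5 * np.sum(np.log(2 * np.pi * lambdas_k))  # normalizing constant
    return np.exp(log_p)

def mixture_density(x, params, priors):
    """Eq. (4): P(x) = sum_k P(x | k, theta_k) P(k)."""
    return sum(P_k * component_density(x, W, mu, lam)
               for (W, mu, lam), P_k in zip(params, priors))
```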

Fig. 1 illustrates a PCA mixture model where the number of mixture components is K = 2 and the dimension of the feature vectors is m = 2; the line segments in each cluster represent the two column vectors w_1 and w_2, and the intersection of the two line segments represents a mean vector E[x].

Fig. 1. An illustration of the PCA mixture model.

The parameters of a mixture model can be estimated by an EM algorithm, which maximizes the likelihood (Dempster et al., 1977). The EM algorithm for the proposed model can be easily derived. Each iteration consists of two steps: an expectation step (E-step) followed by a maximization step (M-step). Each step is run for each mixture component. The EM algorithm starts after the parameters are initialized and stops when the density undergoes no further changes.

(1) E-step: Given the feature data set X and the parameters Θ^(t) of the mixture model at the tth iteration, we estimate the posterior distribution P(z | x, Θ^(t)) using

P(z \mid x, \Theta^{(t)}) = \frac{P(x \mid z, \Theta^{(t)}) P(z)}{\sum_{k=1}^{K} P(x \mid k, \Theta^{(t)}) P(k)},    (7)

where P(x | z, Θ^(t)) is calculated by Eqs. (5) and (6).


(2) M-step: Next, the new means μ_k^{X(t+1)} and the new covariance matrices Σ_k^{X(t+1)} of the kth mixture component are obtained by the following update formulas:

\mu_k^{X(t+1)} = \frac{\sum_{p=1}^{N} P(k \mid x_p, \Theta^{(t)})\, x_p}{\sum_{p=1}^{N} P(k \mid x_p, \Theta^{(t)})},    (8)

\Sigma_k^{X(t+1)} = \frac{\sum_{p=1}^{N} P(k \mid x_p, \Theta^{(t)})\, (x_p - \mu_k^{X(t)})(x_p - \mu_k^{X(t)})^T}{\sum_{p=1}^{N} P(k \mid x_p, \Theta^{(t)})}.    (9)

The new eigenvalue parameters λ_{k,j}^{(t+1)} and the new eigenvector (PCA basis) parameters w_{k,j} are obtained by selecting the m largest eigenvalues in the eigenvector computation (PCA computation) as

\Sigma_k^{X(t+1)} w_{k,j} = \lambda_{k,j}^{(t+1)} w_{k,j}.    (10)

The mixing parameters P(k) can be estimated as follows:

P(k) = \frac{1}{N} \sum_{p=1}^{N} P(k \mid x_p, \Theta^{(t)}).    (11)

The two steps above are repeated until the estimates of the three sought parameters no longer change between iterations (convergence).
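The following condensed EM loop is a sketch of the procedure above, under our own naming and simplifications: it uses full-covariance Gaussians in the E-step (rather than the reduced densities of Eq. (6)) and re-extracts the m leading eigenpairs from each updated covariance, as in Eq. (10). The initialization, the small regularization constant, and the fixed iteration count are our choices, not the paper's.

```python
import numpy as np

def gaussian(X, mu, cov):
    """Row-wise Gaussian density N(x; mu, cov) for a data matrix X."""
    n = X.shape[1]
    D = X - mu
    sol = np.linalg.solve(cov, D.T).T
    expo = -0.5 * np.einsum('ij,ij->i', D, sol)
    _, logdet = np.linalg.slogdet(cov)
    return np.exp(expo - 0.5 * (n * np.log(2 * np.pi) + logdet))

def em_pca_mixture(X, K, m, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    N, n = X.shape
    mus = X[rng.choice(N, K, replace=False)]            # crude initialization
    covs = np.stack([np.cov(X.T) + 1e-6 * np.eye(n)] * K)
    priors = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step (Eq. (7)): responsibilities of each component for each point.
        R = np.stack([priors[k] * gaussian(X, mus[k], covs[k])
                      for k in range(K)], axis=1)
        R /= R.sum(axis=1, keepdims=True)
        # M-step (Eqs. (8), (9), (11)): update means, covariances, priors.
        Nk = R.sum(axis=0)
        mus = (R.T @ X) / Nk[:, None]
        for k in range(K):
            D = X - mus[k]
            covs[k] = (R[:, k, None] * D).T @ D / Nk[k] + 1e-6 * np.eye(n)
        priors = Nk / N
    # Eq. (10): per-cluster PCA bases and eigenvalues from final covariances.
    bases, eigvals = [], []
    for k in range(K):
        w, V = np.linalg.eigh(covs[k])
        bases.append(V[:, ::-1][:, :m])
        eigvals.append(w[::-1][:m])
    return mus, bases, eigvals, priors
```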

2.2. LDA mixture model

LDA has one transformation matrix over all classes. This degrades the performance of LDA because one transformation matrix is not enough for the classification of complex data with many classes with high variations. To overcome this limitation, we propose the LDA mixture model, which uses several transformation matrices over all classes. The LDA mixture model partitions the set of all classes into an appropriate number of clusters and applies LDA to each cluster independently.

Specifically, we apply the PCA mixture model with K mixture components to the set of class means m_i. Then, for the kth mixture component, we obtain a cluster mean c_k, a transformation matrix T_k, and a diagonal matrix V_k whose jth diagonal element is λ_{k,j}, the jth largest eigenvalue of the covariance matrix. In this case, the probabilistic covariance matrix for the kth mixture component is T_k V_k T_k^t. Using this result, we take the between-class scatter matrix for the kth mixture component (which corresponds to the kth cluster) as

S_{B_k} = T_k V_k T_k^t,    (12)

and the within-class scatter matrix as

S_{W_k} = \sum_{l \in L_k} \frac{1}{n_l} \sum_{x \in C_l} (x - m_l)(x - m_l)^t,    (13)

where k = 1, 2, ..., K, l = 1, 2, ..., c, and L_k is the set of class labels belonging to the kth mixture component, determined by

L_k = \left\{ i \,\middle|\, k = \arg\min_j (m_i - c_j) S_{B_j}^{-1} (m_i - c_j)^t \right\}.

Based on S_{W_k} and S_{B_k}, the transformation matrix W_k for the kth mixture component is determined so as to maximize the criterion function

J_k(W) = \frac{|W^t S_{B_k} W|}{|W^t S_{W_k} W|}.    (14)

The columns of the optimal W_k are the generalized eigenvectors w_{k_j} that correspond to the largest eigenvalues in

S_{B_k} w_{k_j} = \lambda_{k_j} S_{W_k} w_{k_j}.    (15)
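A sketch of this per-cluster step, under our notation and assumptions: S_{B_k} comes from the PCA mixture fit (Eq. (12)), S_{W_k} from Eq. (13), and W_k from the generalized eigenproblem of Eq. (15), solved here via a pseudo-inverse for robustness. class_data (a mapping from class label to its sample array) and labels_in_cluster (the set L_k) are assumed inputs.

```python
import numpy as np

def cluster_lda(class_data, labels_in_cluster, T_k, V_k, m):
    """Per-cluster LDA transform W_k from Eqs. (12)-(15)."""
    n = T_k.shape[0]
    S_Bk = T_k @ np.diag(V_k) @ T_k.T        # Eq. (12)
    S_Wk = np.zeros((n, n))
    for l in labels_in_cluster:              # Eq. (13)
        Xl = class_data[l]                   # samples of class l
        ml = Xl.mean(axis=0)
        S_Wk += (Xl - ml).T @ (Xl - ml) / len(Xl)
    # Generalized eigenproblem of Eq. (15): S_Bk w = lambda S_Wk w.
    evals, evecs = np.linalg.eig(np.linalg.pinv(S_Wk) @ S_Bk)
    top = np.argsort(evals.real)[::-1][:m]
    return evecs[:, top].real                # columns of W_k, Eq. (14)
```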

2.3. Model selection

When we use mixture models, we face a model selection issue: how many mixture components should we use? One strategy for finding the proper number of mixture components is cross-validation. We can set aside part of the training set as a validation set and evaluate the performance on the validation set over different numbers of mixture components. We then use the number of mixture components that gives the best performance on the validation set.
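A sketch of this strategy; fit_model and error_rate are hypothetical placeholders for the training and evaluation routines of whichever method is being tuned, and the candidate list is illustrative.

```python
def select_num_components(X_train, y_train, X_val, y_val,
                          candidates=(1, 2, 3, 4)):
    """Pick the number of mixture components K by held-out validation error."""
    best_K, best_err = None, float('inf')
    for K in candidates:
        model = fit_model(X_train, y_train, K)   # hypothetical training routine
        err = error_rate(model, X_val, y_val)    # hypothetical evaluation routine
        if err < best_err:
            best_K, best_err = K, err
    return best_K
```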


3. Simulation results and discussion

We took a subset of the PSL database obtained from the MPEG-7 community (Wang and Tan, 2000). The PSL database consists of normalized images of 271 people, where the images of some people have lighting variations and the images of the others have pose variations. Among all images, we selected the images of 133 people with lighting variations. In this subset there are five images for each person: normal, smiling, and angry expressions, plus left-illuminated and right-illuminated images. Since the number of images for each class is too small, we used 5-fold cross-validation to test the classification performance. Fig. 2 shows some images used in the simulation.

Fig. 2. Some face images used in the simulation.

We applied LDA and the LDA mixture model to the data. For comparison, we also applied PCA (the eigenface method (Turk and Pentland, 1991)) and the PCA mixture model. We used the following recognition method for each approach, which is similar to the methods used in synthetic data classification. For PCA and LDA, we transform the training data and the test data x by a transformation matrix (T for PCA, W for LDA) and assign the test data x to the class C_PCA (C_LDA) of the transformed training sample that is nearest to the transformed test data as

C_{PCA} = L\left( \arg\min_{x_r} \| (x - x_r) T \| \right),    (16)

and

C_{LDA} = L\left( \arg\min_{x_r} \| (x - x_r) W \| \right),    (17)

where x_r is a sample in the training set and L(x_r) indicates the class label of the sample x_r. For the PCA mixture model and the LDA mixture model, we transform the test data x and the training data x_r by the transformation matrix (T_k for the PCA mixture model, W_k for the LDA mixture model) of the corresponding mixture component, and assign the test data to the class C_PM (C_LM) of the transformed training sample that is nearest to the transformed test data under the corresponding transformation matrix T_k or W_k as

C_{PM} = L\left( \arg\min_{x_r} \| (x - x_r) T_{I(x_r)} \| \right),    (18)

and

C_{LM} = L\left( \arg\min_{x_r} \| (x - x_r) W_{I(x_r)} \| \right),    (19)

where x_r is a sample in the training set, L(x_r) indicates the class label of the sample x_r, and I(x_r) indicates the mixture component to which the training sample x_r belongs, determined by

I(x_r) = \arg\min_k (m_{L(x_r)} - c_k) S_{B_k}^{-1} (m_{L(x_r)} - c_k)^t.
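A sketch of these decision rules with our own names: for plain PCA or LDA, pass a single transform; for the mixture variants, pass a per-class component index (component_of, an assumed mapping from class label to mixture component) so that each training sample is compared under its own component's matrix, as in Eqs. (18) and (19).

```python
import numpy as np

def classify(x, train_X, train_labels, transforms, component_of=None):
    """Nearest-neighbor rule of Eqs. (16)-(19) in the transformed space."""
    best_label, best_dist = None, np.inf
    for xr, label in zip(train_X, train_labels):
        # Single transform for PCA/LDA; per-component transform for mixtures.
        W = transforms[0] if component_of is None else transforms[component_of[label]]
        d = np.linalg.norm(W.T @ (x - xr))   # ||(x - x_r) W|| with column vectors
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label
```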

Even when more than two mixture components were used, we did not observe any significant improvement in classification performance. Therefore, we used only two mixture components for learning the PCA mixture model in both the PCA mixture model and the LDA mixture model. Fig. 3 plots the classification errors on the test data against the number of features for each method. The number of features and the classification errors in the best case are shown in Table 1. From Fig. 3, we note that the LDA mixture model provides good classification performance as the number of features increases. For a small number of features, the PCA mixture model outperforms the LDA mixture model, but the LDA mixture model outperforms the PCA mixture model when the number of features used exceeds 35. We also note that the classification performance of LDA is better than that of PCA, and that the PCA mixture model and the LDA mixture model outperform PCA and LDA, respectively.

We manually inspected the automatically found members of the two clusters. We could not find any special pattern except that similar face images were most likely to be in the same cluster.

Fig. 3. Classification errors vs. the number of features (PCA, PCA mixture (PM), LDA, LDA mixture (LM)).

Table 1
The best performance and the corresponding number of features

Method         Number of features   Classification error (%)
PCA            92                   22.41
PCA mixture    36                   18.65
LDA            84                   21.50
LDA mixture    54                   12.78

4. Conclusion

We proposed the LDA mixture model to overcome a limitation of LDA; it provides several transformation matrices by taking one transformation matrix for each cluster. This modification aims to represent the underlying data more accurately through the PCA mixture model and to improve data discrimination by the use of several cluster-wise transformation matrices. Limited simulation results showed that the LDA mixture model outperformed PCA, LDA, and the PCA mixture model given a sufficient number of features. For complex data consisting of many classes with high variations, such as face images, the LDA mixture model can effectively be used as an alternative to LDA.

Acknowledgements

The authors would like to thank the Ministry of Education of Korea for its financial support of the Electrical and Computer Engineering Division at POSTECH through its BK21 program. This research was also supported by the Brain Science and Engineering Research Program of the Ministry of Science and Technology of Korea.

References

Belhumeur, P., Hespanha, J., Kriegman, D., 1997. Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Anal. Machine Intell. 19 (7), 711–720.

Brunelli, R., Poggio, T., 1993. Face recognition: Features versus templates. IEEE Trans. Pattern Anal. Machine Intell. 15 (10), 1042–1052.

Chellappa, R., Wilson, C., Sirohey, S., 1995. Human and

machine recognition of faces: A survey. Proc. IEEE 83 (5),

705–740.

Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B 39 (1), 1–38.

Duc, B., Fischer, S., Bigun, J., 1999. Face authentication with

Gabor information on deformable graphs. IEEE Trans.

Image Process. 8 (4), 504–516.

Duda, R., Hart, P., Stork, D., 2001. Pattern Classification.

Wiley, New York.

Etemad, K., Chellappa, R., 1997. Discriminant analysis for

recognition of human face images. J. Opt. Soc. Amer. A 14

(8), 1724–1733.

Hotelling, H., 1933. Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 24, 417–441.

Jacobs, R., Jordan, M., Nowlan, S., Hinton, G., 1991. Adaptive

mixtures of local experts. Neural Comput. 3, 79–87.

Jolliffe, I.T., 1986. Principal Component Analysis. Springer-Verlag, New York.


Jordan, M., Jacobs, R., 1994. Hierarchical mixtures of experts and the EM algorithm. Neural Comput. 6 (2), 181–214.

Sirovich, L., Kirby, M., 1987. Low-dimensional procedure for

the characterization of human faces. J. Opt. Soc. Amer., A 4

(3), 519–524.

Tipping, M., Bishop, C., 1999. Mixtures of probabilistic

principal component analyzers. Neural Comput. 11, 443–

482.

Turk, M., Pentland, A., 1991. Eigenfaces for recognition. J. Cognitive Neurosci. 3 (1), 71–86.

Wang, L., Tan, T.K., 2000. Experimental results of face

description based on the 2nd-order eigenface method,

ISO/MPEG m6001, Geneva.

Wiskott, L., Fellous, J.M., Krüger, N., von der Malsburg, C., 1997. Face recognition by elastic bunch graph matching. IEEE Trans. Pattern Anal. Machine Intell. 19 (7), 775–779.