Pattern Recognition Letters 24 (2003) 2815–2821
www.elsevier.com/locate/patrec
Face recognition using LDA mixture model
Hyun-Chul Kim, Daijin Kim *, Sung Yang Bang
Department of Computer Science and Engineering, POSTECH, San 31, Hyoja-Dong, Nam-Gu, Pohang 790-784, South Korea
Received 22 February 2002; received in revised form 8 May 2003
Abstract
Linear discriminant analysis (LDA) provides a projection that discriminates data well, and shows good performance for face recognition. However, since LDA provides only one transformation matrix over the whole data, it is not sufficient to discriminate complex data consisting of many classes with high variations, such as human faces. To overcome this weakness, we propose a new face recognition method based on the LDA mixture model, in which the set of all classes is partitioned into several clusters and a transformation matrix is obtained for each cluster. This more accurate and detailed representation improves classification performance. Simulation results on face recognition show that the LDA mixture model outperforms PCA, LDA, and the PCA mixture model in terms of classification performance.
© 2003 Elsevier B.V. All rights reserved.
Keywords: Linear discriminant analysis; LDA mixture model; PCA mixture model; Face recognition
1. Introduction
Face recognition is an active research area spanning several research fields such as image processing, pattern recognition, computer vision, and neural networks (Chellappa et al., 1995). Face recognition has many applications, such as biometric systems, surveillance, and content-based video processing systems. However, there are still many open problems, such as recognition under high illumination variation.
There are two main approaches to face recognition (Chellappa et al., 1995; Brunelli and
Poggio, 1993). The first approach is the feature-based matching approach, which uses the relationships between facial features such as the eyes, mouth, and nose (Brunelli and Poggio, 1993; Wiskott et al., 1997). The second approach is the template matching approach, which uses the holistic features of the face image (Brunelli and Poggio, 1993; Turk and Pentland, 1991; Belhumeur et al., 1997).
The eigenface method is a well-known template matching method (Sirovich and Kirby, 1987; Turk and Pentland, 1991). Face recognition using the eigenface method is performed on feature values transformed by PCA. Linear discriminant analysis (LDA) has also been applied to face recognition. The face recognition method using LDA is called the fisherface method. It can be applied directly to the gray image (Belhumeur et al., 1997;
Etemad and Chellappa, 1997), or to feature vectors of the gray image extracted on a sparse grid (Duc et al., 1999). In both cases the classification performance is significantly better than that of the PCA counterparts.
LDA is a well-known classical statistical technique that uses the projection maximizing the ratio of the scatter among the data of different classes to the scatter within the data of the same class (Duda et al., 2001). Features obtained by LDA are useful for pattern classification since they make the data of the same class closer to each other, and the data of different classes further away from each other.
Typically, LDA is compared with PCA because both methods are multivariate statistical techniques for projection. PCA attempts to locate the projection that reduces the dimensionality of a data set while retaining as much of the variation in the data set as possible (Jolliffe, 1986). Since PCA does not use class information, LDA usually outperforms PCA for pattern classification.
Although LDA usually gives good discrimination performance, it has a drawback in that it has only one transformation matrix. For a problem with many classes and high variations, only one transformation matrix is not sufficient for good discrimination. To overcome this drawback, we propose the LDA mixture model, which uses several transformation matrices. We partition the set of all classes into several clusters, and we apply the LDA technique to each cluster. In this way, we obtain several transformation matrices, which can then be used for recognition cluster-wise. For face image data, each cluster may correspond to a race or a group of similar faces. By using more than one transformation matrix, we can obtain better discrimination performance.
This paper is organized as follows. Section 2 describes the PCA mixture model and the LDA mixture model. Section 3 presents and discusses the simulation results. Finally, we present our conclusions.
2. LDA mixture model
As mentioned previously, the LDA mixture model requires partitioning the set of all classes into several clusters. In this work, we use the PCA mixture model for data partitioning because it can accurately divide the entire data into its constituent components, which makes the formulation of the LDA mixture model simpler. Therefore, we explain the PCA mixture model first, followed by the LDA mixture model.
2.1. PCA mixture model
In a mixture model (Jacobs et al., 1991; Jordan and Jacobs, 1994), a class is partitioned into a number of clusters, and the density function of the $n$-dimensional observed data $x = \{x_1, \ldots, x_n\}$ is represented by a linear combination of the component densities of the partitioned clusters as

$$P(x) = \sum_{k=1}^{K} P(x \mid k, \theta_k)\, P(k), \qquad (1)$$

where $P(x \mid k, \theta_k)$ and $P(k)$ represent the conditional density and the prior probability of the $k$th cluster, respectively, and $\theta_k$ is the unknown model parameter of the $k$th cluster. The prior probabilities are chosen to satisfy $\sum_{k=1}^{K} P(k) = 1$. The conditional density function $P(x \mid k, \theta_k)$ is often modelled by a Gaussian function as

$$P(x \mid k, \theta_k) = \frac{1}{(2\pi)^{n/2} |\Sigma_k|^{1/2}} \exp\left( -\frac{1}{2}(x - \mu_k)^{t}\, \Sigma_k^{-1} (x - \mu_k) \right), \qquad (2)$$
where $\mu_k$ and $\Sigma_k$ are the sample mean and covariance of the $k$th cluster, respectively.
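As a concrete reading of Eqs. (1) and (2), the following minimal NumPy sketch evaluates the mixture density; the function names and argument layout are illustrative, not the authors' code.

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """Eq. (2): n-dimensional Gaussian density of one cluster."""
    n = mu.shape[0]
    diff = x - mu
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(sigma))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff)) / norm)

def mixture_density(x, priors, mus, sigmas):
    """Eq. (1): P(x) as a prior-weighted sum of component densities."""
    return sum(p * gaussian_density(x, mu, sig)
               for p, mu, sig in zip(priors, mus, sigmas))
```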
The number of data points required to estimate the parameters of density functions defined on high-dimensional feature spaces increases at least proportionally to the square of the dimensionality (the curse of dimensionality). Consequently, in practical applications such as face classification, the dimensionality of the feature space needs to be reduced. To accomplish this, we use the PCA technique in this work.
Fig. 1. An illustration of a PCA mixture model.
Let $E[x]$ be the expectation of the stochastic vector $x$. In PCA (Hotelling, 1933), a set of observed $n$-dimensional data vectors $X = \{x_p\}$, $p \in \{1, \ldots, N\}$, is reduced to a set of $m$-dimensional feature vectors $S = \{s_p\}$, $p \in \{1, \ldots, N\}$, by a transformation matrix $T$ as

$$s_p = T^{t}(x_p - E[x]), \qquad (3)$$

where $m \le n$ and $T = (w_1, \ldots, w_m)$, the vector $w_j$ being the eigenvector corresponding to the $j$th largest eigenvalue of the sample covariance matrix $C = \frac{1}{N}\sum_{p=1}^{N}(x_p - E[x])(x_p - E[x])^{t}$, such that $C w_k = \lambda_k w_k$. The $m$ principal axes $T$ are orthonormal axes onto which the retained variance under projection is maximal. One property of PCA is that projection onto the principal subspace minimizes the squared reconstruction error $\sum_{p=1}^{N} \|x_p - \hat{x}_p\|^2$. The optimal linear reconstruction is given by $\hat{x}_p = T s_p + E[x]$, where $s_p = T^{t}(x_p - E[x])$, and the orthogonal columns of $T$ span the space of the principal $m$ eigenvectors of $C$.
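A compact sketch of the PCA transform of Eq. (3) via an eigendecomposition of the sample covariance; `pca_transform` is an illustrative name, not a library call.

```python
import numpy as np

def pca_transform(X, m):
    """Eq. (3): project the rows of X (N x n) onto the m principal axes."""
    mean = X.mean(axis=0)                    # E[x]
    Xc = X - mean
    C = Xc.T @ Xc / X.shape[0]               # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)     # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:m]    # indices of the m largest
    T = eigvecs[:, order]                    # T = (w_1, ..., w_m)
    S = Xc @ T                               # s_p = T^t (x_p - E[x])
    return S, T, eigvals[order], mean
```

The optimal linear reconstruction discussed above is then recovered as `S @ T.T + mean`.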
We consider a PCA mixture model which combines the above two models (Eqs. (1) and (3)) in such a way that the component densities of the mixture model can be estimated in the PCA-transformed space as

$$P(x) = \sum_{k=1}^{K} P(x \mid k, \theta_k)\, P(k), \qquad (4)$$

$$P(x \mid k, \theta_k) = P(s_k \mid k, \theta_k), \qquad (5)$$

where $s_k = W_k^{t}(x - \mu_k)$. The PCA feature vectors $s_k$ are decorrelated due to the orthogonality of the transformation matrix $W_k$. Thus, their covariance matrix $\Sigma_k^{S} = E[s_k s_k^{t}]$ is a diagonal matrix whose diagonal elements correspond to the principal eigenvalues. Next, the conditional density function $P(s_k \mid k, \theta_k)$ of the PCA feature vectors in the $k$th cluster can be simplified as

$$P(s_k \mid k, \theta_k) = \frac{1}{(2\pi)^{m/2} |\Sigma_k^{S}|^{1/2}} \exp\left(-\frac{1}{2} s_k^{t} (\Sigma_k^{S})^{-1} s_k\right) = \prod_{j=1}^{m} \frac{1}{(2\pi)^{1/2} \lambda_{k,j}^{1/2}} \exp\left(-\frac{s_j^2}{2\lambda_{k,j}}\right), \qquad (6)$$

where $s = \{s_1, \ldots, s_m\}$ and $\{\lambda_{k,1}, \ldots, \lambda_{k,m}\}$ are the $m$ dominant eigenvalues of the feature covariance matrix $\Sigma_k^{S}$ in the $k$th cluster. The proposed model, which has no Gaussian error term, can be considered a simplified form of the Tipping and Bishop model (Tipping and Bishop, 1999).
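Eq. (6) reduces each component density to a product of univariate Gaussians along the principal axes. A minimal sketch, with illustrative names:

```python
import numpy as np

def component_density(x, mu_k, W_k, lam_k):
    """Eq. (6): density of x under the k-th PCA mixture component.

    W_k   : n x m matrix of the m principal eigenvectors of cluster k.
    lam_k : the m dominant eigenvalues (the diagonal of Sigma_k^S).
    """
    s = W_k.T @ (x - mu_k)              # s_k = W_k^t (x - mu_k)
    # product of m independent 1-D Gaussians with variances lam_k
    return float(np.prod(np.exp(-s ** 2 / (2 * lam_k))
                         / np.sqrt(2 * np.pi * lam_k)))
```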
Fig. 1 illustrates a PCA mixture model where the number of mixture components is $K = 2$ and the dimension of the feature vectors is $m = 2$; the line segments in each cluster represent the two column vectors $w_1$ and $w_2$, and the intersection of the two line segments represents a mean vector $E[x]$.

The parameters of a mixture model can be estimated by an EM algorithm, which maximizes the likelihood (Dempster et al., 1977). The EM algorithm for the proposed model can be easily derived. Each iteration consists of two steps: an expectation step (E-step) followed by a maximization step (M-step). Each step is run for each mixture component. The EM algorithm starts after the parameters are initialized, and stops when the density undergoes no further change.
(1) E-step: Given the feature data set $X$ and the parameters $\Theta^{(t)}$ of the mixture model at the $t$th iteration, we estimate the posterior distribution $P(z \mid x, \Theta^{(t)})$ using

$$P(z \mid x, \Theta^{(t)}) = \frac{P(x \mid z, \Theta^{(t)})\, P(z)}{\sum_{k=1}^{K} P(x \mid k, \Theta^{(t)})\, P(k)}, \qquad (7)$$

where $P(x \mid z, \Theta^{(t)})$ is calculated by Eqs. (5) and (6).
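The E-step of Eq. (7) normalizes the prior-weighted component densities over the components; this sketch reuses `component_density` from the sketch above.

```python
import numpy as np

def e_step(X, priors, mus, Ws, lams):
    """Eq. (7): responsibilities P(k | x_p, Theta^(t)) for all p and k."""
    N, K = X.shape[0], len(priors)
    resp = np.empty((N, K))
    for p in range(N):
        for k in range(K):
            resp[p, k] = priors[k] * component_density(X[p], mus[k],
                                                       Ws[k], lams[k])
        resp[p] /= resp[p].sum()        # normalize over components
    return resp
```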
(2) M-step: Next, the new means $\mu_k^{X(t+1)}$ and the new covariance matrices $\Sigma_k^{X(t+1)}$ of the $k$th mixture component are obtained by the following update formulae:

$$\mu_k^{X(t+1)} = \frac{\sum_{p=1}^{N} P(k \mid x_p, \Theta^{(t)})\, x_p}{\sum_{p=1}^{N} P(k \mid x_p, \Theta^{(t)})}, \qquad (8)$$

$$\Sigma_k^{X(t+1)} = \frac{\sum_{p=1}^{N} P(k \mid x_p, \Theta^{(t)})\, (x_p - \mu_k^{X(t)})^{t} (x_p - \mu_k^{X(t)})}{\sum_{p=1}^{N} P(k \mid x_p, \Theta^{(t)})}. \qquad (9)$$

The new eigenvalue parameters $\lambda_{k,j}^{(t+1)}$ and the new eigenvector (PCA basis) parameters $w_{k,j}$ are obtained by selecting the largest $m$ eigenvalues in the eigenvector computation (PCA computation) as

$$\Sigma_k^{X(t+1)} w_{k,j} = \lambda_{k,j}^{(t+1)} w_{k,j}. \qquad (10)$$

The mixing parameters $P(z_p)$ can be estimated as follows:

$$P(k) = \frac{1}{N} \sum_{p=1}^{N} P(k \mid x_p, \Theta^{(t)}). \qquad (11)$$
The two steps above are repeated until the estimates of the sought parameters no longer change between iterations (convergence).
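A sketch of the M-step updates of Eqs. (8)-(11), including the per-component eigendecomposition of Eq. (10); alternating this with `e_step` above until the estimates stabilize gives the EM loop just described. As an assumption, the covariance update uses the freshly updated mean, which is the standard EM form.

```python
import numpy as np

def m_step(X, resp, m):
    """Eqs. (8)-(11): update priors, means, PCA bases and eigenvalues.

    X is an N x n data matrix; resp is the N x K responsibility
    matrix returned by e_step.
    """
    N, K = resp.shape
    priors, mus, Ws, lams = [], [], [], []
    for k in range(K):
        r = resp[:, k]
        mu = r @ X / r.sum()                         # Eq. (8)
        Xc = X - mu
        sigma = (Xc * r[:, None]).T @ Xc / r.sum()   # Eq. (9)
        eigvals, eigvecs = np.linalg.eigh(sigma)
        order = np.argsort(eigvals)[::-1][:m]        # m largest, Eq. (10)
        priors.append(r.mean())                      # Eq. (11)
        mus.append(mu)
        Ws.append(eigvecs[:, order])
        lams.append(eigvals[order])
    return priors, mus, Ws, lams
```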
2.2. LDA mixture model
LDA has one transformation matrix over all classes. This property degrades the performance of LDA because only one transformation matrix is not enough for the classification of complex data with many classes and high variations. To overcome this limitation, we propose the LDA mixture model, which uses several transformation matrices over all classes. The LDA mixture model partitions the set of all classes into an appropriate number of clusters and applies LDA to each cluster independently.
Specifically, we apply the PCA mixture model to the set of class means $m_i$ with $K$ mixture components. Then, for the $k$th mixture component, we obtain a cluster mean $c_k$, a transformation matrix $T_k$, and a diagonal matrix $V_k$ of eigenvalues, whose $j$th diagonal element is the $j$th largest eigenvalue $\lambda_{k,j}$ of the covariance matrix. In this case, the probabilistic covariance matrix for the $k$th mixture component is $T_k V_k T_k^{t}$. Using this result, we take the between-class scatter matrix for the $k$th mixture component (which corresponds to the $k$th cluster) as

$$S_{B_k} = T_k V_k T_k^{t}, \qquad (12)$$

and the within-class scatter matrix as

$$S_{W_k} = \sum_{l \in L_k} \frac{1}{n_l} \sum_{x \in C_l} (x - m_l)(x - m_l)^{t}, \qquad (13)$$

where $k = 1, 2, \ldots, K$, $l = 1, 2, \ldots, c$, and $L_k$ is the set of class labels belonging to the $k$th mixture component, determined by

$$L_k = \left\{ j \;\middle|\; k = \arg\min_i \, (m_j - c_i)\, S_{B_i}^{-1} (m_j - c_i)^{t} \right\}.$$
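To make Eqs. (12) and (13) concrete, here is a small sketch assembling the two scatter matrices for one cluster. The data layout (one array of samples per class, plus a class-to-cluster assignment computed beforehand by the rule for $L_k$) is an assumption for illustration.

```python
import numpy as np

def cluster_scatter(k, assign, class_means, class_samples, Ts, Vs):
    """Eqs. (12)-(13): scatter matrices for the k-th cluster.

    assign[l] == k   marks class l as a member of L_k.
    class_samples[l] is an (n_l x n) array of the samples of class C_l.
    Ts[k], Vs[k]     are the cluster's PCA basis and eigenvalue vector.
    """
    S_B = Ts[k] @ np.diag(Vs[k]) @ Ts[k].T                # Eq. (12)
    n = Ts[k].shape[0]
    S_W = np.zeros((n, n))
    for l, Xl in enumerate(class_samples):
        if assign[l] != k:                                # l not in L_k
            continue
        d = Xl - class_means[l]
        S_W += d.T @ d / Xl.shape[0]                      # Eq. (13)
    return S_B, S_W
```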
Based on $S_{W_k}$ and $S_{B_k}$, the transformation matrix $W_k$ for the $k$th mixture component is determined so as to maximize the criterion function

$$J_k(W) = \frac{|W^{t} S_{B_k} W|}{|W^{t} S_{W_k} W|}. \qquad (14)$$

The columns of the optimal $W_k$ are the generalized eigenvectors $w_{kj}$ that correspond to the largest eigenvalues in

$$S_{B_k} w_{kj} = \lambda_{kj} S_{W_k} w_{kj}. \qquad (15)$$
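The maximizer of Eq. (14) can be obtained from the generalized eigenproblem of Eq. (15). A sketch using SciPy's symmetric generalized eigensolver, assuming $S_{W_k}$ is positive definite:

```python
import numpy as np
from scipy.linalg import eigh

def lda_basis(S_B, S_W, d):
    """Eq. (15): the d dominant generalized eigenvectors of
    S_B w = lambda * S_W w, which maximize the criterion of Eq. (14)."""
    eigvals, eigvecs = eigh(S_B, S_W)            # ascending eigenvalues
    return eigvecs[:, np.argsort(eigvals)[::-1][:d]]
```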
2.3. Model selection
When we use any kind of mixture model, we face a model selection issue, i.e., how many mixture components to use. One strategy for finding the proper number of mixture components is cross-validation: we can set aside part of the training set as a validation set, evaluate the performance on the validation set over different numbers of mixture components, and use the number of mixture components that gives the best performance on the validation set.
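A minimal sketch of this selection strategy; `train_fn` and `error_fn` are hypothetical stand-ins for fitting a $K$-component model and measuring its validation error.

```python
def select_num_components(candidate_Ks, train_fn, error_fn,
                          X_train, y_train, X_val, y_val):
    """Cross-validated choice of the number of mixture components K."""
    errors = {}
    for K in candidate_Ks:
        model = train_fn(X_train, y_train, K)    # fit a K-component model
        errors[K] = error_fn(model, X_val, y_val)
    return min(errors, key=errors.get)           # K with lowest error
```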
3. Simulation results and discussion
We took a partial set of the PSL database obtained from the MPEG-7 community (Wang and Tan, 2000). The PSL database consists of normalized images of 271 people, where the images of some people have lighting variations and the images of the others have pose variations. Among all images, we selected the images of 133 people with lighting variations. In this partial set, there are five images for each person: normal, smiling, and angry expressions, plus left-illumination and right-illumination images. Since the number of images for each class is too small, we used 5-fold cross-validation to test the classification performance. Fig. 2 shows some of the images used in the simulation.

Fig. 2. Some face images used in the simulation.
We applied LDA and the LDA mixture model to the data. For comparison, we also applied PCA (the eigenface method (Turk and Pentland, 1991)) and the PCA mixture model. For each method, we used the following recognition procedure, which is similar to the methods used in synthetic data classification. For PCA and LDA, we transform the training data and the test data $x$ by a transformation matrix ($T$ for PCA, $W$ for LDA), and assign the test data $x$ to the class $C_{PCA(LDA)}$ of the transformed training sample that is nearest to the transformed test data as

$$C_{PCA} = L\left(\arg\min_{x_r} \|(x - x_r)\, T\|\right), \qquad (16)$$
and

$$C_{LDA} = L\left(\arg\min_{x_r} \|(x - x_r)\, W\|\right), \qquad (17)$$
where $x_r$ is a sample in the training set and $L(x_r)$ indicates the class label of the sample $x_r$. For the PCA mixture model and the LDA mixture model, we transform the test data $x$ and the training data $x_r$ by the transformation matrix ($T_k$ for the PCA mixture model, $W_k$ for the LDA mixture model) of the corresponding mixture component, and assign the test data to the class $C_{PM(LM)}$ of the transformed training sample that is nearest to the transformed test data under the corresponding transformation matrix $T_k$ or $W_k$ as

$$C_{PM} = L\left(\arg\min_{x_r} \|(x - x_r)\, T_{I(x_r)}\|\right), \qquad (18)$$

and

$$C_{LM} = L\left(\arg\min_{x_r} \|(x - x_r)\, W_{I(x_r)}\|\right), \qquad (19)$$
where $x_r$ is a sample in the training set, $L(x_r)$ indicates the class label of the sample $x_r$, and $I(x_r)$ indicates the mixture component to which the training sample $x_r$ belongs, determined by

$$I(x_r) = \arg\min_i \, (m_{L(x_r)} - c_i)\, S_{B_i}^{-1} (m_{L(x_r)} - c_i)^{t}.$$
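The mixture classification rules of Eqs. (18) and (19) amount to a nearest-neighbour search in the component-wise transformed space; Eqs. (16) and (17) are the special case with a single transformation matrix. A sketch under the row-vector convention used above, with illustrative names:

```python
import numpy as np

def classify_mixture(x, X_train, labels, comp_of, Ws):
    """Eqs. (18)/(19): nearest neighbour in the transformed space.

    comp_of[r] is I(x_r), the mixture component of training sample r;
    Ws[k] is the component's transformation matrix (T_k or W_k).
    A sample x is transformed as the row-vector product x @ W.
    """
    dists = [np.linalg.norm((x - xr) @ Ws[comp_of[r]])
             for r, xr in enumerate(X_train)]
    return labels[int(np.argmin(dists))]         # L(argmin_r ...)
```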
Even when more than two mixture components were used, we did not observe any significant improvement in classification performance. So, we used only two mixture components for learning the PCA mixture model in both the PCA mixture model and the LDA mixture model. Fig. 3 plots the classification errors for the test data over different numbers of features when each method is applied. The number of features and the classification errors in the best case are shown in Table 1. From Fig. 3, we note that the LDA mixture model provides good classification performance as the number of features taken increases.
Fig. 3. Classification error rate vs. the number of features (PCA, PM, LDA, LM).
Table 1
The best performance and the corresponding number of features

Method         Number of features   Classification error (%)
PCA            92                   22.41
PCA mixture    36                   18.65
LDA            84                   21.50
LDA mixture    54                   12.78
For a small number of features, the PCA mixture model outperforms the LDA mixture model, but the LDA mixture model outperforms the PCA mixture model once the number of features used exceeds 35. We also note that the classification performance of LDA is better than that of PCA, and that the PCA mixture model and the LDA mixture model outperform PCA and LDA, respectively. We manually inspected the automatically found members of the two clusters. We could not find any special pattern, except that similar face images were most likely to be in the same cluster.
4. Conclusion
We proposed the LDA mixture model to overcome a limitation of LDA; it provides several transformation matrices by taking one transformation matrix for each cluster. This modification aims to represent the underlying data more accurately through the PCA mixture model and to improve data discrimination by the use of several cluster-wise transformation matrices. Limited simulation results showed that the LDA mixture model outperformed PCA, LDA, and the PCA mixture model given a sufficient number of features. For complex data consisting of many classes with high variations, such as face images, the LDA mixture model can effectively be used as an alternative to LDA.
Acknowledgements
The authors would like to thank the Ministry of Education of Korea for its financial support of the Electrical and Computer Engineering Division at POSTECH through its BK21 program. This research was also supported by the Brain Science and Engineering Research Program of the Ministry of Science and Technology of Korea.
References
Belhumeur, P., Hespanha, J., Kriegman, D., 1997. Eigenfaces
vs. Fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Anal. Machine Intell. 19 (7), 711–720.
Brunelli, R., Poggio, T., 1993. Face recognition: Features versus
templates. IEEE Trans. Pattern Anal. Machine Intell. 15
(10), 1042–1052.
Chellappa, R., Wilson, C., Sirohey, S., 1995. Human and
machine recognition of faces: A survey. Proc. IEEE 83 (5),
705–740.
Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B 39 (1), 1–38.
Duc, B., Fischer, S., Bigun, J., 1999. Face authentication with
Gabor information on deformable graphs. IEEE Trans.
Image Process. 8 (4), 504–516.
Duda, R., Hart, P., Stork, D., 2001. Pattern Classification.
Wiley, New York.
Etemad, K., Chellappa, R., 1997. Discriminant analysis for
recognition of human face images. J. Opt. Soc. Amer. A 14
(8), 1724–1733.
Hotelling, H., 1933. Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 24, 417–441.
Jacobs, R., Jordan, M., Nowlan, S., Hinton, G., 1991. Adaptive
mixtures of local experts. Neural Comput. 3, 79–87.
Jolliffe, I.T., 1986. Principal Component Analysis. Springer-Verlag, New York.
Jordan, M., Jacobs, R., 1994. Hierarchical mixtures of experts
and the EM algorithm. Neural Comput. 6 (2), 181–214.
Sirovich, L., Kirby, M., 1987. Low-dimensional procedure for
the characterization of human faces. J. Opt. Soc. Amer. A 4
(3), 519–524.
Tipping, M., Bishop, C., 1999. Mixtures of probabilistic
principal component analyzers. Neural Comput. 11, 443–
482.
Turk, M., Pentland, A., 1991. Eigenfaces for recognition. J. Cognitive Neurosci. 3 (1), 71–86.
Wang, L., Tan, T.K., 2000. Experimental results of face
description based on the 2nd-order eigenface method,
ISO/MPEG m6001, Geneva.
Wiskott, L., Fellous, J.M., Krüger, N., von der Malsburg, C., 1997.
Face recognition by elastic bunch graph matching. IEEE
Trans. Pattern Anal. Machine Intell. 19 (7), 775–780.