Le Satoh Unsupervised Face Annotation Icdm08



Unsupervised Face Annotation by Mining the Web

Duy-Dinh Le
National Institute of Informatics
2-1-2 Hitotsubashi, Chiyoda-ku
Tokyo, JAPAN 101-8430
[email protected]

Shin’ichi Satoh
National Institute of Informatics
2-1-2 Hitotsubashi, Chiyoda-ku
Tokyo, JAPAN 101-8430
[email protected]

Abstract

Searching for images of people is an essential task for image and video search engines. However, current search engines have limited capabilities for this task since they rely on text associated with images and video, and such text is likely to return many irrelevant results. We propose a method for retrieving relevant faces of one person by learning the visual consistency among results retrieved from text-correlation-based search engines. The method consists of two steps. In the first step, each candidate face obtained from a text-based search engine is ranked with a score that measures the distribution of visual similarities among the faces. Faces that are possibly very relevant or irrelevant are ranked at the top or bottom of the list, respectively. The second step improves this ranking by treating this problem as a classification problem in which input faces are classified as 'person-X' or 'non-person-X'; the faces are then re-ranked according to a relevance score inferred from the classifier's probability output. To train this classifier, we use a bagging-based framework to combine results from multiple weak classifiers trained using different subsets. These training subsets are extracted and labeled automatically from the ranked list produced by the classifier trained in the previous step. In this way, the accuracy of the ranked list increases after a number of iterations. Experimental results on various face sets retrieved from captions of news photos show that the retrieval performance improved after each iteration, with the final performance being higher than those of the existing algorithms.

1. Introduction

With the rapid growth of digital technology, large image and video databases have become more available than ever to users. This trend has shown the need for effective and efficient tools for indexing and retrieving based on visual content. A typical application is searching for a specific person by providing his or her name. Most current search engines use the text associated with images and video as significant clues for returning results. However, other un-queried faces and names may appear with the queried ones (Figure 1), and this significantly lowers the retrieval performance. One way to improve the retrieval performance is to take into account visual information present in the retrieved faces. This task is challenging for the following reasons:

• Large variations in facial appearance due to pose changes, illumination conditions, occlusions, and facial expressions make face recognition difficult even with state-of-the-art techniques [1, 21, 2] (see example in Figure 2).

• The fact that the retrieved face set consists of faces of several people with no labels makes it difficult to apply standard supervised and unsupervised learning methods directly.

We propose a method for solving the above problem. The main idea is to assume that there is visual consistency among the results returned from text-based search engines and that this visual consistency can be learned through an iterative process. This method consists of two stages. In the first stage, we explore the local density of faces to identify potential candidates for relevant faces¹ and irrelevant faces². This stage reflects the fact that the facial images of the queried person tend to form dense clusters, whereas irrelevant facial images are sparse since they look different from each other. For each face, we define a score to measure the density of its neighbor set. This score is used to form a ranked list, in which faces with high-density scores are considered relevant and are put at the top.

The above ranking method is weak since dense clusters have no guarantee of containing relevant faces. Therefore, a second stage is necessary to improve this ranked list. We model this problem as a classification problem in which input faces are classified as person-X (the queried person) or non-person-X (the un-queried person). The faces are ranked according to a relevancy score that is inferred from the classifier's probability output. Since annotation data is not available, the ranked list from the previous step is used to assign labels for a subset of faces. This subset is then used to train a classifier using supervised methods such as support vector machines (SVM). The trained classifier is used to re-rank faces in the original input set. This step is repeated a number of times to get the final ranked list. Since automatically assigning labels from the ranked list is not reliable, the trained classifiers are weak. To obtain the final strong classifier, we use the idea of ensemble learning [6], in which weak classifiers trained on different subsets are combined to improve the stability and classification accuracy of single classifiers. The learned classifier can be further used for recognizing new facial images of the queried person.

Figure 1. A news photo and its caption. Extracted faces are shown on the top. These faces might be returned for the query of person-Bush.

¹ Faces related to the queried person.
² Faces unrelated to the queried person.

2008 Eighth IEEE International Conference on Data Mining
1550-4786/08 $25.00 © 2008 IEEE
DOI 10.1109/ICDM.2008.47

The second stage improves the ranked list and recognition performance for the following reasons:

• Supervised learning methods, such as SVM, provide a strong theoretical background for finding the optimal decision boundary even with noisy data. Furthermore, recent studies [20, 17] suggest that SVM classifiers provide probability outputs that are suitable for ranking.

• The bagging framework helps to mitigate the noise introduced by the unsupervised labeling process.

Figure 2. Large variations in facial expressions, poses, illumination conditions, and occlusions make face recognition difficult. Best viewed in color.

Our contribution is two-fold:

• We propose a general framework to boost the face retrieval performance of text-based search engines by visual consistency learning. The framework seamlessly integrates data mining techniques such as supervised learning and unsupervised learning based on bagging. Our framework requires only a few parameters and works stably.

• We demonstrate its feasibility with a practical web mining application. A comprehensive evaluation on a large face dataset of many people was carried out and confirmed that our approach is promising.

2. Related Work

There are several approaches for re-ranking and learning models from web images. Their underlying assumption is that text-based search engines return a large fraction of relevant images. The challenge is how to model what is common in the relevant images. One approach is to model this problem in a probabilistic framework in which the returned images are used to learn the parameters of the model. For example, as described by Fergus et al. [12], objects retrieved using an image search engine are re-ranked by extending the constellation model. Another proposal, described in [15], uses a non-parametric graphical model and an interactive framework to simultaneously learn object class models and collect object class datasets. The main contribution of these approaches is probabilistic models that can be learned with a small number of training images. However, these models are complicated since they require several hundred parameters for learning and are susceptible to over-fitting. Furthermore, to obtain robust models, a small amount of supervision is required to select seed images.

Another study [4, 3] proposed a clustering-based method for associating names and faces in news photos. To solve the problem of ambiguity between several names and one face, a modified k-means clustering process was used in which faces are assigned to the closest cluster (each cluster corresponding to one name) after a number of iterations. Although the result was impressive, it is not easy to apply it to our problem since it is based on a strong assumption that requires a perfect alignment when a news photo only has one face and its caption only has one name. Furthermore, a large number of irrelevant faces (more than 12%) have to be manually eliminated before clustering.

A graph-based approach proposed by Ozkan and Duygulu [16], in which a graph is formed with faces as nodes and the similarities between faces as edge weights, is closely related to our problem. Assuming that the number of faces of the queried person is larger than that of others and that these faces tend to form the most similar subset among the set of retrieved faces, this problem is considered equivalent to the problem of finding the densest subgraph of a full graph and can therefore be solved with an available solution [9]. Although experimental results showed the effectiveness of this method, it is still questionable whether the densest subgraph intuitively describes most of the relevant faces of the queried person and whether the method is easy to extend to the ranking problem. Furthermore, choosing an optimal threshold to convert the initial graph into a binary one is difficult and rather ad hoc due to the curse of dimensionality.

An advantage of the methods in [4, 3, 16] is that they are fully unsupervised. However, a disadvantage is that no model is learned for predicting new images of the same category. Furthermore, they perform hard categorization on input images, which is inapplicable to re-ranking. The balance of recall and precision was not addressed. Typically, these approaches tend to sacrifice recall to obtain high precision, which reduces the number of collected images.

Our approach combines a number of advances over the existing approaches. Specifically, we learn a model for each query from the returned images for purposes such as re-ranking and predicting new images. However, we use an unsupervised method to select training samples automatically, which is different from the methods proposed by Fergus et al. and Li et al. [12, 15]. This unsupervised method is different from the one by Ozkan and Duygulu [16] in the modeling of the distribution of relevant images: we use density-based estimation rather than the densest subgraph.

3. Proposed Framework

Given a set of images returned by any text-based search engine for a queried person (e.g., 'George Bush'), we perform a ranking process and learn person-X's model as follows:

• Step 1: Detect faces and eye positions, and then perform face normalizations.

• Step 2: Compute an eigenface space and project the input faces into this subspace.

• Step 3: Estimate the ranked list of these faces using Rank-By-Local-Density-Score.

• Step 4: Improve this ranked list using Rank-By-Bagging-ProbSVM.

Steps 1 and 2 are typical for any face processing system, and they are described in section 4.2. The algorithms used in Steps 3 and 4 are described in section 3.1 and section 3.2, respectively. Figure 3 illustrates the proposed framework.

3.1 Ranking by Local Density Score

Figure 3. The proposed framework for re-ranking faces returned by text-based search engines.

Figure 4. An example of faces retrieved for person-Donald Rumsfeld. Irrelevant faces are marked with a star. Irrelevant faces might form several clusters, but the relevant faces form the largest cluster.

Among the faces retrieved by text-based search engines for a query of person-X, as shown in Figure 4, relevant faces usually look similar and form the largest cluster. One approach to re-ranking these faces is to cluster them based on visual similarity. However, obtaining ideal clustering results is impossible since these faces are high-dimensional data and the clusters have different shapes, sizes, and densities. Instead, a graph-based approach was proposed by Ozkan and Duygulu [16] in which the nodes are faces and edge weights are the similarities between two faces. With the observation that the nodes (faces) of the queried person are similar to each other and different from other nodes in the graph, the densest component of the full graph (the set of highly connected nodes in the graph) will correspond to the faces of the queried person. The main drawback of this approach is that it needs a threshold to convert the initial weighted graph to a binary graph. Choosing this threshold in high-dimensional spaces is difficult since different persons might have different optimal thresholds.

We use the idea of density-based clustering described by Ester et al. and Breunig et al. [11, 7] to solve this problem. Specifically, we define the local density score (LDS) of a point $p$ (i.e., a face) as its average similarity to its $k$ nearest neighbors:

$$\mathrm{LDS}(p, k) = \frac{\sum_{q \in R(p, k)} \mathrm{distance}(p, q)}{k}$$

where $R(p, k)$ is the set of $k$ nearest neighbors of $p$, and $\mathrm{distance}(p, q)$ is the similarity between $p$ and $q$ (despite its name, a larger value means that the two points are more alike).

Since faces are represented in a high-dimensional feature space, and face clusters might have different sizes, shapes, and densities, we do not directly use the Euclidean distance between two points in this feature space for $\mathrm{distance}(p, q)$. Instead, we use another similarity measure defined by the number of shared neighbors between two points. The effectiveness of this similarity measure for density-based clustering methods was described in [10].

$$\mathrm{distance}(p, q) = \frac{|R(q, k) \cap R(p, k)|}{k}$$

Therefore

$$\mathrm{LDS}(p, k) = \frac{\sum_{q \in R(p, k)} |R(q, k) \cap R(p, k)|}{k^2}$$

A high value of LDS(p, k) indicates a strong association between p and its neighbors. Therefore, we can use this local density score to rank faces. Faces with higher scores are considered to be potential candidates that are relevant to person-X, while faces with lower scores are considered as outliers and thus are potential candidates for non-person-X. Algorithm 1 describes these steps.

Algorithm 1: Rank-By-Local-Density-Score
Step 1: For each face p, compute LDS(p, k), where k is the number of neighbors of p and is the input of the ranking process.
Step 2: Rank these faces using LDS(p, k) (the higher the score, the more relevant).
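Algorithm 1 and the shared-neighbor formula above can be illustrated with a minimal sketch. It is not the authors' implementation: the brute-force neighbor search, the function names, and the use of Euclidean distance to pick the k nearest neighbors are assumptions made for illustration.

```python
import math

def euclid(p, q):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_nearest(points, k):
    """For each point index, return the set of indices of its k nearest
    neighbors (brute force, the point itself excluded). Assumption: the
    neighbor sets R(p, k) are built with Euclidean distance."""
    neighbors = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: euclid(p, points[j]))
        neighbors.append(set(others[:k]))
    return neighbors

def local_density_scores(points, k):
    """LDS(p, k) = sum over q in R(p, k) of |R(q, k) & R(p, k)|, divided by k^2."""
    R = k_nearest(points, k)
    return [sum(len(R[q] & R[p]) for q in R[p]) / (k * k)
            for p in range(len(points))]

def rank_by_local_density_score(points, k):
    """Indices sorted by decreasing LDS, i.e. most relevant candidates first."""
    scores = local_density_scores(points, k)
    return sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
```

On a toy set such as `[(0, 0), (0, 1), (1, 0), (1, 1), (10, 0), (0, 12), (-4, 14), (14, 6)]` with `k = 3`, the four tightly packed points are ranked ahead of the four scattered ones. The brute-force search costs on the order of N² distance computations, which is acceptable for a few thousand faces per query but would need an index for larger sets.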

3.2 Ranking by Bagging of SVM Classifiers

One limitation of ranking based on the local density score is that it cannot handle faces of another person that are strongly associated in the k-neighbor set (for example, many duplicates). Therefore, another step is proposed for handling this case. As a result, we have a model that can be used for both re-ranking current faces and predicting new incoming faces.

The main idea is to use a probabilistic model to measure the relevancy of a face to person-X, $P(\text{person-}X \mid \text{face})$. Since the labels are not available for training, we use the input ranked list found from the previous step to extract a subset of faces lying at the top and bottom of the ranked list to form the training set. After that, we use SVM with probabilistic output [17] implemented in LibSVM [8] to learn the person-X model. This model is applied to faces of the original set, and the output probabilistic scores are used to re-rank these faces. Since it is not guaranteed that faces lying at the two ends of the input ranked list correctly correspond to the faces of person-X and faces of non-person-X, we adopt the idea of a bagging framework [6], in which randomly selecting subsets to train weak classifiers and then combining these classifiers helps reduce the risk of using noisy training sets.

The Rank-By-Bagging-ProbSVM-InnerLoop method improves an input ranked list by combining weak classifiers trained on subsets annotated by that ranked list. The details are described in Algorithm 2.

Given an input ranked list, Rank-By-Bagging-ProbSVM-InnerLoop is used to improve this list. We repeat the process a number of times, whereby the ranked list output from the previous step is used as the input ranked list of the next step. In this way, the iterations significantly improve the final ranked list. The details are described in Algorithm 3.

Algorithm 2: Rank-By-Bagging-ProbSVM-InnerLoop
Step 1: Train a weak classifier, $h_i$.
  Step 1.1: Select a set $S_{pos}$ including p% of the top-ranked faces, and then randomly select a subset $S^*_{pos}$ from $S_{pos}$. Label the faces in $S^*_{pos}$ as positive samples.
  Step 1.2: Select a set $S_{neg}$ including p% of the bottom-ranked faces, and then randomly select a subset $S^*_{neg}$ from $S_{neg}$. Label the faces in $S^*_{neg}$ as negative samples.
  Step 1.3: Use $S^*_{pos}$ and $S^*_{neg}$ to train the weak classifier, $h_i$, using LibSVM [8] with probability outputs.
Step 2: Compute the ensemble classifier $H_i = \sum_{j=1}^{i} h_j$.
Step 3: Apply $H_i$ to the original face set and form the ranked list, $Rank_i$, using the output probabilistic scores.
Step 4: Repeat Steps 1 to 3 until $Dist2RankList(Rank_{i-1}, Rank_i) \leq \epsilon$.
Step 5: Return $H_i = \sum_{j=1}^{i} h_j$.

Algorithm 3: Rank-By-Bagging-ProbSVM-OuterLoop
Step 1: $Rank_{cur}$ = Rank-By-Bagging-ProbSVM-InnerLoop($Rank_{prev}$).
Step 2: $dist = Dist2RankList(Rank_{prev}, Rank_{cur})$.
Step 3: $Rank_{final} = Rank_{cur}$.
Step 4: $Rank_{prev} = Rank_{cur}$.
Step 5: Repeat Steps 1 to 4 until $dist \leq \epsilon$.
Step 6: Return $Rank_{final}$.
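The control flow of Algorithms 2 and 3 can be sketched as follows. This is only an illustration under several assumptions: the weak learner `train_centroid_scorer` is a hypothetical nearest-centroid stand-in and not the probabilistic SVM from LibSVM used in the paper, the first comparison in the inner loop is made against the input ranked list, and the `max_iter` caps are added as a safety net. The defaults for `p`, `ratio`, and `eps` follow the values reported in section 4.4.

```python
import math
import random
from itertools import combinations

def norm_kendall_tau(a, b):
    """Fraction of item pairs that rankings a and b order differently."""
    n = len(a)
    if n < 2:
        return 0.0
    pa = {x: i for i, x in enumerate(a)}
    pb = {x: i for i, x in enumerate(b)}
    bad = sum((pa[x] - pa[y]) * (pb[x] - pb[y]) < 0 for x, y in combinations(a, 2))
    return bad / (n * (n - 1) / 2)

def euclid(p, q):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(p, q)))

def train_centroid_scorer(pos, neg):
    """Hypothetical stand-in weak learner (NOT an SVM): pseudo-probability
    from the relative distance to the positive and negative centroids."""
    cp = tuple(sum(c) / len(pos) for c in zip(*pos))
    cn = tuple(sum(c) / len(neg) for c in zip(*neg))
    def prob(x):
        dp, dn = euclid(x, cp), euclid(x, cn)
        return 0.5 if dp + dn == 0 else dn / (dp + dn)
    return prob

def bagging_inner_loop(faces, rank, p=0.2, ratio=0.7, eps=0.05, max_iter=50,
                       train=train_centroid_scorer, rng=None):
    """Algorithm 2 sketch: refine `rank` (face indices, best first)."""
    rng = rng or random.Random(0)
    m = max(1, int(p * len(rank)))
    s_pos, s_neg = rank[:m], rank[-m:]        # labeled by the input ranked list
    size = max(1, int(ratio * m))
    weak, prev = [], list(rank)
    for _ in range(max_iter):
        pos = [faces[i] for i in rng.choices(s_pos, k=size)]  # with replacement
        neg = [faces[i] for i in rng.choices(s_neg, k=size)]
        weak.append(train(pos, neg))                          # weak classifier h_i
        score = [sum(h(f) for h in weak) for f in faces]      # H_i = sum of h_j
        cur = sorted(range(len(faces)), key=lambda i: score[i], reverse=True)
        converged = norm_kendall_tau(prev, cur) <= eps
        prev = cur
        if converged:
            break
    return prev

def bagging_outer_loop(faces, rank, eps=0.05, max_iter=20, **kwargs):
    """Algorithm 3 sketch: feed each refined list back into the inner loop."""
    for _ in range(max_iter):
        new = bagging_inner_loop(faces, rank, eps=eps, **kwargs)
        converged = norm_kendall_tau(rank, new) <= eps
        rank = new
        if converged:
            break
    return rank
```

Passing a different `train` callable (for example, a wrapper around a probabilistic SVM) would bring the sketch closer to the method described above; the loop structure stays the same.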

To determine the number of iterations of Rank-By-Bagging-ProbSVM-InnerLoop and Rank-By-Bagging-ProbSVM-OuterLoop, we use the Kendall tau distance [13], which is a metric that counts the number of pairwise disagreements between two lists. The larger the distance, the more dissimilar the two lists are. The Kendall tau distance between two lists, $\tau_1$ and $\tau_2$, is defined as follows:

$$K(\tau_1, \tau_2) = \sum_{(i,j) \in P} K_{i,j}(\tau_1, \tau_2)$$

where $P$ is the set of unordered pairs of distinct elements in $\tau_1$ and $\tau_2$. $K_{i,j}(\tau_1, \tau_2) = 0$ if $i$ and $j$ are in the same order in $\tau_1$ and $\tau_2$, and $K_{i,j}(\tau_1, \tau_2) = 1$ if $i$ and $j$ are in the opposite order in $\tau_1$ and $\tau_2$.

Since the maximum value of $K(\tau_1, \tau_2)$ is $N(N-1)/2$, where $N$ is the number of members of the list, the normalized Kendall tau distance can be written as follows:

$$K_{norm}(\tau_1, \tau_2) = \frac{K(\tau_1, \tau_2)}{N(N-1)/2}$$

We use this measure as the stopping criterion for both loops: if the ranked list does not change significantly after a number of iterations, it is reasonable to stop.
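As a small illustration of the stopping criterion, the normalized Kendall tau distance defined above can be computed directly from its definition. The function names are chosen here for illustration, and the quadratic pair enumeration is a simplification that is adequate for lists of a few thousand items.

```python
from itertools import combinations

def kendall_tau_distance(rank_a, rank_b):
    """Number of item pairs that appear in opposite order in the two lists.
    Both lists are assumed to contain the same distinct items."""
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    return sum(
        1
        for x, y in combinations(rank_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )

def normalized_kendall_tau(rank_a, rank_b):
    """Kendall tau distance divided by its maximum value N(N - 1) / 2."""
    n = len(rank_a)
    if n < 2:
        return 0.0
    return kendall_tau_distance(rank_a, rank_b) / (n * (n - 1) / 2)
```

Two identical lists give a distance of 0, a list and its reverse give 1, and swapping one adjacent pair in a four-item list gives 1/6.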

4. Experiments

4.1 Dataset

We used the dataset described by Berg et al. [4] for our experiments. This dataset consists of approximately half a million news photos and captions from Yahoo News collected over a period of roughly two years. This dataset is better suited to our purpose than datasets collected from image search engines such as Google, which usually limit the total number of returned images to 1,000. Furthermore, it has annotations that are valuable for evaluating methods. Note that these annotations are used for evaluation purposes only. Our method is fully unsupervised, so it assumes the annotations are not available at running time.

Only frontal faces were considered since current frontal face detection systems [19] work in real time and have accuracies exceeding 95%. 44,773 faces were detected and normalized to the size of 86×86 pixels.

We selected fifteen government leaders, including George W. Bush (US), Vladimir Putin (Russia), Jiang Zemin (China), Tony Blair (UK), Junichiro Koizumi (Japan), Roh Moo-hyun (Korea), Abdullah Gul (Turkey), and other key individuals, such as John Paul II (the former Pope) and Hans Blix (UN), because their images frequently appear in the dataset [16]. Variations in each person's name were collected. For example, George W. Bush, President Bush, U.S. President, etc., all refer to the U.S. president at the time.

We performed a simple string search in captions to check whether a caption contained one of these names. The faces extracted from the corresponding image associated with this caption were returned. The faces retrieved from the different name queries were merged into one set and used as input for ranking.

Figure 5 shows the distribution of retrieved faces from this method and the corresponding number of relevant faces for these fifteen individuals. In total, 5,603 faces were retrieved, of which 3,374 were relevant. Overall, the precision of this text-based retrieval was 60.22%.

4.2 Face Processing

We used an eye detector to detect the positions of the eyes of the detected faces. The eye detector, built with the same approach as that of Viola and Jones [19], had an accuracy of more than 95%. If the eye positions were not detected, predefined eye locations were assigned. The eye positions were used to align faces to a predefined canonical pose.

To compensate for illumination effects, the subtraction of the best-fit brightness plane followed by histogram equalization was applied. This normalization process is shown in Figure 6.

Figure 5. Distribution of retrieved faces and relevant faces of the 15 individuals used in experiments. Due to space limitations, the bars corresponding to George Bush (2,282 vs. 1,284) and Tony Blair (682 vs. 323) were cut off at the upper limit of the graph.

We then used principal component analysis [18] to reduce the number of dimensions of the feature vector for face representation. Eigenfaces were computed from the original face set returned using the text-based query method. The number of eigenfaces used to form the eigenspace was selected so that 97% of the total energy was retained [5]. The number of dimensions of these feature spaces ranged from 80 to 500.
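The 97% energy criterion can be made concrete with a short sketch. It assumes that "energy" means the sum of the PCA eigenvalues and that the eigenvalues have already been computed elsewhere; the function name and the example values are hypothetical.

```python
def components_for_energy(eigenvalues, energy=0.97):
    """Smallest number of leading eigenvalues whose sum reaches the given
    fraction of the total. Assumes non-negative eigenvalues, as produced
    by PCA on a covariance matrix."""
    vals = sorted(eigenvalues, reverse=True)
    total = sum(vals)
    running = 0.0
    for m, v in enumerate(vals, start=1):
        running += v
        if running >= energy * total:
            return m
    return len(vals)
```

For example, with eigenvalues `[5, 3, 1, 1]`, retaining 80% of the energy needs two components, while retaining 97% needs all four.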

Figure 6. Face normalization. (top) Faces with detected eyes; (bottom) faces after the normalization process.

4.3 Evaluation Criteria

We evaluated the retrieval performance with measures that are commonly used in information retrieval, such as precision, recall, and average precision. Given a queried person, let $N_{ret}$ be the total number of faces returned, $N_{rel}$ the number of relevant faces among the returned faces, and $N_{hit}$ the total number of relevant faces for that person. Recall and precision can then be calculated as follows:

$$\mathrm{Recall} = \frac{N_{rel}}{N_{hit}}$$

$$\mathrm{Precision} = \frac{N_{rel}}{N_{ret}}$$
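As a quick illustration of these two measures, the sketch below computes them for an unordered set of returned faces. It assumes faces are identified by hashable IDs and takes $N_{rel}$ to be the number of relevant faces among those returned; the function name is hypothetical.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a set of returned face IDs,
    given the set of all face IDs that are truly relevant."""
    returned, relevant = set(returned), set(relevant)
    n_rel = len(returned & relevant)      # relevant faces among those returned
    precision = n_rel / len(returned) if returned else 0.0
    recall = n_rel / len(relevant) if relevant else 0.0
    return precision, recall
```

For instance, returning faces `{1, 2, 3, 4}` when the relevant faces are `{2, 4, 5}` gives a precision of 0.5 and a recall of 2/3.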

Precision and recall are only used to evaluate the quality of an unordered set of retrieved faces. To evaluate ranked lists in which both recall and precision are taken into account, average precision is usually used. The average precision is computed by taking the average of the interpolated precision measured at the 11 recall levels of 0.0, 0.1, 0.2, ..., 1.0.

The interpolated precision $p_{interp}$ at a certain recall level $r$ is defined as the highest precision found for any recall level $r' \geq r$:

$$p_{interp}(r) = \max_{r' \geq r} p(r')$$

In addition, to evaluate the performance of multiple queries, we used mean average precision, which is the mean of the average precisions computed from the queries³.
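The 11-point interpolated average precision and its mean over queries can be sketched as follows. The input format (a list of booleans in ranked order) and the function names are assumptions made for illustration; when no point reaches a given recall level, the interpolated precision is taken to be 0.

```python
def interpolated_average_precision(ranked_relevance, n_relevant_total=None):
    """11-point interpolated average precision.

    ranked_relevance: booleans in ranked order (True = relevant face).
    n_relevant_total: total number of relevant faces; defaults to the
    number of True entries in the list."""
    total = sum(ranked_relevance) if n_relevant_total is None else n_relevant_total
    if total == 0:
        return 0.0
    points, hits = [], 0          # (recall, precision) after each rank position
    for i, rel in enumerate(ranked_relevance, start=1):
        hits += bool(rel)
        points.append((hits / total, hits / i))
    levels = [i / 10 for i in range(11)]
    interp = [max((p for r, p in points if r >= level), default=0.0)
              for level in levels]
    return sum(interp) / len(levels)

def mean_average_precision(rankings):
    """Mean of the average precisions of several queries."""
    return sum(interpolated_average_precision(r) for r in rankings) / len(rankings)
```

A ranking that places every relevant face first scores 1.0, while a two-item ranking with the single relevant face in second place scores 0.5.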

4.4 Parameters

The parameters of our method include:

• p: the fraction of faces at the top and bottom of the ranked list that are used to form the positive set $S_{pos}$ and the negative set $S_{neg}$ for training weak classifiers in Rank-By-Bagging-ProbSVM-InnerLoop. We empirically selected p = 20% (i.e., 40% of the samples in the ranked list were used), since a larger p increases the number of incorrect labels and a smaller p causes over-fitting. In addition, $S^*_{pos}$ consists of $0.7 \times |S_{pos}|$ samples that are selected randomly with replacement from $S_{pos}$. This sampling strategy is adopted from the bagging framework [6]. The same setting was used for $S^*_{neg}$.

• ε: the maximum normalized Kendall tau distance $K_{norm}(\tau_1, \tau_2)$ between two ranked lists $\tau_1$ and $\tau_2$. This value is used to determine when the inner loop and the outer loop stop. We set ε = 0.05 to balance accuracy and processing time. Note that a smaller ε requires more iterations, which slows the system down.

• kernel: the kernel type used for the SVM. The default is a linear kernel, defined as $k(x, y) = x^{\top} y$. We tested other kernel types, such as RBF and polynomial, but the performance did not change much. Therefore, we used the linear kernel for simplicity.

³ http://trec.nist.gov/pubs/trec10/appendices/measures.pdf

4.5 Results

4.5.1 Performance Comparison with Existing Approaches

We compared our proposed method with the following existing approaches.

• Text Based Baseline (TBL): Once faces corresponding to images whose captions contain the query name are returned, they are ranked in time order. This is a rather naive method in which no prior knowledge between names and faces is used.

• Distance-Based Outlier (DBO): We adopted the idea of distance-based outlier detection for ranking [14]. Given a threshold $d_{min}$, for each point p, we counted the number of points q such that $dist(p, q) \leq d_{min}$, where dist(p, q) is the Euclidean distance between p and q in the feature space mentioned in section 4.2. This number was then used as the score to rank faces. We selected a range of $d_{min}$ values for experiments: $d_{min}$ = 10, 15, 20, ..., 90.

• Densest Sub-Graph based Method (DSG): We re-implemented the densest sub-graph based method [16] for ranking. Once the densest subgraph was found after an edge elimination process, we counted the number of surviving edges of each node (i.e., face) and used this number as the ranking score. To form the graph, the Euclidean distance dist(p, q) was used to assign the weight for the edge linking node p and node q. DSG requires a threshold θ to convert the weighted graph to the binary graph before searching for the densest subgraph. We selected a range of θ values that are the same as the values used in DBO: θ = 10, 15, 20, ..., 90.

• Local Density Score (LDS): This is the first stage of our proposed method. It requires the input value k to compute the local density score. Since we do not know the number of returned faces from text-based search engines in advance, we used another input value, fraction, defined as the fraction of neighbors, and estimated k by the formula k = fraction × N, where N is the number of returned faces. We used a range of fraction values for experiments: fraction = 5%, 10%, 15%, ..., 50%. For a large number of returned faces, we capped k at a maximum value of 200.

• Unsupervised Ensemble Learning Using Local Density Score (UEL-LDS): This is the combination of the two stages: faces are first ranked by local density scores, and the resulting ranked list is then used for training a classifier to improve the ranked list.

• Supervised Learning (SVM-SUP): We randomly selected a portion p of the data with annotations to train the classifier, and then used this classifier to re-rank the remaining faces. This process was repeated five times and the average performance was reported. We used a range of portion p values for experiments: p = 1%, 2%, 3%, ..., 5%.
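Two of the simpler quantities in the list above, the DBO ranking score and the estimate of k from fraction, can be sketched as follows. The function names are hypothetical; excluding the point itself from the count, truncating to an integer, and enforcing a lower bound of 1 on k are assumptions not stated in the text.

```python
import math

def dbo_scores(points, d_min):
    """Distance-based outlier score: for each point, the number of other
    points within Euclidean distance d_min (higher means denser)."""
    def euclid(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    return [
        sum(1 for j, q in enumerate(points) if j != i and euclid(p, q) <= d_min)
        for i, p in enumerate(points)
    ]

def neighbors_from_fraction(n_faces, fraction, k_max=200):
    """k = fraction * N, capped at k_max (200 in the experiments)."""
    return max(1, min(k_max, int(fraction * n_faces)))
```

The DBO score depends on an absolute distance threshold, whereas k is tied to the size of the returned set, which is one way to read the difference in threshold sensitivity between the two families of methods.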

Figure 7. Performance comparison of methods. Due to different settings, performances are superimposed for better evaluation.

Figure 7 shows a performance comparison of these methods. Our proposed methods (LDS and UEL-LDS) outperform other unsupervised methods such as TBL, DBO, and DSG. Furthermore, the performance of the DBO and DSG methods is sensitive to the distance threshold, while the performance of our proposed method is less sensitive. This confirms that the similarity measure using shared nearest neighbors is reliable for estimating the local density score. The performance of UEL-LDS is only slightly better than that of LDS since the training sets labeled automatically from the ranked list are noisy. However, UEL-LDS improves significantly even when the performance of LDS is poor. Both methods perform worse than SVM-SUP, which uses a small number of labeled samples.

Figure 8 shows an example of the top 50 faces ranked using the TBL, DBO, DSG and LDS methods. The performance of DBO is poor since a low threshold is used: irrelevant faces that are near duplicates (rows 2 and 3 in Figure 8(b)) are ranked higher than relevant faces. The same explanation applies to DSG.
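To make the DSG baseline and its sensitivity to near duplicates more concrete, here is a rough Python sketch. It thresholds pairwise Euclidean distances into a binary graph, finds a dense subgraph by greedy peeling in the spirit of [9], and scores each node by its number of surviving edges. The function name `dsg_scores`, the edge rule dist(p, q) < θ, and the use of greedy peeling as the edge elimination process are assumptions made for illustration; they are not confirmed details of the implementation in [16] or of our re-implementation.

```python
from math import dist


def dsg_scores(points, theta):
    """Score each point by its degree inside a greedily found dense subgraph."""
    n = len(points)
    # Binary graph: an edge exists when the distance is below theta (assumed rule).
    adj = [
        {j for j in range(n) if j != i and dist(points[i], points[j]) < theta}
        for i in range(n)
    ]
    alive = set(range(n))
    deg = [len(a) for a in adj]          # degree restricted to alive nodes
    n_edges = sum(deg) // 2
    best_density, best_nodes = -1.0, set(alive)

    # Greedy peeling: repeatedly drop a minimum-degree node and remember
    # the intermediate subgraph with the highest density |E| / |V|.
    while alive:
        density = n_edges / len(alive)
        if density > best_density:
            best_density, best_nodes = density, set(alive)
        v = min(alive, key=lambda u: (deg[u], u))
        alive.remove(v)
        for u in adj[v]:
            if u in alive:
                deg[u] -= 1
                n_edges -= 1

    # Ranking score: number of surviving edges of each node.
    return [len(adj[i] & best_nodes) if i in best_nodes else 0 for i in range(n)]


# Four near-duplicate points and five mutually distant points.
near_duplicates = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
scattered = [(10.0, 0.0), (20.0, 0.0), (30.0, 0.0), (40.0, 0.0), (50.0, 0.0)]
print(dsg_scores(near_duplicates + scattered, theta=1.0))
# [3, 3, 3, 3, 0, 0, 0, 0, 0]
```

In this toy input the near duplicates form the only dense subgraph under a low threshold and therefore receive the top scores regardless of whether they are relevant, which mirrors the failure case visible in Figure 8.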

4.5.2 Performance of Ensemble Classifiers

In Figure 9, we show the performance of five single classifiers and that of five ensemble classifiers. The ensemble classifier k is formed by combining single classifiers from 1 to k. The figure clearly indicates that the ensemble classifier is more stable than single weak classifiers.

| Method | Precision at top 20 (%) | Recall (%) | Precision (%) |
|---|---|---|---|
| GoogleSE | 79.33 | 100.00 | 57.08 |
| UEL-LDS | 89.00 | 72.50 | 76.41 |
| SVM-SUP-05 | 85.00 | 73.14 | 76.46 |
| SVM-SUP-10 | 90.67 | 74.94 | 78.30 |

Table 1. Comparison of different methods on the new test set returned by Google Image Search Engine.
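The construction of ensemble classifier k from single classifiers 1 to k (Section 4.5.2) can be sketched as a running combination of per-classifier probability outputs. Averaging the probabilities is an assumption chosen for illustration, and the name `ensemble_probabilities` is hypothetical.

```python
def ensemble_probabilities(single_probs):
    """Build ensemble outputs from single-classifier outputs.

    single_probs[i][j] is the probability, from single classifier i + 1,
    that face j is relevant. The returned list holds, at index k - 1,
    the mean probability over single classifiers 1..k (assumed rule).
    """
    n_faces = len(single_probs[0])
    running_sum = [0.0] * n_faces
    ensembles = []
    for k, probs in enumerate(single_probs, start=1):
        running_sum = [s + p for s, p in zip(running_sum, probs)]
        ensembles.append([s / k for s in running_sum])
    return ensembles


# Three single classifiers scoring two faces.
outputs = ensemble_probabilities([[0.9, 0.2], [0.5, 0.4], [0.7, 0.0]])
```

Under this rule, ensemble 1 coincides with single classifier 1, and each later ensemble smooths out the fluctuations of an individual weak classifier, which is one way to read the stability observed in Figure 9.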

4.5.3 New Face Annotation

We conducted another experiment to show the effectiveness of our approach, in which the learned models are used to annotate new faces from other databases. We used each name in the list as a query to obtain the top 500 images from the Google Image Search Engine (GoogleSE). Next, these images were processed using the steps described in section 4.2: extracting faces, detecting eyes and normalizing. We projected these faces onto the PCA subspace trained for that name and used the learned model to re-rank the faces.
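The last step above, projecting new faces onto a per-name PCA subspace and re-ranking them with the learned model, can be sketched as follows. The names `project` and `rerank`, the representation of the subspace as a mean vector plus a list of basis vectors, and the `relevance` callable standing in for the learned classifier are all hypothetical placeholders.

```python
def project(face, mean, components):
    """Project a face vector onto a PCA subspace (mean-center, then dot products)."""
    centered = [x - m for x, m in zip(face, mean)]
    return [sum(c * v for c, v in zip(comp, centered)) for comp in components]


def rerank(faces, mean, components, relevance):
    """Return face indices sorted by decreasing relevance score.

    `relevance` is a placeholder for the learned model: it maps a projected
    face to a score such as the classifier's probability output.
    """
    scored = [
        (relevance(project(face, mean, components)), i)
        for i, face in enumerate(faces)
    ]
    return [i for _, i in sorted(scored, key=lambda t: -t[0])]


# Toy example: 2-D "faces", a 1-D subspace, and a score equal to the projection.
faces = [[1.0, 1.0], [3.0, 1.0], [2.0, 5.0]]
order = rerank(faces, mean=[1.0, 1.0], components=[[1.0, 0.0]],
               relevance=lambda z: z[0])
```

The sketch only illustrates the data flow; in the experiment the subspace and the model are those trained for the queried name.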

There were 4,103 faces (including false positives, i.e., non-faces detected as faces) detected from 7,500 returned images. We manually labeled these faces and there were 2,342 relevant faces. On average, the accuracy of the GoogleSE is 57.08%.

In Table 1, we compare the performance of the methods. The performance of UEL-LDS was obtained by running the best system, which is shown as the peak of the UEL-LDS curve in Figure 7. The performances of SVM-SUP-05 and SVM-SUP-10 correspond to the supervised systems (cf. section 4.5.1) that used p = 5% and p = 10% of the data set, respectively. We evaluated the performance by calculating the precision at the top 20 returned faces, which is a common measure for image search engines, as well as the recall and precision on all detected faces of the test set. UEL-LDS achieved performance comparable to that of the supervised methods and outperformed the baseline GoogleSE. The precision at the top 20 of SVM-SUP-05 is poorer than that of UEL-LDS due to the small number of training samples. Figure 10 shows the top 20 faces ranked using these two methods.
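The three measures reported in Table 1 can be written down in a few lines of Python. The helper names are illustrative, and treating GoogleSE as accepting every detected face is an interpretation, although it is consistent with the reported recall of 100% and with 2,342 relevant faces out of 4,103 giving a precision of 57.08%.

```python
def precision_at_k(ranked_relevance, k=20):
    """Fraction of relevant faces among the top k of a ranked list."""
    top = ranked_relevance[:k]
    return sum(top) / len(top)


def precision_recall(accepted, relevant):
    """Precision and recall of a set of accepted faces.

    accepted[j] is True if face j is returned as relevant;
    relevant[j] is the manual ground-truth label.
    """
    true_pos = sum(1 for a, r in zip(accepted, relevant) if a and r)
    n_accepted = sum(accepted)
    n_relevant = sum(relevant)
    precision = true_pos / n_accepted if n_accepted else 0.0
    recall = true_pos / n_relevant if n_relevant else 0.0
    return precision, recall


# Baseline check with the counts reported in the text (assumed accept-all baseline).
labels = [True] * 2342 + [False] * (4103 - 2342)
baseline_precision, baseline_recall = precision_recall([True] * 4103, labels)
print(round(100 * baseline_precision, 2), round(100 * baseline_recall, 2))
# 57.08 100.0
```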

5 Discussion

Our approach works fairly well for well-known people, for whom the main assumption, that text-based search engines return a large fraction of relevant images, is satisfied. Figure 12 shows an example where this assumption does not hold. Consequently, as shown in Figure 13, the model learned from this set performed poorly in recognizing new faces returned by GoogleSE. Our approach relies solely on the above assumption; therefore, it is not affected by the ranking of text-based search engines.

The iteration of bagging SVM classifiers does not guarantee a significant improvement in performance. The aim of our future work is to study how to improve the quality of the training sets used in this iteration.

6 Conclusion

We presented a method for ranking faces retrieved using text-based correlation methods in searches for a specific person. This method learns the visual consistency among faces in a two-stage process. In the first stage, a relative density score is used to form a ranked list in which faces ranked at the top or bottom of the list are likely to be relevant or irrelevant faces, respectively. In the second stage, a bagging framework is used to combine weak classifiers trained on subsets labeled from the ranked list into a strong classifier. This strong classifier is then applied to the original set to re-rank faces on the basis of the output probabilistic scores. Experiments on various face sets showed the effectiveness of this method. Our approach is beneficial when there are several faces in a returned image, as shown in Figure 11.

References

[1] O. Arandjelovic and A. Zisserman. Automatic face recognition for film character retrieval in feature-length films. In Proc. Intl. Conf. on Computer Vision and Pattern Recognition, volume 1, pages 860–867, 2005.

[2] M. S. Bartlett, J. R. Movellan, and T. J. Sejnowski. Face recognition by independent component analysis. IEEE Transactions on Neural Networks, 13(6):1450–1464, Nov 2002.

[3] T. L. Berg, A. C. Berg, J. Edwards, and D. A. Forsyth. Who’s in the picture? In Advances in Neural Information Processing Systems, 2004.

[4] T. L. Berg, A. C. Berg, J. Edwards, M. Maire, R. White, Y. W. Teh, E. G. Learned-Miller, and D. A. Forsyth. Names and faces in the news. In Proc. Intl. Conf. on Computer Vision and Pattern Recognition, volume 2, pages 848–854, 2004.

[5] D. Bolme, R. Beveridge, M. Teixeira, and B. Draper. The CSU face identification evaluation system: Its purpose, features and structure. In International Conference on Vision Systems, pages 304–311, 2003.

[6] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.


[7] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD), pages 93–104, 2000.

[8] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[9] M. Charikar. Greedy approximation algorithms for finding dense components in a graph. In APPROX ’00: Proceedings of the Third International Workshop on Approximation Algorithms for Combinatorial Optimization, pages 84–95. Springer-Verlag, 2000.

[10] L. Ertoz, M. Steinbach, and V. Kumar. Finding clusters of different sizes, shapes, and densities in noisy high dimensional data. In SIAM International Conference on Data Mining, pages 47–58, 2003.

[11] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (SIGKDD), pages 226–231, 1996.

[12] R. Fergus, P. Perona, and A. Zisserman. A visual category filter for Google images. In Proc. Intl. European Conference on Computer Vision, volume 1, pages 242–256, 2004.

[13] M. Kendall. Rank Correlation Methods. Charles Griffin Company Limited, 1948.

[14] E. M. Knorr, R. T. Ng, and V. Tucakov. Distance-based outliers: Algorithms and applications. VLDB Journal: Very Large Data Bases, 8(3-4):237–253, 2000.

[15] L.-J. Li, G. Wang, and L. Fei-Fei. Optimol: automatic online picture collection via incremental model learning. In Proc. Intl. Conf. on Computer Vision and Pattern Recognition, volume 2, pages 1–8, 2007.

[16] D. Ozkan and P. Duygu. A graph based approach for naming faces in news photos. In Proc. Intl. Conf. on Computer Vision and Pattern Recognition, volume 2, pages 1477–1482, 2006.

[17] J. Platt. Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61–74, 1999.

[18] M. Turk and A. Pentland. Face recognition using eigenfaces. In Proc. Intl. Conf. on Computer Vision and Pattern Recognition, 1991.

[19] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proc. Intl. Conf. on Computer Vision and Pattern Recognition, volume 1, pages 511–518, 2001.

[20] T.-F. Wu, C.-J. Lin, and R. C. Weng. Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research, 5:975–1005, 2004.

[21] W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld. Face recognition: A literature survey. ACM Computing Surveys, 35(4):399–458, 2003.

(a) - TBL - 11 irrelevant faces

(b) - DBO - 17 irrelevant faces

(c) - DSG - 18 irrelevant faces

(d) - LDS - 4 irrelevant faces

Figure 8. Top 50 faces ranked by the methods TBL, DBO, DSG and LDS. Irrelevant faces are marked with a star.


Figure 9. Performance of the ensemble classifiers and single classifiers.

(a) - 5 irrelevant faces

(b) - no irrelevant faces

Figure 10. Top 20 faces ranked by Google Image Search Engine (a) and ranked using our learned model (b). Irrelevant faces are marked with a star.

Figure 11. Image returned by GoogleSE for query ’Gerhard Schroeder’. GoogleSE was unable to accurately identify who the queried person was, while the learned model of our approach accurately identified him.

Figure 12. Example in which the portion of relevant faces is dominant, but it is difficult to group all these faces into one cluster due to large facial variations. In feature space, the largest cluster formed from relevant faces is not the largest cluster among those formed from all returned faces. Irrelevant faces are marked with a star.

Figure 13. Many irrelevant faces annotated using the model learned from the data set shown in Figure 12. Irrelevant faces are marked with a star.
