Nonparametric Part Transfer for Fine-grained...

Nonparametric Part Transfer for Fine-grained Recognition

Christoph Goring, Erik Rodner, Alexander Freytag, and Joachim Denzler∗

Computer Vision Group, Friedrich Schiller University Jenawww.inf-cv.uni-jena.de

Abstract

In the following paper, we present an approach for fine-grained recognition based on a new part detection method.In particular, we propose a nonparametric label transfertechnique which transfers part constellations from objectswith similar global shapes. The possibility for transfer-ring part annotations to unseen images allows for copingwith a high degree of pose and view variations in scenar-ios where traditional detection models (such as deformablepart models) fail. Our approach is especially valuable forfine-grained recognition scenarios where intraclass varia-tions are extremely high, and precisely localized featuresneed to be extracted. Furthermore, we show the importanceof carefully designed visual extraction strategies, such ascombination of complementary feature types and iterativeimage segmentation, and the resulting impact on the recog-nition performance. In experiments, our simple yet pow-erful approach achieves 35.9% and 57.8% accuracy on theCUB-2010 and 2011 bird datasets, which is the current bestperformance for these benchmarks.

1. Introduction

Within the last decade, research in visual object recog-nition mostly focused on category-level classification [12].Although useful for coarse distinctions between objectclasses, common approaches fail in scenarios where the dif-ferentiation has to be done on a finer level, i.e., only smallinterclass differences exist and specific details matter. Thisarea of research, which is referred to as subordinate or fine-grained recognition, has become recently popular in ourcommunity (see e.g. [27, 22, 40, 39, 11]), because it offersa variety of challenging applications, such as bird recogni-tion, tasks even difficult for human annotators [26].

A large majority of visual object recognition methodsare solely based on global histogram features containingstatistics of local features calculated in the whole image.

∗This work was partially supported by the Carl Zeiss AG through the“Pro-Excellence” initiative of the state of Thuringia, Germany.

An overwhelming number of publications justifies this ap-proach for diverse classification problems, e.g., differentiat-ing cars from persons and bicycles [18]. However, for sub-ordinate or fine-grained recognition, this approach is lim-ited, since general appearances are highly similar for differ-ent classes and only small differences at certain positionsallow for discrimination. Consequently, a suitable algo-rithm needs to detect important parts for a reliable distinc-tion – ideally independent of the current object pose. Part-based approaches can be exploited to tackle this goal, sincethey allow for computing approximate pose invariant fea-tures [22, 39, 41]. For example, with given ground-truthpart positions, the method of [3] is able to boost the perfor-mance up to 22% (14 classes subset, CUB-2011 dataset)compared to their method with automatic part detection.This highlights the importance of a highly accurate part de-tection method. Based on a part detection model, whethersupervised [22, 41, 42] or unsupervised [39, 42, 38], theextracted information for every part can be combined for afinal classification.

In this paper, we follow a part-based approach and showthat standard parametric models for detection [14, 3] are notsufficient for tackling the large variations present in fine-grained recognition tasks. Given this analysis as a moti-vation, we propose a simple nonparametric part detectionalgorithm based on label transfer, which we refer to as non-parametric part transfer in the following. Our method al-lows for flexible poses, missing part annotations, and ex-treme changes in viewpoint. For fine-grained recognition,we use detected parts to compute localized features thatfocus on regions where discriminative visual elements areexpected. Our running example throughout the paper willbe the distinction of bird species, although our approachis not restricted to this application at all. In combinationwith a multiple feature extraction pipeline and linear clas-sifiers, our approach achieves state-of-the-art performanceon both CUB-200 datasets [37, 36], which are the standarddatasets used in the field. In particular, we show the benefitsof iterative segmentation-based masking and a combinationof color and shape-based features for an off-the-shelf fine-grained recognition system. An outline is given in Figure 1.

1

www.inf-cv.uni-jena.de

Nearest Neighbor Search

Part Transfer

Feature Extracion(part regions)

PoolingResult

Classificationresult:

"Flycatcher"

Input

Classification

SVM

+

+ beak:

neck:

leg (left):

tail:

... ...

beak:

neck:

leg (left):

tail:

... ...

beak:

neck:

leg (left):

tail:

... ...

SVM

SVM

"Warb

ler"

"Flyca

tcher"

"Warb

ler"

"Flyca

tcher"

"Warb

ler"

"Flyca

tcher"

+

Figure 1. Visualization of our approach using part transfer for fine-grained visual categorization: training images visually similar to the testimages are selected and existing part annotations are transferred to extract features from part locations only. Final results are obtained bypooling individual classification scores for every transferred part configuration. The figure is best viewed in color and by zooming in.

First, we review related work of fine-grained classifi-cation followed by an analysis of parametric part detec-tion models and their ability to cope with large pose varia-tions. In section 3, we present our nonparametric part trans-fer method to increase the flexibility of object part mod-els. How to use part locations to focus fine grained featureextraction is depicted in section 4 together with importantingredients such as segmentation-based masking and clas-sifier ensembles. Experiments are given in section 5.

2. Related workFine-grained recognition with global features One wayto tackle fine-grained recognition is to directly apply visualbag-of-words approaches commonly used for standard ob-ject categorization. Due to the visual similarity betweenclasses, there is a high likelihood for having a significantamount of common visual words shared between classesthat do not help distinguishing the classes from each other.Therefore, Khan et al. [18] improve dictionary learning byfusing visual words from two different feature types. Chaiet al. [7] replace the histogram with powerful Fisher vec-tors. Other approaches do not use dictionaries at all to avoidquantization errors [24, 39]. Reducing background clutterwhich might interfere classification is presented in [6, 25]using segmentation techniques. In contrast, we directly usedetection results to restrict feature extraction to image re-gions that are likely to contain object parts. Following pre-vious approaches, additional global features are computedon a mask obtained using GrabCut [28], which also elimi-nates background artifacts.

Another line of work uses active classification tech-niques, which require human interaction during testing torefine classification results [35, 5]. However, our main aimin this paper is to analyze fully automatic recognition ap-proaches that do not require further human interaction, al-

though such an approach could be combined with activeclassification methods to refine the results.

Part-based fine-grained recognition Global approachesdiscard the position of features, which are crucial in thefine-grained case. Part-based approaches avoid this infor-mation loss by extracting features on detected object partsonly. Previous techniques extract unsupervised parts us-ing an ellipsoid to model the bird pose [13] and fuse partsusing specialized kernel functions [41]. In a recent pa-per, Zhang et al. [42] show how to include an increasingamount of supervision for training deformable part mod-els which then allows for pose normalization by comparingcorresponding parts. Parkhi et al. detect a single main part(in their case the head of a dog) to discover the rest of thebody. The features used for classification are individuallyextracted from head and body [27]. For very specific ap-plications, it is even possible to perform classification onlybased on a single main part without extending the area offeature extraction [22]. In contrast, our approach uses everypart available, a necessity when tackling bird recognition.

Exemplar models and label transfer Transferring labelinformation from training to test images has successfullybeen used in several computer vision applications. A promi-nent technique in this field are Exemplar-SVMs as intro-duced in [23]. The idea is to train a single SVM for ev-ery training image as positive sample and millions of nega-tive samples, thereby bridging parametric and nonparamet-ric modeling. Label information can then be transferredto new images from training samples with high detectionscores of corresponding SVMs. Our idea is similar in spiritbut avoids expensive training by using a nearest neighbortransfer technique.

Another line of research is label transfer for semanticsegmentation and scene understanding proposed in [20].Their idea is to transfer pixelwise labels from a set of K

nearest neighbors of the test images. The label transfer isbased on a relaxed SIFT matching algorithm, called SIFT-Flow. In our method, we transfer specific part positionsonly, because the annotation is not available on a pixel level.Furthermore, instead of merging the part locations of the Kmost similar images in the training set to a more precisepart detection result, we show how to build a multiple partfeature representations to boost classification performance.

Parallel to us, the work of [21] and [16] very recentlyalso demonstrated the power of part detections for fine-grained recognition. The paper of [21] builds on [2] for partdetection. Their methodology is quite different from ours.We do not learn single part detectors and fuse them, insteadwe perform a simple but very powerful global matching(without any part position optimization) and a subsequentensemble learning.

3. Nonparametric part transferIn the following, we first review parametric part detec-

tion models and analyze their underlying assumptions. Thisanalysis motivates then our new nonparametric extension,which allows for flexible part detection models able to copewith a large variety of views and object poses.

3.1. Analysis of parametric part models

One of the most common approaches for object detec-tion in 2D images is the deformable parts approach of [14]A deformable part model M is meant to be invariant to cer-tain object deformations by allowing parts of the objects tomove. It consists of a set of filters F = {w,w1, . . . ,wm}and a model for their spatial layout expressed as a defor-mation model d. The root filter w is intended to cover thewhole object and the remaining filters cover parts of the ob-ject. The combined detection score is calculated by:

fM (x) = maxz

fw(x)−D(z;d) +m∑

k=1

fwk(x+ zk) (1)

where z = (z1, . . . , zm) are the latent part locations, D isthe cost of z with respect to a learned deformation modeld, and b is a bias term. The entire model is usually learnedwith a latent SVM scheme and is described in detail in [14].

The deformation model of a DPM is given by the pa-rameters d = (d1, . . . ,dm,v1, . . . ,vm) and the cost de-fined with part locations zk relative to anchor positions vk

learned for each part1:

D(z;d) =m∑

k=1

[zxk , z

yk , (z

xk )

2, (zyk)2]· dk (2)

1The deformation model given here ignores the fact that in the originalDPM formulation a finer resolution is used for all parts. However, ouranalysis still holds and we only ignore this fact for the DPM review tosimplify the presentation.

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

normalized x−coordinate

no

rma

lize

d y

−c

oo

rdin

ate

part beak

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.90

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1


no

rmalized

y−

co

ord

inate

part tail

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9


no

rma

lize

d y

−c

oo

rdin

ate

part forehead

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9


no

rma

lize

d y

−c

oo

rdin

ate

part crown

Figure 2. Visualization of part offsets for the CUB-2011dataset [36].

This corresponds to a Gaussian model for the part locations,i.e., covariance matrices Sk ∈ R2 and mean vectors µk ∈R2 can be derived, such that the following holds:

D(z;d) ∝ − log

m∏k=1

N ((zxk , zyk) | µk,Sk) . (3)

The question remains whether this model is complexenough to capture the high variability encountered in fine-grained recognition applications. Hence, we analyzed partoffsets present in the CUB-2011 dataset [36] and visual-ized results for all bird species in Figure 2. The distribu-tion is clearly non-Gaussian, therefore, a single DPM modelwould not be able to model the variation present in the train-ing dataset. Note that in [14], parts and configurations arelearned in an unsupervised fashion. However, the analysisof the ”ground truth parts” provides a certain intuition forthe complexity of the problem tackled.

A common strategy to cope with different views is totrain a mixture of components, M = {M0,M1, ...}, inwhich case z is augmented to add the latent component la-bel that the example belongs to. However, as can be seen inour analysis, the distribution has a large number of differentcomponents and the data can not be easily clustered. There-fore, learning proper mixtures of deformable part models isan ill-posed problem and good generalization abilities cannot be expected a priori. Furthermore, not only is the de-formation model too restrictive, but also a single linear ap-pearance model is not flexible enough to capture the largevariability of birds.

3.2. Nearest neighbor part transfer

Our previous analysis motivates the development ofmore complex models able to tackle large intraclass vari-

test image (part transfer) nearest neighbourtest image (part transfer) nearest neighbour test image (part transfer) nearest neighbour

test image (part transfer) nearest neighbour

Figure 4. Example detections of our nonparametric part transfer technique together with the used nearest neighbor match.

Figure 3. Different bird poses present in CUB-2011.

ances, which are usually encountered in fine-grained recog-nition scenarios. In our running example, the distinction ofdifferent bird species, the high variation in part positionsarises from the large number of different poses of birds inimages, which are visualized for a single species in Fig-ure 3. For fine-grained recognition, part localization can bea crucial step. This is demonstrated by the results of [3],whose approach was able to boost recognition accuracy upto 22% (14 classes subset of the CUB-2011 dataset) whenbeing provided with the ground-truth part positions.

To overcome the limitations of linear detection mod-els, [33] proposed to use non-linear models to increase themodel complexity. However, these methods are costly dur-ing learning and prediction, despite the speed-up with ex-plicit feature transformations [30]. Therefore, we are fol-lowing an orthogonal line of research and learn not only asmall number of parametric models from the given trainingdata, but make use of a simple part transfer technique. Ourmethod searches for training images with an object shapesimilar to the current test image and then transfers the partannotations from them directly. Exemplary visual resultsobtained from the CUB-2011 dataset are given in Figure 4.

For training, the only thing that additionally needs to bedone is to compute a suitable feature representation, whichmainly focuses on the global shape of the object rather thanon color information for small part details. First, we usethe given bounding box of the object, which is availablefor the CUB-2011 dataset, to re-scale the image to a fixedsize of 256 × 256 pixels. After that, the overall layout isrepresented with histograms of oriented gradients (HOG),which have been shown to adequately describe the roughshape of an object [8].

For part detection during testing, we again compute aglobal HOG feature from the image region defined by theprovided bounding box.2 The K nearest neighbors with re-

2The CUB datasets also provide bounding box annotations for test im-ages and consequently our approach should be able to exploit this source ofinformation. Apart from this, bounding boxes are usually cheap to obtainand their usage is therefore no strong requirement.

spect to the HOG feature are then obtained from the train-ing set and part positions of the nearest neighbors are scaledproportionally to the bounding box of the test image. Sim-ilar to deformable part models, the size of the parts is fixedto a constant scale. In our case, we use squared parts with alength proportional to the diagonal d of the bounding box inthe test image. The transferred part positions together withthe scaled part sizes are the final result of our nonparametricpart detection method.

Following established techniques for data augmenta-tion [19, 14], we also add the flipped version of each train-ing image to the training set to allow for matches betweenimages of different orientation. In the case of a test imagematching a flipped image, part correspondences are handledproperly, e.g., the flipped left wing position will be the rightwing position of the flipped image.

We want to point out that our part detection techniqueis indeed related to the idea of Exemplar-SVM [23]. How-ever, instead of learning a single nonparametric HOG de-tector, the authors propose to use a trained detector for eachpositive example. A main motivation of their technique is toallow for transferring segmentation masks to test images. Incontrast, we transfer part annotations that are important tofacilitate fine-grained recognition later on. Furthermore, theexpensive learning stage of [23] is not necessary in our case,since we do not need to perform sliding-window detectionand hard negative mining for tackling the large number ofpossible negative examples.

4. Part feature representations

In the following, we show how to compute multiple fea-tures that capture shape as well as color information foreach of the parts. Furthermore, we show how to build en-sembles of part feature representations to allow for morerobust classification decisions.

4.1. Feature extraction for single parts

One of the main reasons for the difficulty of fine-grainedrecognition is that although small details in appearance mat-ter, a method still has to cope with a large intraclass vari-ance, e.g., different poses and view points for the taskof bird recognition. This trade-off is especially importantwhen carefully designing the feature extraction step.

By using the part-based approach introduced in sec-tion 3, we already obtained a large degree of invariance with

Figure 5. Example images of the CUB-2011 dataset [36] with cor-responding color name features as proposed by [32].

respect to different poses and view points: (i) features areextracted on relevant positions only and (ii) a mapping ofcorresponding parts is directly possible. However, we stillhave to deal with a large appearance diversity even within asingle part. This fact is visualized in Figure 3 by showingimage regions of parts for different images of a single birdspecies. Therefore, we follow established approaches by us-ing an unordered bag-of-visual-words model based on com-plimentary types of local features. As argued by [18], thecombination of color and texture information is importantfor the task of bird recognition, which is again our runningexample in this paper to analyze our part-based approach.Some bird species, for example the red-bellied woodpecker,have a characteristic head color (a fact which is contraryto what the name would suggest). Furthermore, texture orshape information can help to recognize birds with charac-teristic eyes or beaks. In detail, we use two types of localfeatures to capture color and texture/shape information.Local shape and color descriptors Instead of directlyusing the RGB values within a local pixel neighborhood,Weijer et al. [32] proposed to map colors to a space spannedby a basis of L colors. The colors have a correspondence inthe English language, such as red, orange, blue, and yellow(L = 11 was used in their paper). In detail, a color c ismapped to the probability vector [p(t`|c)]L`=1 stating howlikely c would be described with the color name t`. Proba-bilities are estimated using images obtained by Google im-age search. Figure 5 shows some results of the color namerepresentation, where only the most probable color proto-type is shown. In particular, it is interesting to see how thedominant color of the bird is captured, e.g., the blue coloredbody in the first example and the red colored back of thehead in the second example.

In addition, we capture coarse shape information withOpponentSIFT descriptors presented in [31] by using gra-dient orientation histograms similar to the common SIFTdescriptor but obtained from the images converted in theopponent color space.Part-specific codebooks Codebooks for the bag-of-visual-word-models are learned for each part individually.The intuition is that this allows for learning prototypical

Figure 6. GrabCut segmentation results provide a mask for globalfeature extraction.

local elements specific for each part, e.g., elements of abeak. Part-specific codebooks are created using k-meansclustering of local features extracted from the correspond-ing part only. Consequently, each part is represented by twohistograms, one for the color features and one for Oppo-nentSIFT descriptors, which are both normalized and finallyconcatenated.

4.2. Classification with part feature ensembles

As introduced before, parts detected with our nonpara-metric technique (see section 3) are represented with com-bined features. For a transferred part configuration, wethereby obtain m combined histograms for m annotatedparts. However, not all parts have to be visible in an im-age due to the relative position of bird and camera (see Fig-ure 7 for some examples). In absence of further knowl-edge, we use zero imputation [29]. We also studied othermore sophisticated imputation methods, but did not observea significant difference in terms of resulting recognition per-formance. To finally fuse information of all parts, severalquite general possibilities exist: (i) late binding: every partis modeled by its own classifier and classification scores ofall parts are combined via (weighted) pooling, or (ii) earlybinding: features are concatenated and a single classifier istrained using all information simultaneously. Due to the ad-vantage of early binding to exploit dependencies betweenthe different parts, our final part representation is the con-catenation of all part features.

Although this sounds promising so far, we are still notdone yet: features based on part detections transferred fromonly a single nearest neighbor will usually not be robustenough with respect to wrong matches. Therefore, we builda part feature ensemble by transferring part locations fromeach of the k nearest neighbors. For every part configura-tion transferred from the nearest neighbors of a test image,parts are represented by computing histograms over colorand texture as introduced before. Final classification resultsare obtained by pooling scores over all transferred part con-figurations using average pooling and the category decisionis done by taking the category with the maximum score. Forclassification, we use a SVM with the χ2-kernel and the ex-plicit feature transformation technique presented in [34].

4.3. Additional global feature extraction

Instead of relying only on part feature representations,we also compute a global representation using the whole

Figure 7. Example images of the CUB-2011 dataset [36] show-ing exploited part annotations (top) and corresponding regions forfeature extraction (bottom). Occluded parts are displayed in black.

bounding box. Feature types are the same as used for de-scribing parts (bag-of-visual words with OpponentSIFT andcolor names) but with additional spatial pyramid pooling.Furthermore, the influence of background clutter will bemuch higher here. Therefore, we apply GrabCut segmenta-tion [28] to estimate the foreground. The algorithm of [28]performs iterative segmentation with a conditional Markovrandom field, where unary potentials are modeled with aGaussian mixture model re-estimated in each iteration, andpairwise potentials are added to favor strong image edges.Some example results of this segmentation technique can beseen in Figure 6.

5. Experiments5.1. Experimental setup

We evaluate our approach on the CUB-2010 [37] and theCUB-2011 dataset [36], which are the common benchmarksfor fine-grained recognition algorithms. Both datasets con-sist of 200 different bird species and provide a boundingbox for each image. In contrast to the CUB-2010 dataset,the CUB-2011 dataset also contains part annotations for 15parts, e.g., left and right eye, or beak (see Figure 7 for ex-ample images), which is necessary for our part transfer al-gorithm and also the information exploited in [3, 42]. Wepresent results computed on the whole CUB-2011 datasetas well as on the commonly used 14 class subset [3]. Ourresults on CUB-2010 are based on estimating part positionsfor test as well as training images by using our nonparamet-ric part detection approach to transfer part locations fromthe CUB-2011 dataset. Note that using a few additionalannotations during training is a common strategy for thisdataset, for example [42] uses manual head and body partannotations and [9] utilizes discriminative parts obtainedwith an annotation game.

We compare our approach with the recent state-of-the-art in fine-grained recognition: deformable part descriptors(DPD) [42], part-based one-vs-one features (POOF) [3],pooling-invariant image feature learning (PDL) [17], thetemplate learning approach of [38], the segmentation tech-

Approach CUB-2011/14 CUB-2011/200

PDL [17] - 38.91%Template learning [38] - 43.67%DPD [42] - 50.98%POOF [3] 70.10% 56.78%

Ours, DPM 67.59% 39.18%Ours, part transfer 69.85% 54.76%Ours, part transferensemble with k = 5 73.86% 57.84%

Table 1. Mean accuracy results on CUB-2011.

nique of [1], and the Bubblebank method of [9]. Further-more, we also compare our approach with a DPM baselinewith five components, where part locations are initialized ina supervised manner during training using the provided partannotations. After part detection, part and global featuresare calculated as described before.

5.2. EvaluationCUB-2011 Recognition results for the CUB-2011 datasetare given in Table 1. As can be seen, our method outper-forms previous methods for the 14 class subset as well asfor the whole 200 classes dataset. Furthermore, as expectedwe gain by using our part transfer instead of a deformablepart model (part transfer vs. DPM with over 15% gain forthe whole dataset), and also by using a part transfer ensem-ble instead of only a single part representation (part transferensemble vs. part transfer with at least 3% gain for bothsettings). A qualitative comparison is given in Figure 8. Incontrast to the DPM baseline, our approach is able to prop-erly transfer parts and thereby to avoid misclassificationsin nearly all cases. An example where our method failsis given in the lower left image, where the background isheavily cluttered and thereby misleads our HOG-based parttransfer technique.

It should be noted that the best but still unpublished(although available on arXiv) approach on the CUB-2011dataset is currently a deep learning version [10] of the DPDapproach [42]. This approach achieved a performance of64.96% on the whole dataset by just exchanging the featurerepresentation inside of DPD with features calculated witha pre-trained deep learning architecture. Therefore, we be-lieve that the performance of our approach can also be im-proved significantly by incorporating deep learning featuresand we consider this as interesting future work.

Compared to [21], we achieved an mAP performance of76.94% ([21]: 62.42%) using 14 classes and 55.36% ([21]:44.13%) using 200 classes.CUB-2010 The results on the CUB-2010 dataset aregiven in Table 2 and show that we are consistently able tooutperform previous methods. Note that for this dataset,we only estimate part locations by part transfer from CUB-2011, but do not use additional training examples from the

DPM:

Ours: 0.3917

-0.4545

Score:

+

Result:Est. Class:

'Mourning Warbler'

'Nashville Warbler'

DPM:

Ours: 0.2970

-0.8739

Score:

+

Result:Est. Class:

'Red Headed Woodpecker'

'White breasted Nuthatch'

+Result:

0.7628

-0.8140

Score:

-0.4184

Score:

0.2345

DPM:

Ours:

+

Result:Est. Class:

'Red faced Cormorant'

'Ruby throated Hummingbird'

DPM:

Ours:

Est. Class:

'Hooded Oriole'

'Yellow Headed Blackbird'

DPM:

Ours: 0.3808

-0.6085

Score:

+

Result:Est. Class:

'Eared Grebe'

'Clark Nutcracker'

DPM:

Ours: 0.3172

-0.6267

Score:

+

Result:Est. Class:

'Lazuli Bunting'

'Blue headed Vireo'

Figure 8. Qualitative evaluation on CUB-2011. We displayed cases where our approach and the DPM baseline show significant differences.

Approach CUB-2010

Template learning [38] 28.20%Segmentation [1] 30.20%Bubblebank [9] 32.50%DPD [42] 34.50%Ours, part transfer 35.94%

Table 2. Mean accuracy results on CUB-2010.

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

normalized rank

no

rma

lize

d e

uclid

ea

n e

rro

ro

f p

art

de

tectio

n

back (Our approach)nape (Our approach)left leg (Our approach)back (DPM)nape (DPM)left leg (DPM)

60

65

70

75

80

85

90

95

100

0 5 10 15 20 25 30 35 40

me

an

accu

racy o

n C

UB

-20

11

/14

std. dev. of the part location noise added to gt.

Our approach (noisy ground-truth)Our approach (part transfer)

Our approach (DPM only)

Figure 9. Left: Percentile plot of the error in part positions forthree parts. Right: Dependency of part detection and classificationaccuracy (CUB-2011/14 dataset).

CUB-2011 dataset for feature extraction. The most interest-ing result here is that we are able to outperform the Bubble-bank approach of [9] despite of their heavy use of crowd-sourced relevant part annotations.

Analysis of individual part detection In the following,we analyze the detection error of individual parts and com-pare our part transfer method with the DPM baseline, wheresemantic parts are used for initialization. We plotted the er-ror distribution for three parts (back, nape, and leg) as apercentile plot in Figure 9 (left figure), where the error isdefined as the normalized Euclidean distance between es-timated and ground-truth part positions. First, it can beseen that the error distribution varies significantly acrossparts, which is consistent with the findings of [36]. Fur-thermore, and more importantly, our simple part transferapproach leads to significantly lower errors then the DPM.An analysis of the dependency between part detection andclassification accuracy is finally visualized in the right partof Figure 9. In particular, we use the ground-truth part lo-cations of the test images for our fine-grained recognitionapproach but added zero-mean Gaussian noise to it with a

Method CUB-2011/14

NN part transfer 69.85%. . . without GrabCut masking only 68.09%. . . without part features only 66.08%. . . without global features only 61.56%

Table 3. Detailed feature analysis on CUB-2011 with mean accu-racies. Only specific single aspects are excluded.

varying standard deviation. The plot reveals the part de-tection precision has a significant impact on the resultingclassification performance, another motivation for nonpara-metric detection methods such as the one presented here.Analysis of the feature representation Apart from ourpart detection method, we also proposed feature combi-nations of part and global features as well as using Grab-Cut [28] to reduce the influence of background clutter. Ta-ble 3 gives results of our method with and without one ofthese aspects. As can be seen in the accuracy numbers, acombination of local and global features is important andGrabCut masking can additionally help in general.

6. ConclusionsIn this paper, we tackled the problem of fine-grained

recognition, which is highly challenging particularly due tosevere variations in object poses and viewpoints. We there-fore introduced a nonparametric approach for part detectionwhich is based on transferring part annotations from relatedtraining images to an unseen test image. This allows for afeature extraction step that focuses on those parts of imageswhere discriminative features are likely to be located. Inour experiments, we observed a significant gain over estab-lished part detection techniques like DPMs and also pro-vided a theoretical analysis why DPMs suffer from highpose variations. Additionally, we showed how well-knowntechniques for object recognition, such as the combinationof complementary feature types and image masking, caneasily be added to obtain a simple yet powerful recognitionsystem for fine-grained classification scenarios. Despite ofthe simplicity of our approach, we were able to outperformprevious approaches on the standard benchmark datasets

CUB-2010 and 2011. Furthermore, our paper clearly moti-vates the use of nonparametric part and label transfer tech-niques, and might also help to bridge the gap between thebranches of category-based recognition and instance-levelmatching. For future work, we are interested in exploitat-ing part transfer for lifelong learning settings, e.g. duringthe process of active learning [15]. In addition, supervisedfeature transformations [4] might further help reducing theintra-class variations still apparant in correctly aligned partsof similar bird species.

References[1] A. Angelova and S. Zhu. Efficient object detection and segmentation

for fine-grained recognition. In CVPR, 2013. 6, 7[2] P. N. Belhumeur, D. W. Jacobs, D. J. Kriegman, and N. Kumar.

Localizing parts of faces using a consensus of exemplars. PAMI,35(12):2930–2940, 2013. 3

[3] T. Berg and P. N. Belhumeur. Poof: Part-based one-vs-one featuresfor fine-grained categorization, face verification, and attribute esti-mation. In CVPR, 2013. 1, 4, 6

[4] P. Bodesheim, A. Freytag, E. Rodner, M. Kemmler, and J. Denzler.Kernel null space methods for novelty detection. In CVPR, pages3374–3381, 2013. 8

[5] S. Branson, P. Perona, and S. Belongie. Strong supervision fromweak annotation: Interactive training of deformable part models. InICCV, 2011. 2

[6] Y. Chai, V. S. Lempitsky, and A. Zisserman. Bicos: A bi-level co-segmentation method for image classification. In ICCV, pages 2579–2586, 2011. 2

[7] Y. Chai, E. Rahtu, V. S. Lempitsky, L. J. V. Gool, and A. Zisserman.Tricos: A tri-level class-discriminative co-segmentation method forimage classification. In ECCV, pages 794–807, 2012. 2

[8] N. Dalal and B. Triggs. Histograms of oriented gradients for humandetection. In CVPR, pages 886–893, 2005. 4

[9] J. Deng, J. Krause, and L. Fei-Fei. Fine-grained crowdsourcing forfine-grained recognition. In IEEE CVPR, June 2013. 6, 7

[10] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, andT. Darrell. DeCAF: A Deep Convolutional Activation Feature forGeneric Visual Recognition. ArXiv e-prints, Oct. 2013. 6

[11] K. Duan, D. Parikh, D. J. Crandall, and K. Grauman. Discovering lo-calized attributes for fine-grained recognition. In CVPR, pages 3474–3481, 2012. 1

[12] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zis-serman. The pascal visual object classes (voc) challenge. IJCV,88(2):303–338, June 2010. 1

[13] R. Farrell, O. Oza, N. Zhang, V. I. Morariu, T. Darrell, and L. S.Davis. Birdlets: Subordinate categorization using volumetric primi-tives and pose-normalized appearance. In ICCV, 2009. 2

[14] P. F. Felzenszwalb, R. B. Girshick, D. A. McAllester, and D. Ra-manan. Object detection with discriminatively trained part-basedmodels. PAMI, 32(9):1627–1645, 2010. 1, 3, 4

[15] A. Freytag, E. Rodner, P. Bodesheim, and J. Denzler. Labeling ex-amples that matter: Relevance-based active learning with gaussianprocesses. In GCPR, pages 282–291, 2013. 8

[16] E. Gavves, B. Fernando, C. Snoek, A. Smeulders, and T. Tuytelaars.Fine-grained categorization by alignments. In ICCV, 2013. 3

[17] Y. Jia, O. Vinyals, and T. Darrell. Pooling-Invariant Image FeatureLearning. ArXiv e-prints, Jan. 2013. 6

[18] F. S. Khan, J. van de Weijer, A. D. Bagdanov, and M. Vanrell.Portmanteau vocabularies for multi-cue image representation. InJ. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger,editors, NIPS, pages 1323–1331, 2011. 1, 2, 5

[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classifica-tion with deep convolutional neural networks. In NIPS, 2012. 4

[20] C. Liu, J. Yuen, and A. Torralba. Nonparametric scene parsing vialabel transfer. PAMI, 33(12):2368–2382, 2011. 2

[21] J. Liu and P. N. Belhumeur. Bird part localization using exemplar-based models with enforced pose and subcategory consistency. InICCV, 2013. 3, 6

[22] J. Liu, A. Kanazawa, D. W. Jacobs, and P. N. Belhumeur. Dog breedclassification using part localization. In ECCV, pages 172–185, 2012.1, 2

[23] T. Malisiewicz, A. Gupta, and A. A. Efros. Ensemble of exemplar-svms for object detection and beyond. In ICCV, 2011. 2, 4

[24] G. Martınez-Munoz, N. L. Delgado, E. N. Mortensen, W. Zhang,A. Yamamuro, R. Paasch, N. Payet, D. A. Lytle, L. G. Shapiro,S. Todorovic, A. Moldenke, and T. G. Dietterich. Dictionary-freecategorization of very similar objects via stacked evidence trees. InCVPR, pages 549–556, 2009. 2

[25] M.-E. Nilsback and A. Zisserman. Delving deeper into the whorlof flower segmentation. Image Vision Comput., 28(6):1049–1062,2010. 2

[26] A. of European Records and R. Committees. Taxonomic recommen-dations, 2003. 1

[27] O. M. Parkhi, A. Vedaldi, C. V. Jawahar, and A. Zisserman. Cats anddogs. In CVPR, 2012. 1, 2

[28] C. Rother, V. Kolmogorov, and A. Blake. ”grabcut”: interactive fore-ground extraction using iterated graph cuts. ACM Trans. Graph.,23(3):309–314, 2004. 2, 6, 7

[29] M. Saar-Tsechansky and F. Provost. Handling missing values whenapplying classification models. JMLR, 8:1625–1657, 2007. 5

[30] V. Sreekanth, A. Vedaldi, C. V. Jawahar, and A. Zisserman. General-ized RBF feature maps for efficient detection. In Proceedings of theBMVC, 2010. 4

[31] K. E. A. van de Sande, T. Gevers, and C. G. Snoek. Evaluating colordescriptors for object and scene recognition. PAMI, 32(9):1582–1596, 2010. 5

[32] J. Van De Weijer, C. Schmid, J. Verbeek, and D. Larlus. Learningcolor names for real-world applications. IEEE Transactions on Im-age Processing (TIP), 18(7):1512–1523, 2009. 5

[33] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman. Multiple ker-nels for object detection. In ICCV, pages 606–613, 2009. 4

[34] A. Vedaldi and A. Zisserman. Efficient additive kernels via explicitfeature maps. PAMI, 34(3):480–493, 2012. 5

[35] C. Wah, S. Branson, P. Perona, and S. Belongie. Multiclass recogni-tion and part localization with humans in the loop. In ICCV, 2011.2

[36] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. Thecaltech-ucsd birds-200-2011 dataset. Technical Report CNS-TR-2011-001, Caltech, 2011. 1, 3, 5, 6, 7

[37] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie,and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, Caltech, 2010. 1, 6

[38] S. Yang, L. Bo, J. Wang, and L. Shapiro. Unsupervised templatelearning for fine-grained object recognition. In NIPS, 2012. 1, 6, 7

[39] B. Yao, G. Bradski, and L. Fei-Fei. A codebook-free and annotation-free approach for fine-grained image categorization. In CVPR, 2012.1, 2

[40] B. Yao, A. Khosla, and F.-F. Li. Combining randomization and dis-crimination for fine-grained image categorization. In CVPR, pages1577–1584, 2011. 1

[41] N. Zhang, R. Farrell, and T. Darrell. Pose pooling kernels for sub-category recognition. In CVPR, pages 3665–3672, 2012. 1, 2

[42] N. Zhang, R. Farrell, F. Iandola, and T. Darrell. Deformable partdescriptors for fine-grained recognition and attribute prediction. InICCV, 2013. 1, 2, 6, 7

Nonparametric Part Transfer for Fine-grained...

Documents

Transcript of Nonparametric Part Transfer for Fine-grained...