Deep Representation Learning
for Keypoint localization
Shaoli Huang
Faculty of Engineering and Information Technology
University of Technology Sydney
A thesis submitted for the degree of
Doctor of Philosophy
2017
To my family
Mingjiang Liang and Jingyi Huang
Certificate of Original Authorship
I certify that the work in this thesis has not previously been submitted
for a degree nor has it been submitted as part of requirements for a
degree except as fully acknowledged within the text.
I also certify that the thesis has been written by me. Any help that I
have received in my research work and the preparation of the thesis
itself has been acknowledged. In addition, I certify that all informa-
tion sources and literature used are indicated in the thesis.
Shaoli Huang
Acknowledgements
First and foremost, I would like to thank my supervisor, Prof. Dacheng
Tao, who not only guided me into the field of computer vision but also
provided me with advice on life and careers.
I would also like to thank my parents, my brother and my sisters
for giving me love and support. I am very thankful to my dear wife
Mingjiang Liang, who has been with me these years. She takes care of
the family and allows me to spend more time on my research. I
am also thankful for the unwavering love and general happiness that
she has brought into my life. Along with her, I want to thank my
daughter, Jingyi Huang. She has been a pure joy and has made my
life much more fun. I am also thankful to my mother-in-law Fengying
Lei, who took care of my family while I was writing the thesis.
I also would like to give special thanks to Mingming Gong for numer-
ous discussions that have played a significant role in bringing clarity
to my ideas. I also would like to thank Dr. Jun Li and Dr. Zhe Xu
who spent much time discussing my work with me.
Finally, I would like to thank the colleagues and friends I met in
Sydney: Shirui Pan, Ruxin Wang, Tongliang Liu, Chang Xu, Haishuai
Wang, Huan Fu and so many others.
Abstract
Keypoint localization aims to locate points of interest in the input
image. This technique has become an important tool for many
computer vision tasks such as fine-grained visual categorization, ob-
ject detection, and pose estimation. Tremendous effort, therefore, has
been devoted to improving the performance of keypoint localization.
However, most of the proposed methods supervise keypoint detectors
using a confidence map generated from ground-truth keypoint loca-
tions. Furthermore, the maximum achievable localization accuracy
differs from keypoint to keypoint, because it is determined by the un-
derlying keypoint structures. Thus, the keypoint detector often fails
to detect ambiguous keypoints if trained with strict supervision, that
is, permitting only a small localization error. Training with looser su-
pervision could help detect the ambiguous keypoints, but this comes
at a cost to localization accuracy for those keypoints with distinctive
appearances. In this thesis, we propose hierarchically supervised nets
(HSNs), a method that imposes hierarchical supervision within deep
convolutional neural networks (CNNs) for keypoint localization. To
achieve this, we first propose a fully convolutional Inception network
with several branches of varying depths to obtain hierarchical feature
representations. Then, we build a coarse part detector on top of each
branch of features and a fine part detector which takes features from
all the branches as the input.
Collecting image data with keypoint annotations is harder than collecting
image-level labels. One can build a classification dataset by retrieving
images from Flickr or Google Images with keyword searches followed by a
refinement process, whereas keypoint annotation requires a human to click
the rough location of each keypoint in every image. To address the
problem of insufficient part annotations, we propose a part detection
framework that combines deep representation learning and domain
adaptation within the same training process. We adopt one of the
coarse detectors from HSNs as the baseline and perform a quantita-
tive evaluation on the CUB200-2011 and BirdSnap datasets. Interestingly,
our method trained on images of only 10 species achieves 61.4% PCK
accuracy on the testing set of 190 unseen species.
Finally, we explore the application of keypoint localization in the
task of fine-grained visual categorization. We propose a new part-
based model that consists of a localization module to detect object
parts (where pathway) and a classification module to classify fine-
grained categories at the subordinate level (what pathway). Exper-
imental results reveal that our method with keypoint localization
achieves state-of-the-art performance on the Caltech-UCSD Birds-
200-2011 dataset.
Contents
Contents i
List of Figures v
List of Tables ix
1 Introduction 1
1.1 Objectives and Motivation . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problems and Challenges . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Keypoints Localization . . . . . . . . . . . . . . . . . . . . 3
1.2.2 Human Pose Estimation . . . . . . . . . . . . . . . . . . . 5
1.2.3 Bird Part Localization . . . . . . . . . . . . . . . . . . . . 8
1.3 Convolutional Neural Network . . . . . . . . . . . . . . . . . . . . 9
1.4 Fine-grained Visual Categorization . . . . . . . . . . . . . . . . . 10
1.5 Contributions and Thesis Outline . . . . . . . . . . . . . . . . . . 11
1.5.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5.2 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2 Hierarchically Supervised Nets for Keypoint Localization 14
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.1 Bird part detection . . . . . . . . . . . . . . . . . . . . . . 18
2.2.2 Human pose estimation . . . . . . . . . . . . . . . . . . . . 19
2.3 Hierarchically Supervised Nets . . . . . . . . . . . . . . . . . . . . 20
2.3.1 Network Architecture . . . . . . . . . . . . . . . . . . . . . 20
2.3.2 Learning and Inference . . . . . . . . . . . . . . . . . . . . 25
2.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4.1 Bird Part Localization . . . . . . . . . . . . . . . . . . . . 31
2.4.2 Human Pose Estimation . . . . . . . . . . . . . . . . . . . 33
2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3 Transferring Part Locations Across Fine-grained Categories 35
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2.1 Part Detection . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2.2 Domain Adaptation and Active Learning . . . . . . . . . . 39
3.3 Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3.1 Model Formulation . . . . . . . . . . . . . . . . . . . . . . 39
3.3.2 Optimization with Backpropagation . . . . . . . . . . . . . 43
3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4.1 Datasets and Setting . . . . . . . . . . . . . . . . . . . . . 44
3.4.2 Results and Analysis . . . . . . . . . . . . . . . . . . . . . 45
3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4 Fine-grained Categorization with Part Localization 48
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2.1 Keypoint Localization . . . . . . . . . . . . . . . . . . . . 52
4.2.2 Fine-Grained Visual Categorization . . . . . . . . . . . . . 53
4.3 Part-Stacked CNN . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3.1 Localization Network . . . . . . . . . . . . . . . . . . . . . 57
4.3.2 Classification network . . . . . . . . . . . . . . . . . . . . 58
4.4 Deeper Part-Stacked CNN . . . . . . . . . . . . . . . . . . . . . . 61
4.4.1 Localization Network . . . . . . . . . . . . . . . . . . . . . 62
4.4.2 Classification network . . . . . . . . . . . . . . . . . . . . 67
4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.5.1 Dataset and implementation details . . . . . . . . . . . . . 70
4.5.2 Localization results for PSCNN . . . . . . . . . . . . . . . 70
4.5.3 Classification results for PSCNN . . . . . . . . . . . . . . . 72
4.5.4 Localization Results for DPSCNN . . . . . . . . . . . . . . 74
4.5.5 Classification results for DPSCNN . . . . . . . . . . . . . . 79
4.5.6 Model interpretation . . . . . . . . . . . . . . . . . . . . . 82
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5 Conclusions 87
References 89
List of Figures
1.1 Illustrating the pose estimation problem. . . . . . . . . . . . . . . 6
1.2 Illustrating the challenges of human pose estimation. . . . . . . . 7
1.3 Illustrating the bird part localization problem. . . . . . . . . . . 8
2.1 An illustration of the predicted keypoints from our HSN architec-
ture. The left image contains highly accurate keypoints detected
by the fine detector with strict supervision, the middle image con-
tains keypoints from coarse detectors with loose supervisions, and
the right image shows the final predictions by unifying the fine and
coarse detectors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Network architecture of the hierarchically supervised nets. The
coarse stream learns three coarse detectors using hierarchical super-
visions, while the fine stream learns a fine detector via strict
supervision. Then the coarse predictions and fine predictions are
unified for the final prediction at the inference stage. . . . . . . . 21
2.3 Different methods for obtaining multi-scale features. (a) Inputting
multiple resolutions of an image. (b) Using different sizes of convolutional
filters. (c) Concatenating feature maps of different resolutions. (d) Con-
catenating feature maps from different layers, each of which has
multiple convolutional filters. . . . . . . . . . . . . . . . . . . . . 24
2.4 An illustration of . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.5 Bird part detection results with occlusion, viewpoint variation, clut-
tered background, and pose variation from the test set. . . . . . . 28
2.6 Pose estimation results with occlusion, crowding, deformation, and
low resolution from the COCO test set. . . . . . . . . . . . . . . . 32
3.1 Illustration of the research problem. The source domain contains
part annotations, while parts are not annotated in the target do-
main. Also, the target domain contains species which do not exist
in the source domain. . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2 The proposed architecture consists of three components: a feature
extractor (yellow), a part classifier, and a domain classifier (blue).
All these components share computation in a feed-forward pass.
The feature extractor outputs feature representation as the input
of the other components. The part classifier is designed to find
the part location, while the domain classifier is added to handle the
domain shift between the source and target domains. Note that the
backpropagation gradients that pass from the domain classifier to the
feature extractor are multiplied by a negative constant during the
backpropagation. . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.1 Overview of the proposed approach. We propose to classify fine-
grained categories by modeling the subtle difference from specific
object parts. Beyond classification results, the proposed DPS-CNN
architecture also offers human-understandable instructions on how
to classify highly similar object categories explicitly. . . . . . . . . 49
4.2 Illustration of the localization network. (a). Suppose a certain
layer outputs feature maps with size 3x3, and the corresponding
receptive fields are shown by dashed box. In this paper, we rep-
resent the center of each receptive field with a feature vector at
the corresponding position. (b). The first column is the input
image. In the second image, each black dot is a candidate point
which indicates the center of a receptive field. The final stage is to
determine if a candidate point is a particular part or not. . . . . 54
4.3 The network architecture of the proposed Part-Stacked CNN model.
The model consists of 1) a fully convolutional network for part
landmark localization; 2) a part stream where multiple parts share
the same feature extraction procedure, while being separated by
a novel part crop layer given detected part locations; 3) an ob-
ject stream with lower spatial resolution input images to capture
bounding-box level supervision; and 4) three fully connected layers
to achieve the final classification results based on a concatenated
feature map containing information from all parts and the bound-
ing box. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.4 Demonstration of the localization network. The training process
is denoted inside the dashed box. For inference, a Gaussian kernel
is then introduced to remove noise. The results are M 2D part
locations in the 27× 27 conv5 feature map. . . . . . . . . . . . . 58
4.5 Demonstration of the localization network. Training process is
denoted inside the dashed box. For inference, a Gaussian kernel
is then introduced to remove noise. The results are M 2D part
locations in the 27× 27 conv5 feature map. . . . . . . . . . . . . 62
4.6 Network architecture of the proposed Deeper Part-Stacked CNN.
The model consists of: (1) a fully convolutional network for part
landmark localization; (2) a part stream where multiple parts share
the same feature extraction procedure, while being separated by a
novel part crop layer given detected part locations; (3) an object
stream to capture global information; and (4) Feature fusion layer
with input feature vectors from part stream and object stream to
achieve the final feature representation. . . . . . . . . . . . . . . . 65
4.7 Different strategies for feature fusion which are illustrated in (a)
Fully connected,(b) Scale Sum, (c) Scale Max and (d) Scale Aver-
age Max respectively. . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.8 Typical localization results on CUB-200-2011 test set. We show 6
of the 15 detected parts here. They are: beak (red), belly (green),
crown (blue), right eye (yellow), right leg (magenta), tail (cyan).
Better viewed in color. . . . . . . . . . . . . . . . . . . . . . . . . 71
4.9 Typical localization results on CUB-200-2011 test set. Better viewed
in color. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.10 Feature maps visualization of Inception-4a layer. Each example
image is followed by three rows of top six scoring feature maps,
which are from the part stream, the object stream, and the baseline
BN-Inception network respectively. A red dashed box indicates a fail-
ure case of visualization using the model learned by our approach. 78
4.11 Example of the prediction manual generated by the proposed ap-
proach. Given a test image, the system reports its predicted class
label with some typical exemplar images. Part-based comparison
criteria between the predicted class and its most similar classes
are shown in the right part of the image. The number in brackets
shows the confidence of classifying two categories by introducing
a specific part. We present top three object parts for each pair
of comparison. For each of the parts, three part-center-cropped
patches are shown for the predicted class (upper rows) and the
compared class (lower rows) respectively. . . . . . . . . . . . . . . 86
List of Tables
2.1 Comparison with methods that report per-part PCK(%) and aver-
age PCK(%) on CUB200-2011. The abbreviated part names from
left to right are: Back, Beak, Belly, Breast, Crown, Forehead, Left
Eye, Left Leg, Left Wing, Nape, Right Eye, Right Leg, Right Wing,
Tail, and Throat . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2 Comparison of PCP(%) and over-all PCP(%) on CUB200-2011.
The abbreviated part names from left to right are: Back, Beak,
Belly, Breast, Crown, Forehead, Eye, Leg, Wing, Nape, Tail, and
Throat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3 Performance comparison between using strict supervision only and
hierarchical supervision. . . . . . . . . . . . . . . . . . . . . . . . 30
2.4 Results on COCO keypoint on test-dev and test-standard split . . 30
3.1 Part transferring results for different splits of CUB200-2011 dataset.
Per-part PCKs(%) and mean PCK(%) are given. The abbreviated
part names from left to right are: Back, Beak, Belly, Breast,
Crown, Forehead, Left Eye, Left Leg, Left Wing, Nape, Right Eye,
Right Leg, Right Wing, Tail, and Throat . . . . . . . . . . . . . . 44
3.2 Part transferring from CUB200-2011(Source) to BirdSnap(Target).
Per-part PCKs(%) and mean PCK(%) are given. . . . . . . . . . 45
4.1 APK for each object part in the CUB-200-2011 test set in descend-
ing order. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2 Comparison of different model architectures on localization results.
“conv5” stands for the first 5 convolutional layers in CaffeNet;
“conv6(256)” stands for the additional 1 × 1 convolutional layer
with 256 output channels; “cls” denotes the classification layer
with M + 1 output channels; “gaussian” represents a Gaussian
kernel for smoothing. . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.3 The effect of increasing the number of object parts on the classifi-
cation accuracy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.4 The effect of increasing the number of object parts on the classifi-
cation accuracy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.5 Comparison with state-of-the-art methods on the CUB-200-2011
dataset. To conduct fair comparisons, for all the methods using
deep features, we report their results on the standard seven-layer
architecture (mostly AlexNet except VGG-m for [52]) if possible.
Note that our method achieves comparable results with state-of-
the-art while running in real-time. . . . . . . . . . . . . . . . . . . 74
4.6 Receptive field size of different layers. . . . . . . . . . . . . . . . . 76
4.7 Comparison of per-part PCK(%) and over-all APK(%) on CUB200-
2011. The abbreviated part names from left to right are: Back,
Beak, Belly, Breast, Crown, Forehead, Left Eye, Left Leg, Left Wing,
Nape, Right Eye, Right Leg, Right Wing, Tail, and Throat . . . . 76
4.8 Localization recall of candidate points selected by inception-4a
layer with different α values. The abbreviated part names from
left to right are: Back, Beak, Belly, Breast, Crown, Forehead, Left
Eye, Left Leg, Left Wing, Nape, Right Eye, Right Leg, Right Wing,
Tail, and Throat . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.9 Localization recall of candidate points selected by inception-4a
layer with different α values. The abbreviated part names from
left to right are: Back, Beak, Belly, Breast, Crown, Forehead, Left
Eye, Left Leg, Left Wing, Nape, Right Eye, Right Leg, Right Wing,
Tail, and Throat . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.10 Comparison of different settings of our approach on CUB200-2011 . 80
4.11 Comparison with state-of-the-art methods on the CUB-200-2011
dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Chapter 1
Introduction
1.1 Objectives and Motivation
Keypoint localization refers to the task of finding points of interest in an image.
These points can be divided into feature keypoints and semantic keypoints
according to their intended use in visual applications. Feature keypoints are
mainly used as reference points to outline an object. A typical example is
facial landmark localization [45,128,131], where landmarks are used to represent
facial features and geometry, such as points on the contours of eyebrows, eyes,
nose, and lips. While a single feature keypoint is not required to be semantically
meaningful, each semantic keypoint has a particular meaning for the observed
object. For example, keypoints are defined as human body joints (e.g., wrist,
ankle, hip) or bird parts (e.g., belly, wing, tail). This kind of keypoint carries
valuable information for object recognition, object detection, and pose estimation.
In this thesis, we focus on the problem of localizing semantic keypoints.
Considerable efforts have been devoted to developing a strong part detector
together with a spatial model for keypoint localization. While early methods
focused on designing handcrafted features or developing graphical models of
spatial constraints [56,73,75,116], recent deep-learning-based methods have re-
placed handcrafted features and explicit spatial models with learned represen-
tations [35, 87, 92]. However, these methods usually supervise keypoint
detectors using a confidence map generated from ground-truth keypoint locations.
Furthermore, the maximum achievable localization accuracy differs from keypoint
to keypoint, because it is determined by the underlying keypoint structures. For
example, the keypoints with distinctive appearances, such as the shoulders and
head, can be easily detected with high accuracy, while the keypoints with am-
biguous appearance such as an occluded ankle, have much lower localization ac-
curacies. Thus, the keypoint detector often fails to detect ambiguous keypoints if
trained with strict supervision, that is, permitting only a small localization error.
Training with looser supervision could help detect the ambiguous keypoints, but
this comes at a cost to localization accuracy for those keypoints with distinctive
appearances. In this thesis, we propose hierarchically supervised nets (HSNs), a
method that imposes hierarchical supervision within deep convolutional neural
networks (CNNs) for keypoint localization. To achieve this, we first propose a
fully convolutional Inception network [92] with several branches of varying depths
to obtain hierarchical feature representations. Then, we build a coarse part de-
tector on top of each branch of features and a fine part detector which takes
features from all the branches as the input.
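The contrast between strict and loose supervision can be made concrete. Below is a minimal NumPy sketch of Gaussian confidence-map targets; the map size and sigma values are illustrative placeholders, not the settings used in this thesis:

```python
import numpy as np

def confidence_map(center, shape, sigma):
    """Gaussian confidence map centered on a ground-truth keypoint.

    A small sigma gives a strict target (only near-exact predictions
    score highly); a larger sigma tolerates larger localization errors.
    """
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    d2 = (xs - center[0]) ** 2 + (ys - center[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Strict target for a fine detector vs. loose target for a coarse detector.
strict = confidence_map((32, 20), (64, 64), sigma=1.5)
loose = confidence_map((32, 20), (64, 64), sigma=6.0)
```

Supervising different branches with targets of different widths is one plausible way to realize hierarchical supervision; the thesis's exact formulation is given in Chapter 2.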
Also, for the task of keypoint localization, collecting image data with key-
point annotations is harder than collecting image-level labels. One can build
a classification dataset by retrieving images from Flickr or Google Images with
keyword searches followed by a refinement process, whereas keypoint annotation
requires a human to click the rough location of each keypoint in every image.
Considering the prob-
lem of insufficient part annotations, we aim to design a part detector which can
be trained on data without part annotation. We achieve this by combining deep
representation learning and domain adaptation within the same training process.
To learn feature representations that are discriminative to object parts but in-
variant to the domain shift, we train the network by minimizing the loss of the
part classifier and maximizing the loss of domain classifier. The former enforces
the network to learn discriminative features, while the latter encourages learning
features invariant to the change of domain.
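This minimize–maximize objective is commonly implemented by reversing the sign of the domain-classifier gradient before it reaches the feature extractor. The toy NumPy sketch below illustrates one such update on a hypothetical linear feature extractor and logistic domain classifier; it is a simplified stand-in for the actual network, not the thesis implementation:

```python
import numpy as np

def adversarial_step(W_feat, W_dom, x, d, lam=1.0, lr=0.1):
    """One SGD step with a reversed domain gradient.

    W_feat: linear feature extractor, W_dom: logistic domain classifier,
    x: input vector, d: domain label in {0, 1}.  The domain classifier
    descends its loss, while the reversed (negated, scaled by lam)
    gradient makes the feature extractor ascend that same loss, pushing
    the learned features toward domain invariance.
    """
    f = W_feat @ x                              # feature vector
    p = 1.0 / (1.0 + np.exp(-(W_dom @ f)))      # predicted domain prob.
    g_logit = p - d                             # d(logistic loss)/d(logit)
    g_dom = g_logit * f                         # gradient for W_dom
    g_feat = np.outer(g_logit * W_dom, x)       # gradient reaching W_feat
    W_dom = W_dom - lr * g_dom                  # minimize the domain loss
    W_feat = W_feat - lr * (-lam * g_feat)      # reversed: maximize it
    return W_feat, W_dom
```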
It is also worth noting that the technique of part localization has been used
to boost the performance in many tasks including object detection [24, 53, 130]
and recognition [2, 82], especially for fine-grained categorization, where subtle
differences between fine-grained categories mostly reside in the unique properties
of object parts [6, 16, 62, 78, 120, 126]. Therefore, we explore the application of
keypoint localization in the task of fine-grained visual categorization. We do this
by learning a new part-based CNN that models multiple object parts in a unified
framework. The proposed method consists of a localization module to detect
object parts (where pathway) and a classification module to classify fine-grained
categories at the subordinate level (what pathway).
1.2 Problems and Challenges
1.2.1 Keypoints Localization
Keypoints localization is generally formulated as a probabilistic problem of esti-
mating the posterior distribution p(x|z), where x is the representation of the
keypoints and z is the image features. Therefore, the primary research in key-
points localization can be divided into three categories:
• The models for the representation of the keypoints - x
• The methods for feature extraction and encoding from images - z
• The inference approaches to estimate the posterior - p(x|z)
Keypoint Representation. There are many ways to represent the keypoints
by considering the structural dependencies among them. The simplest way
is to parameterize the keypoints by their spatial locations. For example, x =
{p1, p2, ..., pN}. However, this representation varies with the morphology of a
given individual. To obtain an invariant representation, many methods [3, 4, 72, 73]
encode the keypoints as a kinematic tree, x = {τ, θ_τ, θ_1, θ_2, ..., θ_N}, where τ is the
root node, θ_τ is the orientation of the root node, and {θ_i}_{i=1}^N represents the
orientations of the other keypoints with respect to the root node. Alternatively,
non-tree models have been introduced to model the keypoints as a set of parts,
x = {x_1, x_2, ..., x_N}, where each part encodes information including spatial
position, orientation, and scaling, i.e., x_i = {τ_i, θ_i, s_i}.
Image Features. Image feature extraction is an indispensable component of the
keypoint localization system. Over the years, many hand-crafted features such
as SIFT [59] or HoG [19] have been used to model the salient parts of the image.
In recent years, deep feature representation has been widely used to boost the
performance of parts/joints detection to a new level. Toshev et al. [97] propose
a cascade CNN for keypoint regression. In [95], multiple sizes of filter kernels
are used to simultaneously capture features across scales. Similar to this, [67]
upsamples the feature maps of lower layers and stacks them with that of higher
layers.
Inference. There are many methods proposed to characterize the posterior
distribution at the inference stage. These methods can be divided into three
groups: discriminative models, generative models, and part-based models. Dis-
criminative methods have been demonstrated to be very effective for pose estima-
tion [1, 10, 42, 66, 83, 85, 101]. This class of methods learns the parameters of the
conditional distribution p(x|z) from the given training data. For example, the
simplest method, linear regression [1], first assumes that the body configuration
x is represented by a linear combination of the image features, z, with additive
Gaussian noise, that is,
x = A[z − μ_z] + μ_x + υ,    (1.1)
where υ ∼ N(0, Σ), μ_x = (1/N) ∑_{i=1}^N x_i, and μ_z = (1/N) ∑_{i=1}^N z_i. Then the
conditional distribution is obtained by:
p(x|z) = N(A[z − μ_z] + μ_x, Σ).    (1.2)
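Equations (1.1) and (1.2) amount to a least-squares fit of A on centered data. A small NumPy sketch on synthetic data (dimensions and noise level are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(100, 5))                         # image features z
A_true = rng.normal(size=(3, 5))
X = Z @ A_true.T + 0.01 * rng.normal(size=(100, 3))   # pose parameters x

mu_z, mu_x = Z.mean(axis=0), X.mean(axis=0)
# Least-squares estimate of A in x = A (z - mu_z) + mu_x + noise
A = np.linalg.lstsq(Z - mu_z, X - mu_x, rcond=None)[0].T

def predict(z):
    # Mean of the conditional p(x | z) = N(A (z - mu_z) + mu_x, Sigma)
    return A @ (z - mu_z) + mu_x
```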
Alternatively, the posterior distribution is usually expressed as a product of
a likelihood and a prior in the category of generative models, that is:
p(x|z) ∝ p(z|x)p(x). (1.3)
Most methods in this group adopt the maximum a posteriori probability (MAP)
method to search for the most probable configurations with high prior probability
and likelihood:
x_MAP = argmax_x p(x|z)    (1.4)
This class of methods has not been widely used for pose estimation because of
the high inference complexity. Therefore, part-based models have been introduced
to reduce the search space by representing a pose as a set of parts with connection
constraints. For instance, the body configuration can be represented as a Markov
Random Field (MRF), in which body parts are considered as nodes and potential
functions are used to encode the spatial dependencies between parts. Thus, the
posterior, p(x|z) is given as:
p(x|z) ∝ p(z|x) p(x)
       = p(z|x_1, x_2, ..., x_M) p(x_1, x_2, ..., x_M)
       = ∏_{i=1}^M p(z|x_i) p(x_i) ∏_{(i,j)∈E} p(x_i, x_j)    (1.5)
In such cases, many message-passing methods, such as Belief Propagation (BP),
are used to solve the inference problem efficiently.
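For a chain of parts, max-product message passing reduces to a simple dynamic program (the Viterbi algorithm); the sketch below is a simplified stand-in for general belief propagation on tree-structured models:

```python
import numpy as np

def chain_map(unary, pairwise):
    """MAP assignment for a chain of M parts over K candidate locations.

    unary: (M, K) log-scores, playing the role of log p(z|x_i) p(x_i);
    pairwise: (K, K) log-potentials, playing the role of log p(x_i, x_j)
    for consecutive parts.  Forward max-product messages followed by
    backtracking recover the jointly most probable configuration.
    """
    M, K = unary.shape
    msg, back = unary[0], []
    for i in range(1, M):
        scores = msg[:, None] + pairwise + unary[i][None, :]
        back.append(scores.argmax(axis=0))   # best predecessor per state
        msg = scores.max(axis=0)
    path = [int(msg.argmax())]
    for b in reversed(back):
        path.append(int(b[path[-1]]))
    return path[::-1]                        # one location index per part
```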
1.2.2 Human Pose Estimation
The task of human pose estimation aims to recover the body configuration from
image features. As shown in Figure 1.1, the key step for this task is to localize
the body joints, with which we can depict the limbs and understand a person’s
posture in images. Human pose estimation is a very active research topic in
computer vision because many real-world applications can benefit tremendously
from such a technology. For instance, human pose estimation can be used to
analyze human behaviors in smart surveillance systems, to help health-care robots
in detecting fallen people, to develop animation in movie production, and to
interact with computers in gaming; many driver-assistance systems even utilize
this technique to monitor the driver’s pose for safe driving.
Despite the exhaustive research, pose estimation remains a challenging task
in computer vision, mainly due to the following reasons (see Figure 1.2):
• extremely deformable body.
• self-occlusion, where body parts occlude each other.
Figure 1.1: Illustrating the pose estimation problem (keypoints: eyes, ears, nose,
shoulders, elbows, wrists, hips, knees, ankles).
Figure 1.2: Illustrating the challenges of human pose estimation.
Figure 1.3: Illustrating the bird part localizatoin problem.
• highly variable appearance due to clothing, lighting, body size, shape, etc.
• pose ambiguities due to blur, background clutter, apparent similarity of
parts, loose clothing, etc.
• crowding.
1.2.3 Bird Part Localization
Part localization models have achieved tremendous success in object detection
[24,53,130] and recognition [2,82] on many occasions. In particular, part models
play a remarkable role in fine-grained categorization (e.g., birds [23, 37, 104, 119,
121, 124], dogs [55, 71], butterflies [108], etc.), since parts usually contain the subtle
differences that serve as the main clues to distinguish fine-grained objects. In this
thesis, we use birds as the test case with the goal of localizing the parts across
species (see Figure 1.3). Though remarkable progress has been made on bird
part localization, this task remains a challenging problem. Major difficulties in
detecting bird parts include:
• the extreme variations in pose (e.g., walking, perching, flying, swimming,
etc.)
• large variations in appearance across species.
• part ambiguities due to some parts closely resembling each other.
• background clutter.
1.3 Convolutional Neural Network
Convolutional neural networks (CNNs, or ConvNets) are a biologically-inspired
variation of traditional multilayer perceptrons (MLPs). Unlike MLPs, CNNs
share the weights of connections between neurons. This sharing strategy can sig-
nificantly reduce the number of trainable parameters and hence increase learning
efficiency. The canonical CNN architecture, developed by Yann LeCun [49], was
first designed to recognize visual patterns in images in 1997. However, CNNs
were not widespread until 2012, when Krizhevsky et al. [47] achieved remarkable
performance on the ImageNet 2012 classification benchmark with CNNs. Since
then, CNNs have been successfully applied to a variety of applications in computer
vision. Meanwhile, recent works on more advanced and deeper architectures such
as VGG [87], Inception [38, 91–93], and ResNet [35] further foster research on
convolutional neural networks.
A convolutional neural network normally consists of three types of layers:
convolutional, pooling, and fully connected layers. The convolutional layer
aims to detect important patterns in the output of the previous layer, while the
pooling layer filters or merges these patterns to obtain more robust features. The
fully connected layer is generally used to map the convolutional features to clas-
sification scores.
Convolutional layers are the key components of CNNs. Each convolutional
layer consists of a group of learnable filters. These filters are spatially small but
extend through all the channels of the input volume. During the forward pass,
each filter is convolved with the input volume to produce a corresponding
feature map; the feature maps are then stacked along the depth dimension to
form the output volume passed to the next layer.
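To make this forward pass concrete, here is a minimal NumPy sketch of a single convolutional layer (no padding or bias, and a naive loop rather than an optimized implementation); the shapes and variable names are ours, not from the thesis.

```python
import numpy as np

def conv_layer(x, filters, stride=1):
    """Sketch of a convolutional layer's forward pass (no padding, no bias).

    x       : input volume of shape (H, W, C_in)
    filters : learnable filters of shape (K, k, k, C_in) -- K small k x k
              filters, each extending through all C_in input channels
    Returns an output volume of shape (H_out, W_out, K): one feature map
    per filter, stacked along the depth dimension.
    """
    H, W, C_in = x.shape
    K, k, _, _ = filters.shape
    H_out = (H - k) // stride + 1
    W_out = (W - k) // stride + 1
    out = np.zeros((H_out, W_out, K))
    for f in range(K):                      # each filter yields one feature map
        for i in range(H_out):
            for j in range(W_out):
                patch = x[i*stride:i*stride+k, j*stride:j*stride+k, :]
                out[i, j, f] = np.sum(patch * filters[f])
    return out

# A 3-channel 8x8 input convolved with four 3x3 filters gives a 6x6x4 volume.
x = np.random.randn(8, 8, 3)
w = np.random.randn(4, 3, 3, 3)
print(conv_layer(x, w).shape)  # (6, 6, 4)
```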
1.4 Fine-grained Visual Categorization
Fine-grained visual categorization (FGVC) refers to the task of identifying ob-
jects from subordinate categories and is now an important subfield in object
recognition. FGVC applications include, for example, recognizing species of
birds [8, 105, 110], pets [44, 71], flowers [5, 68], and cars [61, 89]. Lay individu-
als tend to find it easy to quickly distinguish basic-level categories (e.g., cars or
dogs), but identifying subordinate classes like “Ringed-billed gull” or “California
gull” can be difficult, even for bird experts. Tools that aid in this regard would
be of high practical value.
While numerous attempts have been made to boost the classification accuracy
of FGVC [11,16,21,52,107], an important aspect of the problem has yet to be ad-
dressed, namely the ability to generate a human-understandable “manual” on how
to distinguish fine-grained categories in detail. For example, ecological protection
volunteers would benefit from an algorithm that could not only accurately classify
bird species but also provide brief instructions on how to distinguish very similar
subspecies (a “Ringed-billed” and “California gull”, for instance, differ only in
their beak pattern), aided by some intuitive illustrative examples. Existing fine-
grained recognition methods that aim to provide a visual field guide mostly follow
a “part-based one-vs.-one features” (POOFs) [6–8] routine or employ human-
in-the-loop methods [12, 48, 102]. However, since the amount of available data
requiring interpretation is increasing drastically, a method that simultaneously
implements and interprets FGVC using deep learning methods [47] is now both
possible and advocated.
It is widely acknowledged that the subtle differences between fine-grained cate-
gories mostly reside in the unique properties of object parts [6,16,62,78,120,126].
Therefore, a practical solution to interpreting classification results as human-
understandable manuals is to discover classification criteria from object parts.
Some existing fine-grained datasets provide detailed part annotations including
part landmarks and attributes [61, 105]. However, they are usually associated
with a large number of object parts, which incur a heavy computational bur-
den for both part detection and classification. From this perspective, a method
that follows an object part-aware strategy to provide interpretable prediction cri-
teria at minimal computational cost while dealing with large numbers of parts is
desirable. In this scenario, independently training a large convolutional neural
network (CNN) for each part and then combining them in a unified framework is
impractical [120].
1.5 Contributions and Thesis Outline
In this thesis, we investigate three questions related to the task of keypoint local-
ization: 1) How to design an accurate and efficient CNN architecture for keypoint
localization? 2) How to utilize data without part annotations in training for
keypoint localization? 3) How to incorporate the technique of keypoint
localization into a fine-grained categorization system?
1.5.1 Contributions
• We propose the hierarchically supervised nets (HSNs) for keypoint local-
ization, a method that imposes hierarchical supervision within deep convo-
lutional neural networks (CNNs) for keypoint localization. The approach
significantly outperforms the state-of-the-art methods on both bird part de-
tection and human pose estimation.
• We present a method that learns deep representation while performing do-
main adaptation to address the problem of insufficient annotated data.
• With the technique of keypoint localization, we propose a part-stacked CNN
architecture which achieves state-of-the-art performance on the CUB200-2011
benchmark dataset.
1.5.2 Outline
The outline of the dissertation is as follows:
Chapter 2 presents the idea of using hierarchical supervisor signals within deep
convolutional neural networks (CNNs) for keypoint localization. We introduce the
HSN architecture and describe the details of each component. We also evaluate
the efficacy and generality of our method by conducting experiments on the CUB-
200-2011 bird dataset and the MSCOCO Keypoint dataset.
Chapter 3 focuses on the problem of transferring semantic parts across fine-
grained species. We propose a method that combines part detection and domain
adaptation in the same learning pipeline for keypoint localization. This chapter
first introduces the detailed design of our method. Then, to investigate how many
species of images are sufficient to learn a part detector, we perform a quantitative
evaluation on CUB200-2011. We also evaluate our method on the setting of
transferring parts across datasets.
Chapter 4 explores the effectiveness of using part localization technique in ad-
dressing the problem of fine-grained visual categorization. This chapter presents
two CNN architectures based on the idea of cropping part features for classi-
fication. We also present experimental results and a thorough analysis of the
proposed methods. Specifically, we evaluate the performance from four different
aspects: localization accuracy, classification accuracy, inference efficiency, and
model interpretation.
Chapter 2
Hierarchically Supervised Nets
for Keypoint Localization
In this chapter, we propose hierarchically supervised nets (HSNs), a method
that imposes hierarchical supervision within deep convolutional neural networks
(CNNs) for keypoint localization. Recent CNN-based keypoint localization meth-
ods supervise detectors using a confidence map generated from ground-truth key-
point locations. However, the maximum achievable localization accuracy varies
from keypoint to keypoint, as it is determined by the underlying keypoint struc-
tures. To account for this kind of diversity, we propose to supervise part detec-
tors built on hierarchical features in CNNs using hierarchical supervisor signals.
Specifically, we develop a fully convolutional Inception network composed of sev-
eral branches of coarse detectors, each of which is built on top of a feature layer in
CNNs, and a fine detector built on top of multiple feature layers. These branches
are supervised by a hierarchy of confidence maps with different levels of strictness.
All the branches of detectors are unified in a principled way to produce the final accurate
keypoint locations. We demonstrate the efficacy, efficiency, and generality of our
method on several benchmarks for multiple tasks including bird part localization
and human body pose estimation. In particular, our method achieves 72.2% AP on
the 2016 COCO Keypoints Challenge dataset, which is an 18% improvement over
the winning entry.
Figure 2.1: An illustration of the predicted keypoints from our HSN architecture. The left image contains highly accurate keypoints detected by the fine detector with strict supervision, the middle image contains keypoints from coarse detectors with loose supervision, and the right image shows the final predictions obtained by unifying the fine and coarse detectors.
2.1 Introduction
Predicting a set of semantic keypoints, such as human body joints or bird parts,
is an essential component of understanding objects in images. For example, key-
points help align objects and reveal subtle differences that are useful for han-
dling problems with small inter-class variations such as fine-grained catego-
rization [37,119,124]. Also, a key component of a human pose estimation system
is localizing the body joints [74,88,95], with which we can depict the limbs and
understand a person’s posture in images.
Despite dramatic progress over recent years, keypoint prediction remains a
significant challenge due to appearance variations, pose changes, and occlusions.
For instance, the local appearances of bird parts may differ vastly across species
or different poses (e.g . perching, flying, and walking). Localizing keypoints on
the human body must be invariant to appearance changes caused by factors like
clothing and lighting, and robust to large layout changes of parts due to articu-
lations [95]. To tackle these difficulties, early works combined handcrafted part
appearance features with an associated spatial model to capture both local
and global information [56, 73, 75, 116]. Recently, convolutional neural networks
(CNNs) [35,87,92] have significantly reshaped the conventional pipeline by replac-
ing handcrafted features and explicit spatial models with more powerful learned
hierarchical representations [67, 84, 95, 109]. The hierarchical representations in
CNNs provide a natural way to implicitly model part appearances and spa-
tial interactions between parts. Thus, considerable effort has been placed into
leveraging hierarchical features in CNNs to build a fine keypoint detector which
is expected to possess high localization accuracy [14,70].
Existing CNN-based keypoint localization methods usually supervise keypoint
detectors using a confidence map generated from ground-truth keypoint locations.
However, the maximum achievable localization accuracy differs from keypoint to
keypoint, because it is determined by the underlying keypoint structures. For
example, the keypoints with distinctive appearances, such as the shoulders and
head, can be easily detected with high accuracy, while the keypoints with am-
biguous appearance such as an occluded ankle, have much lower localization ac-
curacies. Thus, the keypoint detector often fails to detect ambiguous keypoints if
trained with strict supervision, that is, permitting only a small localization error.
Training with looser supervision could help detect the ambiguous keypoints, but
this comes at a cost to localization accuracy for those keypoints with distinctive
appearances.
In this chapter, we propose hierarchically supervised nets (HSNs), a method
that imposes hierarchical supervision within deep convolutional neural networks
(CNNs) for keypoint localization. To achieve this, we first propose a fully
convolutional Inception network [92] with several branches of varying depths to
obtain hierarchical feature representations. Then, we build a coarse part detector
on top of each branch of features and a fine part detector which takes features
from all the branches as the input.
These detectors have different localization abilities and are complementary to
each other. The shallower coarse detectors can produce accurate localizations of
keypoints with distinctive appearances; however, they often fail to detect key-
points with ambiguous appearances. The deeper branches can infer the approx-
imate locations of ambiguous keypoints but at the cost of reduced localization
accuracy for the unambiguous keypoints. Thus, we supervise these branches of
detectors using a hierarchy of confidence maps with strictness levels that are set
according to the localization abilities of the branches. By supervising the part
detectors built on hierarchical features with hierarchical supervisor signals, our
HSN fully explores the diversities of part structures and the diversities of repre-
sentations in CNNs.
Each HSN branch produces keypoints with various localization accuracies,
which are unified to produce the final keypoint locations. As shown in Figure
4.1, the finally detected keypoints include very accurate ones detected by the
fine detector and approximately accurate ones detected by the coarse detectors.
The proposed HSNs outperform state-of-the-art approaches by a large margin
on bird part localization and human pose estimation datasets.
Our main contributions include: (a) we present a strategy of using receptive
fields as candidate boxes to facilitate part localization; (b) we obtain multi-scale
feature representations by concatenating feature maps from multi-level layers with
multiple filter sizes; (c) we design a unified approach to combine the predictions
from multiple detectors; and (d) we introduce a novel framework for generality,
efficiency, and accuracy. We outperform state-of-the-art approaches by a large
margin on the datasets of bird part localization and human pose estimation.
We achieve 88% PCK0.1 and 71.0% PCK0.05, which are 3% and 12% higher
respectively than the previous best methods on the CUB200-2011 dataset. We also
achieve 72.2% mAP on the 2016 COCO Keypoints Challenge dataset which is an
18% improvement over the winning entry.
2.2 Related Works
2.2.1 Bird part detection
Bird parts play a remarkable role in fine-grained categorization, especially in
bird species recognition where parts have subtle differences. Early works focused
on developing handcrafted part appearance features (e.g ., HOG [19]) and spatial
location models (e.g . pictorial models [25]) to capture both local and global infor-
mation. For example, the deformable part model (DPM) [24] has been extended
for bird part localization by incorporating strong supervision or segmentation
masks [16, 122]. Chai et al . [16] demonstrated that DPM-based part detection
and foreground segmentation aid each other if the two tasks are performed to-
gether. However, the constraints enforced by pictorial structures are sometimes
not strong enough to combat noisy detections. To impose stronger geometric
constraints on the part configuration, Liu et al. [54] presented a nonparametric
model called exemplar that enforced pose consistency and subcategory consistency
and transformed the problem of part detection into image matching. Liu et al. [56]
built pair detectors for each part pair from part-pair representations, combining
non-parametric exemplars and parametric regression models.
More recently, methods based on convolutional neural networks (CNNs) have been
used increasingly for this task. Inspired by object proposals in object detection,
part-based R-CNN [120] extracts CNN features from bottom-up proposals and
learns whole-object and part detectors with geometric constraints. Following this
strategy, Shih et al. [84] employed the EdgeBoxes method [132] for proposal generation
and performed keypoint regression with keypoint visibility confidence. To fur-
ther improve the performance of part detection, Zhang et al. [119] introduced
K-nearest neighbors proposals generated around bounding boxes with geometric
constraints from the nearest neighbors. These methods significantly outperform
conventional approaches; however, the proposal generation and feature extrac-
tion are computationally expensive. Our approach avoids proposal generation by
adopting the fully convolutional architecture which was originally proposed for
dense prediction tasks like semantic segmentation [57].
2.2.2 Human pose estimation
Classical approaches to articulated pose estimation adopt graphical models to
explicitly model the correlations and dependencies of the body part locations
[3,20,43,72,94,116]. These models can be classified into tree-structured [3,73,90,
94], and non-tree-structured [20, 43] models. Attempts have also been made to
model complex spatial relationships implicitly based on a sequential prediction
framework which learns the inference procedure directly [73, 75].
Again, the advent of deep CNNs has recently contributed to significant im-
provements in feature representation and, in turn, in human pose
estimation [15, 17, 67, 74, 95, 97, 109, 115]. Toshev et al. [97] directly regressed
x, y joint coordinates with a convolutional network, while more recent work re-
gressed images to confidence maps generated from joint locations [15,67,95,109].
Tompson et al . [95] jointly trained a CNN and a graphical model, incorporat-
ing long-range spatial relations to remove outliers on the regressed confidence
maps. Papandreou et al . [70] proposed to use fully convolutional ResNets [35]
to predict a confidence map and an offset map simultaneously and aggregated
them to obtain accurate predictions. Other works adopted a sequential proce-
dure that refined the predicted confidence maps successively using a series of
convolutional modules [15, 67, 109]. Cao et al . [14] proposed a pose estimation
framework which adopts both explicit spatial modeling and implicit sequential
predictions. In contrast to existing approaches, our approach models the part ap-
pearance and spatial relationships using a single network with several branches to
capture multi-scale information, which is more efficient because it requires no ex-
plicit graphical model-style inference or sequential refinement. Also, we generate
the confidence maps used for supervision according to the localization capability
of each branch.
2.3 Hierarchically Supervised Nets
In this section, we introduce the HSN architecture and describe the details of
each component. We cast keypoint localization as a part detection problem, in
which a subset of parts from a set of candidate regions is selected and labeled
with part classes, e.g . “shoulder,” “ankle,” and “knee.” As illustrated in Figure
3.2, the proposed framework consists of shared base convolutional layers and two
streams of part detectors. The coarse stream consists of three coarse detector
branches, each of which takes as input only features within a specific scale range induced
by the Inception modules. The main difference in these branches is the number
of stacked inception modules, leading to different receptive field sizes. Smaller
receptive fields focus more on capturing local appearances, while larger ones are
more suitable for modeling the spatial dependencies between parts. Therefore we
concatenate feature maps from all the coarse detectors to learn a fine detector
that is expected to provide very accurate localizations. Finally, we learn the
entire network using hierarchical confidence maps, each of which has a strictness
level varying with the localization ability of the corresponding detector.
2.3.1 Network Architecture
Our detection network simultaneously predicts multiple part locations from the
input image. We implement this by following the “recognition using regions”
paradigm [34], which is widely used in object detection [77]. We predefine a set
of square boxes as part candidate regions to perform part localization and feature
extraction concurrently in a network forward pass.
Stride, receptive fields, and depth. We build the detector based on Inception-
v2 [92], a deep neural network architecture that has achieved impressive perfor-
mance in object recognition. In a convolutional network, the stride and receptive
Figure 2.2: Network architecture of the hierarchically supervised nets. The coarse stream learns three coarse detectors using hierarchical supervisions, while the fine stream learns a fine detector via strict supervision. The coarse predictions and fine predictions are then unified into the final prediction in the inference stage.
field sizes increase with depth. Thus, deeper layers encode richer contextual
information to disambiguate different parts at the cost of reduced localization
accuracy. To balance part classification and localization accuracy, we employ the
features in the Inception (4a-4c) layers to train the three coarse detectors. The
stride of the Inception (4a-4c) layers is 16, and the corresponding receptive field
sizes are 107× 107, 139× 139, and 171× 171, respectively. Given an input image
of size 224 × 224, the receptive-field sizes in deeper layers are too large for a part and
may lead to ambiguous detections for closely positioned parts. Thus we increase
the input resolution of the network to 448× 448 so that the receptive field sizes
are appropriate to enclose candidate part regions.
Candidate part regions. To avoid a sliding-window search for possible part
locations, we propose to first identify candidate part regions as done in object
detection. In object detection, the candidate object regions are obtained by
generating region proposals of various sizes and aspect ratios. However, keypoint
localization only aims to infer the central location of parts and so does not require a
bounding box that tightly encloses each part. Thus, we define the part regions
as square regions centered at the ground-truth locations, removing the need
to generate region proposals. Put another way, we assume that all parts have the
same bounding box size and use the regions enclosed by receptive fields (RFs) at
all positions in the feature map as candidate regions. For example, the size of the
Inception (4a) feature map is 28× 28, which means that there are 784 candidate
regions of size 107× 107, which are uniformly spaced on the input image.
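As an illustration, the following sketch enumerates these receptive-field candidate regions for the Inception (4a) branch; the exact padding offset p is an assumption, and the function name is ours.

```python
# Illustrative sketch: enumerating receptive fields as candidate part regions
# for the Inception (4a) branch (28x28 feature map, stride 16, RF 107x107).
# The padding offset p = 1 is an assumption, not a thesis-specified value.
def candidate_regions(fmap_size=28, stride=16, rf=107, p=1):
    regions = []
    for h in range(fmap_size):
        for w in range(fmap_size):
            # centre of the receptive field of position (w, h) on the input
            cw = w * stride - (p - 1) + rf // 2
            ch = h * stride - (p - 1) + rf // 2
            # square candidate region of size rf x rf centred at (cw, ch)
            regions.append((cw - rf // 2, ch - rf // 2, rf, rf))
    return regions

regions = candidate_regions()
print(len(regions))  # 784 candidate regions, uniformly spaced on the input
```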
Feature representation. Using regions enclosed by receptive fields as candidate
part regions simplifies the feature extraction for part detectors. In the proposed
fully convolutional network, the cross-channel vector at a spatial position in the
feature map is used as a feature for the candidate part region associated with
that position. This strategy is efficient as it does not require RoI pooling from
bounding-box features as done in object detection. Also, the fine detector relies
on multi-scale representations obtained by fusing multiple feature layers, each of which is
processed by multiple filter sizes through Inception modules. As shown in
Figure 2.3, there are three popular types of methods for obtaining
multi-scale representations. The first type of
methods (Figure 2.3 (a)) resize the images to multiple resolutions and extract the
pyramid features. The second type of methods, as illustrated in Figure 2.3 (b),
adopt different sizes of convolutional filters. For example, GoogLeNet [92] learns
multiple filters (such as 1 × 1, 3 × 3, and 5 × 5) and concatenates their feature
maps. The last type upsamples the feature maps from higher layers to fit the size
of intermediate feature maps, and then all feature maps from different layers can
be concatenated to form the multi-scale representation. In contrast, our method
(Figure 2.3 (d)) obtains multi-scale representations by fusing multiple layers each
of which is processed by multiple filter sizes. Specifically, we stack feature maps
from consecutive Inception layers with no downsampling, which allows concate-
nating features from different layers without using upsampling techniques such
as the deconvolutional network. Therefore, the fine detector in our network can
model the appearance of the object parts by features from a large number of
scales.
Hierarchical supervisions. To fully explore the diversities of hierarchical rep-
resentations in CNNs, we simultaneously learn all detectors using the hierarchical
supervisions. As shown in Figure 2.3, each detector has its own appropriate su-
pervision generated according to receptive field size. Specifically, we generate con-
fidence maps for a detector by calculating the intersections between the candidate
part regions and the ground truth part regions. Let $K_c = \{1, \ldots, K\}$ be the set of
part classes, and $D$ denote the number of coarse detector branches. Given an out-
put feature map in the $d$-th branch with size $W \times H$, stride $s$, offset padding $p$, and
receptive field size $r$, each location $(w, h)$ in the output feature map corresponds
to a receptive field $rf(w, h)$ centered at position $(w^*, h^*) = (w, h) \cdot s - (p - 1) + r/2$
in the input image. For an annotated keypoint location $(i, j)$ with class $k \in K_c$,
we define a ground truth region $gt_k(i, j)$ with size $r \times r$ centered at $(i, j)$. To con-
struct a target response map $Y^d$ for the $d$-th detector branch, we set $Y^d(w, h) = k$
if the candidate region $rf(w, h)$ has an Intersection-over-Union (IoU) higher than
0.5 with the ground truth region $gt_k(i, j)$, and set $Y^d(w, h) = 0$ to classify it as
background otherwise. For the fine detector, we generate a strict supervision
map by setting $Y^f(w, h) = k$ if $\| (w^*, h^*) - (i_k, j_k) \|_2 \le 0.05 \cdot ref\_length$, and
$Y^f(w, h) = 0$ otherwise, where $ref\_length$ is the longer side of the object bounding
box. The confidence map hierarchy generated for the detector branches enables
Figure 2.3: Different methods for obtaining multi-scale representations. (a) Input images at multiple resolutions. (b) Using different sizes of convolutional filters. (c) Concatenation of different resolutions of feature maps. (d) Concatenation of feature maps from different layers, each of which has multiple convolutional filters.
Table 2.1: Comparison with methods that report per-part PCK (%) and average PCK (%) on CUB200-2011. The abbreviated part names from left to right are: Back, Beak, Belly, Breast, Crown, Forehead, Left Eye, Left Leg, Left Wing, Nape, Right Eye, Right Leg, Right Wing, Tail, and Throat.

α     Method  Ba   Bk   Be   Br   Cr   Fh   Le   Ll   Lw   Na   Re   Rl   Rw   Ta   Th   Mean
0.1   [124]   85.6 94.9 81.9 84.5 94.8 96.0 95.7 64.6 67.8 90.7 93.8 64.9 69.3 74.7 94.5 83.6
0.1   Ours    88.3 94.5 87.3 91.0 93.0 92.7 93.7 76.9 80.5 93.2 94.0 81.2 79.2 79.7 95.1 88.0
0.05  [124]   46.8 62.5 40.7 45.1 59.8 63.7 66.3 33.7 31.7 54.3 63.8 36.2 33.3 39.6 56.9 49.0
0.05  [118]   66.4 49.2 56.4 60.4 61.0 60.0 66.9 32.3 35.8 53.1 66.3 35.0 37.1 40.9 65.9 52.4
0.05  Ours    64.1 87.9 57.9 65.8 80.9 83.9 90.3 58.0 50.9 79.4 89.6 62.6 51.0 57.9 84.9 70.9
0.02  [124]   9.4  12.7 8.2  9.8  12.2 13.2 11.3 7.8  6.7  11.5 12.5 7.3  6.2  8.2  11.8 9.9
0.02  [118]   18.6 11.5 13.4 14.8 15.3 14.1 20.2 6.4  8.5  12.3 18.4 7.2  8.5  8.6  17.9 13.0
0.02  Ours    19.6 40.7 15.7 19.0 33.1 36.0 47.8 20.1 13.1 28.9 47.1 20.9 14.4 18.3 34.1 27.3
Table 2.2: Comparison of per-part PCP (%) and overall PCP (%) on CUB200-2011. The abbreviated part names from left to right are: Back, Beak, Belly, Breast, Crown, Forehead, Eye, Leg, Wing, Nape, Tail, and Throat.

Method       Ba   Bk   Be   Br   Cr   Fh   Ey   Le   Wi   Na   Ta   Th   Total
[54]         62.1 49.0 69.0 67.0 72.9 58.5 55.7 40.7 71.6 70.8 40.2 70.8 59.7
[56]         64.5 61.2 71.7 70.5 76.8 72.0 70.0 45.0 74.4 79.3 46.2 80.0 66.7
[84]         74.9 51.8 81.8 77.8 77.7 67.5 61.3 52.9 81.3 76.1 59.2 78.7 69.1
Ours(final)  82.2 57.4 81.3 80.3 75.6 63.0 62.5 70.8 70.8 81.1 59.7 73.5 72.1
detection of keypoints at various localization accuracy levels.
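The construction of the loose coarse maps and the strict fine map above can be sketched as follows; this is a simplified single-keypoint-per-class version, and the grid size, stride, receptive field, and `ref_length` values are illustrative assumptions.

```python
import numpy as np

def iou(boxA, boxB):
    """IoU of two (x, y, size) square boxes."""
    ax, ay, a = boxA
    bx, by, b = boxB
    ix = max(0, min(ax + a, bx + b) - max(ax, bx))
    iy = max(0, min(ay + a, by + b) - max(ay, by))
    inter = ix * iy
    return inter / (a * a + b * b - inter)

def target_maps(keypoints, fmap=28, stride=16, rf=107, ref_length=300.0):
    """Sketch of the hierarchical supervisor signals (names are ours).

    keypoints : dict {class k: (i, j)} of annotated keypoint locations.
    Returns a loose coarse target map Y^d (IoU > 0.5 with the r x r
    ground-truth region) and a strict fine target map Y^f (receptive-field
    centre within 0.05 * ref_length of the keypoint); 0 means background.
    """
    Yd = np.zeros((fmap, fmap), dtype=int)
    Yf = np.zeros((fmap, fmap), dtype=int)
    for h in range(fmap):
        for w in range(fmap):
            cw, ch = w * stride + rf // 2, h * stride + rf // 2  # RF centre
            cand = (cw - rf // 2, ch - rf // 2, rf)
            for k, (i, j) in keypoints.items():
                if iou(cand, (i - rf // 2, j - rf // 2, rf)) > 0.5:
                    Yd[h, w] = k
                if np.hypot(cw - i, ch - j) <= 0.05 * ref_length:
                    Yf[h, w] = k
    return Yd, Yf
```

The strict map labels far fewer positions as foreground than the loose one, which is exactly the intended strictness hierarchy.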
2.3.2 Learning and Inference
We build diversified part detectors using fully convolutional architectures with
different depths and supervisions. For efficient inference, we simultaneously learn
all the detection networks with shared base convolutional layers by minimizing a
multi-task loss.
Learning. Let $\sigma^d = \varphi(X, W, \Phi^d, \Phi^d_{cls})$ be the last feature maps of size $W \times H \times C$
in the $d$-th detector branch given input image $X$, shared weights $W$, unshared
weights $\Phi^d$ in the feature layers, and unshared weights $\Phi^d_{cls}$ in the classifier layer,
respectively. We add one more channel to model the background class and
thereby $C = |K_c| + 1$. We use the hierarchical confidence maps described in
Figure 2.3 as supervisions. Here, we compute the prediction score at the position
Figure 2.4: An illustration of
(w, h, k) in the last feature maps using the softmax function:
$$\mathrm{Pro}^d_{(w,h,k)} = \frac{\exp\!\left(\sigma^d_{(w,h,k)}\right)}{\sum_{k' \in \{0, \ldots, |K_c|\}} \exp\!\left(\sigma^d_{(w,h,k')}\right)}. \qquad (2.1)$$
Therefore, the loss function on a training image for each branch is defined as
below:

$$\ell(X, W, \Phi^d, \Phi^d_{cls}, Y^d) = -\frac{1}{W \times H} \sum_{w=0}^{W-1} \sum_{h=0}^{H-1} \sum_{k=0}^{|K_c|} \mathbf{1}\{Y^d(w, h) = k\} \log\!\left(\mathrm{Pro}^d_{(w,h,k)}\right). \qquad (2.2)$$
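A minimal NumPy sketch of Eqns. 2.1 and 2.2 for one branch follows (naive loops and a numerically stabilized softmax; function and variable names are ours):

```python
import numpy as np

def branch_loss(sigma, Y):
    """Sketch of Eqns. 2.1-2.2: softmax over channels, then the average
    cross-entropy over all W x H positions of one detector branch.

    sigma : last feature maps of a branch, shape (W, H, C) with
            C = |Kc| + 1 (part classes plus background channel 0)
    Y     : target map of shape (W, H) with entries in {0, ..., |Kc|}
    """
    e = np.exp(sigma - sigma.max(axis=2, keepdims=True))  # stable softmax
    pro = e / e.sum(axis=2, keepdims=True)                # Eqn. 2.1
    W, H, _ = sigma.shape
    loss = 0.0
    for w in range(W):
        for h in range(H):
            loss -= np.log(pro[w, h, Y[w, h]])            # Eqn. 2.2 summand
    return loss / (W * H)
```

The multi-task objective of Eqn. 2.3 is then just the sum of `branch_loss` over the coarse branches plus the fine branch.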
The loss function $\ell(X, W, \Phi^f, \Phi^f_{cls}, Y^f)$ for the fine detector is defined similarly
to Eqn. 2.2. Then we use a multi-task loss to train all the coarse detectors and
the fine detector jointly:

$$\mathcal{L}(\Omega, Y) = \sum_{d=1}^{D} \ell(X, W, \Phi^d, \Phi^d_{cls}, Y^d) + \ell(X, W, \Phi^f, \Phi^f_{cls}, Y^f), \qquad (2.3)$$

where $\Omega = \{W, \{\Phi^d, \Phi^d_{cls}\}_{d=1}^{D}, \Phi^f_{cls}\}$, $\Phi^f = \{\Phi^d\}_{d=1}^{D}$, and $Y = \{\{Y^d\}_{d=1}^{D}, Y^f\}$.

Inference. For each detector in the inference stage, we obtain the prediction
scores for all candidate regions through Eqn. 2.1. Then we compute the prediction
map $O^d$ for each part as follows:
$$O^d(w, h, k^*) = \begin{cases} 1 & \text{if } \arg\max_{k} \mathrm{Pro}^d_{(w,h,k)} = k^* \\ 0 & \text{otherwise.} \end{cases} \qquad (2.4)$$
As we use loose supervision for each detector, the results Od have multiple
predicted locations for each part. According to the overlapping receptive field
mechanisms in CNNs, the most precise prediction is around the center of the
predicted locations. Therefore, we obtain a “blur” prediction by convolving the
prediction maps with a 2D Gaussian kernel G and select the location with the
maximum value in the $k$-th channel as the unique prediction $(w^*_k, h^*_k)$ for the $k$-
Figure 2.5: Bird part detection results with occlusion, viewpoint changes, cluttered background, and pose variations from the test set; rows show ground truth and our predictions.
th part. Meanwhile, considering some object parts may be invisible, we set a
threshold θ that controls if the predicted location is a part or background pixel.
Let $g(:,:,k) = O^d(:,:,k^*) * G$; the inferred part locations are then given as:

$$(w^*_k, h^*_k) = \begin{cases} \arg\max_{w,h}\, g(w, h, k) & \text{if } \mathrm{Pro}^d_{(w^*,h^*,k)} > \theta, \\ (-1, -1) & \text{otherwise,} \end{cases} \qquad (2.5)$$

where $\mathrm{Pro}^d_{(w^*,h^*,k)} = \Pr(Y^d(w^*, h^*) = k \mid \sigma^d_{(w^*,h^*)})$.
Unified detection. Our system learns four detectors simultaneously and uni-
fies their outputs into the final prediction. The detectors vary in their ability
to detect the object parts. The fine detector tends to output accurate and reli-
able predictions since it receives stacked features from multiple layers. However,
we observe that it may miss predictions of some occluded parts, which can be
detected by the coarse detectors. To predict a set of parts precisely and as com-
pletely as possible, we combine the outputs from the coarse and fine detectors by
using the strategy that the former serve as assistant predictors for the latter.
Let $(w^*_k, h^*_k)^d$ be the $k$-th part prediction with score $\mathrm{Pro}^d_{(w^*,h^*,k)}$ from
the $d$-th coarse part detector, and $(w^*_k, h^*_k)^f$ be the $k$-th part prediction with score
$\mathrm{Pro}^f_{(w^*,h^*,k)}$ from the fine part detector. Then we obtain the unified detection
using the equation below:

$$(w^{**}_k, h^{**}_k) = \begin{cases} (w^*_k, h^*_k)^f & \text{if } \mathrm{Pro}^f_{(w^*,h^*,k)} \ge \mu, \\ (w^*_k, h^*_k)^{d^*} & \text{otherwise,} \end{cases} \qquad (2.6)$$

where $d^* = \arg\max_d \mathrm{Pro}^d_{(w^*,h^*,k)}$, and $\mu \in [0, 1]$ is a threshold that controls how much
the coarse and fine detectors contribute to the prediction. If μ = 0, only the fine
detector is used for detection, but when μ = 1, the final output is determined by
the coarse detectors.
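Per part, Eqn. 2.6 reduces to a simple fallback rule, sketched below with an assumed μ = 0.5 (the thesis does not specify the value here):

```python
def unify(coarse_preds, fine_pred, mu=0.5):
    """Sketch of Eqn. 2.6 for a single part k (mu = 0.5 is our assumption).

    coarse_preds : list of ((w, h), score) pairs from the D coarse detectors
    fine_pred    : ((w, h), score) from the fine detector
    """
    (wf, hf), sf = fine_pred
    if sf >= mu:                      # trust the fine detector when confident
        return (wf, hf)
    # otherwise fall back to the most confident coarse detector (d*)
    (wd, hd), _ = max(coarse_preds, key=lambda p: p[1])
    return (wd, hd)

print(unify([((3, 4), 0.7), ((5, 6), 0.9)], ((1, 2), 0.2)))  # (5, 6)
```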
2.4 Experiments
To evaluate the efficacy and generality of our method, we conduct experiments
on the CUB-200-2011 bird dataset [105] and the MSCOCO Keypoint dataset [51]
Table 2.3: Performance comparison between using strict supervision only and hierarchical supervision.

α     Method      4a(%)  4b(%)  4c(%)  Fine(%)  Unified(%)
0.1   Str-super   66.1   59.6   79.9   80.8     83.7
0.1   Hier-super  79.2   84.9   82.0   80.8     88.0
0.05  Str-super   55.6   49.1   66.6   67.4     69.3
0.05  Hier-super  60.6   59.8   52.4   67.6     71.0
0.02  Str-super   22.5   18.8   26.5   26.8     27.3
0.02  Hier-super  20.9   18.3   14.2   26.5     27.3
Table 2.4: Results on the COCO keypoint test-dev and test-standard splits.

Test-Dev
Method                AP     AP.50  AP.75  AP(M)  AP(L)  AR     AR.50  AR.75  AR(M)  AR(L)
CMU-Pose [14]         0.618  0.849  0.675  0.571  0.682  0.665  0.872  0.718  0.606  0.746
G-RMI [70]            0.605  0.822  0.662  0.576  0.666  0.662  0.866  0.714  0.619  0.722
G-RMI(ext & ens) [70] 0.668  0.863  0.734  0.630  0.733  0.716  0.896  0.776  0.669  0.782
DL-61                 0.544  0.753  0.509  0.583  0.543  0.708  0.827  0.692  0.753  0.768
R4D6                  0.514  0.750  0.559  0.474  0.567  0.563  0.770  0.610  0.499  0.649
umich vl              0.460  0.746  0.484  0.388  0.556  0.518  0.771  0.546  0.407  0.669
belagian              0.419  0.617  0.452  0.300  0.580  0.454  0.630  0.489  0.316  0.639
HSNs(ours)            0.726  0.861  0.697  0.783  0.641  0.892  0.944  0.880  0.940  0.872

Test-Std
Method                AP     AP.50  AP.75  AP(M)  AP(L)  AR     AR.50  AR.75  AR(M)  AR(L)
CMU-Pose [14]         0.611  0.844  0.667  0.558  0.684  0.665  0.872  0.718  0.602  0.749
G-RMI [70]            0.603  0.813  0.656  0.565  0.674  0.666  0.866  0.717  0.620  0.729
G-RMI(ext & ens) [70] 0.658  0.851  0.723  0.629  0.713  0.717  0.895  0.778  0.662  0.792
DL-61                 0.536  0.756  0.490  0.561  0.542  0.712  0.832  0.694  0.750  0.774
R4D6                  0.505  0.745  0.554  0.466  0.563  0.563  0.778  0.612  0.499  0.648
umich vl              0.438  0.730  0.453  0.364  0.537  0.503  0.762  0.524  0.390  0.652
belagian1             0.410  0.607  0.446  0.284  0.576  0.447  0.628  0.485  0.304  0.635
HSNs(ours)            0.722  0.857  0.688  0.786  0.637  0.878  0.936  0.865  0.930  0.863
Our approach significantly exceeds the state-of-the-art methods on both
tasks.
2.4.1 Bird Part Localization
The CUB200-2011 [105] is a widely used dataset for bird part localization. It
contains 200 bird categories and 11, 788 images with roughly 30 training images
per category. Each image has a bounding box and 15 key-point annotations.
To evaluate the localization performance, early approaches [54, 56, 84] mainly
used the percentage of correct parts (PCP) measure, in which a correct part location
should be within 1.5 standard deviations of MTurk workers' clicks from the
ground-truth part locations. Recent methods [118, 124] on this task have
used percentage of correctly localized keypoints (PCK) as the evaluation metric.
According to the PCK criterion used in [124], given an annotated bounding box
of size (w, h), a predicted location is correct if it lies within α × max(h, w) of
the ground-truth keypoint. Here we adopt both the PCP and PCK criteria and
compare our results to the reported performance of the state-of-the-art methods.
We present the PCP results for each part as well as the total PCP results in Table
2.2. Compared to the methods that report PCP results, our method improves
the overall PCP over the second best approach by about 4.3%. Notably, although
previous methods perform poorly on the ’leg’ and ’back’ part detection,
our method achieves up to 33.8% and 9.8% improvements for these two parts over
the next best method. We also report per-part PCK and mean PCK results
compared with other methods for α ∈ {0.1, 0.05, 0.02} in Table 2.1. Here, a
smaller α means a stricter error tolerance in the PCK metric. Our method
outperforms existing techniques at various α settings, which demonstrates that
our approach produces more accurate predictions with higher recall for keypoint
localization. Also, the most striking result is that our approach obtains a 35%
improvement over the second best method using the strict PCK metric. Figure
2.5 shows some results on the CUB200-2011 testing set.
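The PCK criterion above can be computed in a few lines of NumPy; the helper below is an illustrative sketch with hypothetical argument names, not the evaluation code used in the thesis:

```python
import numpy as np

def pck(pred, gt, visible, bbox_wh, alpha=0.1):
    """Percentage of correctly localized keypoints (PCK).

    A predicted keypoint counts as correct if it lies within
    alpha * max(w, h) of the ground-truth location, where (w, h)
    is the size of the annotated bounding box.

    pred, gt: (N, K, 2) arrays of (x, y) keypoint locations.
    visible:  (N, K) boolean mask of annotated keypoints.
    bbox_wh:  (N, 2) bounding-box sizes (w, h) per image.
    """
    thresh = alpha * bbox_wh.max(axis=1)           # (N,) per-image tolerance
    dist = np.linalg.norm(pred - gt, axis=-1)      # (N, K) Euclidean errors
    correct = (dist <= thresh[:, None]) & visible
    return correct.sum() / visible.sum()
```

Shrinking alpha from 0.1 toward 0.02 tightens the tolerance, which is why the numbers in Table 2.3 fall as alpha decreases.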
In order to further understand the performance gains provided by our net-
work structure, we also provide intermediate results of using the strict supervi-
sions and the hierarchical supervision.

Figure 2.6: Pose estimation results with occlusion, crowding, deformation, and low resolution from the COCO test set.

As shown in Table 2.3, using hierarchical
supervisions to learn the parallel convolutional network achieves better perfor-
mance than using the strict supervision alone. This is mainly because imposing
appropriate supervision can significantly improve the accuracy of the coarse de-
tectors, thereby enhancing the performance of the unified detection. Moreover, the
performance gain gradually diminishes as α decreases, because coarse detectors
fail to predict very accurate locations and contribute less to the final predictions.
2.4.2 Human Pose Estimation
The MSCOCO Keypoint dataset consists of 100k people with over 1 million total
annotated keypoints for training and 50k people for validation. The testing set is
unreleased and includes three subsets, “test-challenge,” “test-dev,” and
“test-standard,” each containing about 20k images. The MSCOCO evaluation defines
the object keypoint similarity (OKS) and uses AP (averaged across all 10 OKS
thresholds) as the main metric to evaluate keypoint performance.
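For reference, OKS compares a predicted pose against a ground-truth pose via a per-keypoint Gaussian falloff, following the COCO definition. The sketch below uses hypothetical argument names (`area` for the object area s², `kappa` for the per-keypoint falloff constants):

```python
import numpy as np

def oks(pred, gt, visible, area, kappa):
    """Object keypoint similarity (OKS), COCO-style.

    OKS = sum_i exp(-d_i^2 / (2 s^2 k_i^2)) [v_i > 0] / sum_i [v_i > 0],
    where d_i is the Euclidean distance between the i-th predicted and
    ground-truth keypoints, s^2 is the object area, and k_i is a
    per-keypoint constant controlling the falloff.

    pred, gt: (K, 2) keypoint arrays; visible: (K,) boolean mask;
    area: scalar object area; kappa: (K,) falloff constants.
    """
    d2 = ((pred - gt) ** 2).sum(axis=-1)          # squared distances d_i^2
    e = np.exp(-d2 / (2.0 * area * kappa ** 2))   # per-keypoint similarity
    return e[visible].sum() / visible.sum()
```

AP then averages precision over OKS thresholds from 0.50 to 0.95 in steps of 0.05, which is why high overall AP requires keypoints that are accurate at the strictest thresholds, not merely roughly localized.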
Implementation details. To address the problem of multi-person pose estima-
tion, we adopt the Faster R-CNN framework [77] with a pre-trained model1 on
the MSCOCO dataset to obtain the person bounding boxes. We first crop out
all person instances and resize the long side of each image to 512 pixels while
maintaining its aspect ratio. We pad each resized image with zero pixels and
form a training example of size 512×512. Then we randomly crop the image into
448×448 as the input of the hierarchical supervised nets. We train our model for
300k iterations using SGD with a momentum of 0.9, a batch size of 16, and an
initial learning rate of 0.001 with step decay 100k. We initialize network weights
with a pre-trained model on ImageNet which is available online 2.
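The image preprocessing described above (resize the long side to 512 pixels, zero-pad to a square, then randomly crop 448×448) can be sketched as follows. This is an illustrative NumPy version with nearest-neighbour resizing and a hypothetical helper name, not the actual training code:

```python
import numpy as np

def make_training_example(img, resize_to=512, crop_to=448, rng=None):
    """Sketch of the preprocessing pipeline described above.

    1. Resize the long side to `resize_to` pixels, keeping the aspect ratio.
    2. Zero-pad to a `resize_to` x `resize_to` square.
    3. Randomly crop a `crop_to` x `crop_to` patch as network input.
    """
    rng = rng or np.random.default_rng()
    h, w = img.shape[:2]
    scale = resize_to / max(h, w)
    nh, nw = int(round(h * scale)), int(round(w * scale))
    # Nearest-neighbour resize to stay dependency-free.
    ys = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
    xs = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
    resized = img[ys][:, xs]
    # Zero-pad to a square training canvas.
    padded = np.zeros((resize_to, resize_to) + img.shape[2:], img.dtype)
    padded[:nh, :nw] = resized
    # Random crop as the network input.
    y0 = rng.integers(0, resize_to - crop_to + 1)
    x0 = rng.integers(0, resize_to - crop_to + 1)
    return padded[y0:y0 + crop_to, x0:x0 + crop_to]
```

In a real pipeline the keypoint annotations would be scaled and shifted by the same transform; that bookkeeping is omitted here.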
Results. We evaluate our methods on the“test-dev” and “test-standard” and
obtain the evaluation results on 10 metrics 3 from the online server 4. We com-
pare our keypoint performance with the results from top teams at the MSCOCO
Keypoint Challenge 2016. As can be seen from the Table 2.4, our performance sig-
nificantly surpasses other methods for most of the ten metrics. Most remarkably,
1 https://github.com/rbgirshick/py-faster-rcnn
2 https://github.com/lim0606/caffe-googlenet-bn
3 http://mscoco.org/dataset/#keypoints-eval
4 https://competitions.codalab.org/competitions/12061
on the “test-standard” split, we achieve 0.722 AP, an 18% improvement over the
winning team. Furthermore, we achieve comparable results to the method of [70] for
the AP^{OKS=0.50} and AP^{OKS=0.75} metrics. [70] uses extra data and ensemble models,
while our model is trained on provided data only and outperforms this method
by a large margin on the overall AP metric. Notice that the overall AP is the
average AP across all 10 OKS thresholds. Therefore, the significant performance
improvements for the overall AP and AR again demonstrate that our method has
a strong ability to predict accurate keypoint localizations with high recalls. Fig-
ure 2.6 shows some pose estimation results on the MSCOCO testing set. It is also
worth noting that our Caffe [40] implementation of HSN runs at 48 frames/sec
on a TitanX GPU in the inference stage. Our method allows for real-time human
pose estimation together with a fast person detector.
2.5 Conclusion
In this chapter, we have proposed a hierarchical supervised convolutional network
for keypoint localization on birds and humans. Our method fully explores hier-
archical representations in CNNs by constructing a series of part detectors which
are trained using hierarchical supervision. The hierarchical supervision provides
supervision according to the localization ability of the detectors. The outputs of
all the part detectors are unified in a principled manner to deliver promising performance for
both bird part localization and human pose estimation. In the future, we will go
on to investigate how to incorporate features to generate hierarchical supervisions
and extend our framework to other challenging tasks.
Chapter 3
Transferring Part Locations
Across Fine-grained Categories
The previous chapter showed that using hierarchical supervision within a deep
convolutional neural network can significantly improve the performance of keypoint
localization. In this chapter, we focus on the problem of training a part detector
with insufficient annotation data. We address this problem by incorporating domain
adaptation techniques into deep representation learning. We adopt one of the
coarse detectors from HSNs as the baseline and perform a quantitative evaluation
on the CUB200-2011 and BirdSnap datasets. Interestingly, our method trained on
images of only 10 species achieves 61.4% PCK accuracy on the testing set of 190
unseen species.
3.1 Introduction
One of the biggest catalysts for the success of deep learning is the public avail-
ability of massive labeled data. For example, much of the recent progress in
image understanding can be attributed to the presence of large-scale datasets,
such as Imagenet [79] and COCO [51]. Nevertheless, label annotation is a te-
dious and time-consuming process that requires considerable human effort, especially
for the keypoint localization task, which needs pixel-level annotation. For instance,
the COCO training set for human pose estimation consists of over 100k person
instances and over 1 million labeled keypoints (body joints, e.g. eye, shoulder,
and ankle) in total. Recent successes of part-based methods for species recogni-
tion show keypoint annotation have become increasingly important to the task
of fine-grained visual categorization. However, collecting image data with key-
point annotations is harder than with image labels. One may collect images from
Flickr or Google images by searching keywords and then perform refinement pro-
cesses to build a classification dataset, while keypoint annotation requires human
to click the rough location of the keypoint for each image. Also, the local ap-
pearance around the keypoints accounts for the main differences between species.
Therefore, these raise an interesting question: How many species with keypoint
annotation is sufficient?
Recent works address the problem of insufficient annotations using active
learning algorithms that interactively select the most informative samples from
the unlabeled data. These methods have to re-train the model multiple times,
thereby incurring a high computational cost. Departing from the standard domain
adaptation setting, [69] proposes an auto-validation procedure to perform part
transfer learning. This kind of approach first splits the source data into multiple
domains to characterize the domain shift and then trains a part detector on these
subsets for generalizability, but it does not take full advantage of the information
from the target domain.
In this section, we focus on the problem of part transfer across species (as
illustrated in Figure 3.1) and propose a novel method that aims to learn a
”universal” detector with transferability. Unlike previous works on transferring
part locations that extract fixed feature representations for domain adaptation,
we follow the idea of deep domain adaptation [27] and combine deep representation
learning and domain adaptation within the same training process. We implement
this by imposing a part classifier and a domain classifier on top of a fully
convolutional neural network (FCN) [57]. To learn feature representations that
are discriminative to object parts but invariant to the domain shift, we train the
network by minimizing the loss of the part classifier and maximizing the loss of
domain classifier. The former enforces the network to learn discriminative fea-
tures, while the latter encourages learning features invariant to the change of
domain.
Figure 3.1: Illustration of the research problem. The source domain contains part annotations, while parts are not annotated in the target domain. Also, the target domain contains species which do not exist in the source domain.
The main contributions of this section are: 1) we propose a novel method for
transferring part knowledge to unseen species; 2) we conduct a thorough analysis
to investigate the transferability of models trained on varying numbers of
species; and 3) we provide insights into how many annotated species may be needed
to perform well on unseen species.
3.2 Related Work
3.2.1 Part Detection.
Recent methods for part detection can be categorized into three groups: strongly
supervised, semi-supervised, and unsupervised. The first group directly learns a
strong detector by minimizing the localization error on the training set. Shih et
al. [84] employed the EdgeBoxes method [132] for proposal generation and performed
keypoint regression with keypoint visibility confidence. To further improve part
detection performance, Zhang et al. [119] introduced K-nearest-neighbor proposals
generated around bounding boxes with geometric constraints from the nearest
neighbors. These methods significantly outperform conventional DPM-based
approaches [16, 24, 122]. Many methods proposed for pose estimation also belong
to this group. For example, Wei et al. [109] adopted a sequential prediction
procedure that refines belief maps from previous stages by incorporating
larger-scale information through several training stages. Newell et al. [67]
proposed the hourglass network structure, which processes convolutional features
across all layers in a CNN to predict keypoint locations. However, these methods
tend to overfit the training data and may have difficulty generalizing.
The second category explores semi-supervised training regimes to improve the
generalization accuracy of supervised learning approaches. Classic examples
include leveraging both a strongly-supervised deformable part model (DPM) [24]
and a weakly-supervised DPM to facilitate part localization [122], and refining
the part detector using web images [113]. The last group employs the unsupervised
scenario to find object parts without the need for any part annotation. Xiao et
al. [111] cluster the channels of the last convolutional feature maps into groups
whose responses are strongly related to the part locations. Similarly, Simon et
al. [86] learn a part model using the activation patterns of feature maps but
with constellation constraints. Approaches belonging to this group focus on
learning discriminative parts and may fail to address the problem of semantic
part localization.
3.2.2 Domain Adaptation and Active Learning
Our work also relates to domain adaptation (DA), which learns a classifier from
labeled data for unseen data by aligning the feature distributions of the source
and target domains. Typical methods used in visual applications comprise learning
feature transformations [26, 30, 33] and adapting parameters [33, 114]. Recent
methods perform domain adaptation while learning deep representations: [99]
models the domain shift in the last layer of convolutional networks, and more
recently, [27, 98] train the entire network with an auxiliary classifier to learn
feature representations invariant to domain change. This approach performs very
well in image recognition. In this chapter, we adapt a similar idea to transfer
part localization between different domains. The most significant difference
between [27] and our method is that our method focuses on transferring local
knowledge (part locations) by matching feature distributions from different
domains, while [27] addresses the problem of transferring global knowledge
(object labels).
Active-learning-based algorithms aim to interactively select the most informative
samples from the unlabeled data. Research in this area therefore focuses on
designing data selection strategies using entropy [96], diversity [36], and
representativeness [41]. Nevertheless, the model needs to be retrained after each
data selection, which incurs a high computational cost in training.
3.3 Our Approach
3.3.1 Model Formulation
Figure 3.2: The proposed architecture consists of three components: a feature
extractor (yellow), a part classifier, and a domain classifier (blue). All these
components share computation in a feed-forward pass. The feature extractor outputs
a feature representation that serves as the input to the other components. The
part classifier is designed to find the part locations, while the domain
classifier is added to handle the domain shift between the source and target
domains. Note that the backpropagation gradients that pass from the domain
classifier to the feature extractor are multiplied by a negative constant during
backpropagation.

As illustrated in Figure 3.2, we implement the proposed method using a deep
convolutional network architecture. The overall architecture consists of three
sub-networks, which are used for feature extraction, part classification, and domain
classification respectively.
The key idea of our approach is to minimize the localization error in the
source training set while reducing the distribution variance between the source
and target domain.
Let X_S and X_T be the training sets from the source and target domains
respectively, K_c = {1, . . . , K} be the set of part classes, and Y_d ∈ {0, 1} be
the domain label. Given an input image x, we define Y_d = 0 if x ∈ X_S and
Y_d = 1 if x ∈ X_T. Given an output feature map of size W × H × C, stride s,
offset padding o, and receptive field size r, we now generate the part label map
Y_p of size W × H by calculating the intersections between the candidate part
regions and the ground-truth part regions. Here each location (w, h) in the
output feature map corresponds to a receptive field rf(w, h) centered at position
(w∗, h∗) = (w, h) · s − (o − 1) + r/2 in the input image. We add one more channel
to model the background class, so that C = |K_c| + 1. We then define a
ground-truth region gt_k(i, j) of size r × r centered at the annotated keypoint
location (i, j) with class k ∈ K_c. Finally, each part label map Y_p is generated
by setting Y_p(w, h) = k if the candidate region rf(w, h) has an
Intersection-over-Union (IoU) higher than 0.5 with the ground-truth region
gt_k(i, j), and setting Y_p(w, h) = 0 (background) otherwise.
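The label-map construction above can be sketched as follows; the helper name and the closed-form IoU of two axis-aligned r × r squares are our own illustration:

```python
import numpy as np

def part_label_map(keypoints, W, H, stride, offset, rf):
    """Generate a part label map Y_p over a W x H feature map.

    keypoints: list of (i, j, k) annotated keypoint locations with class k >= 1.
    Each feature-map location (w, h) maps to a receptive field of size rf
    centred at (w, h) * stride - (offset - 1) + rf / 2 in the input image.
    A location is labelled k if its receptive field overlaps the rf x rf
    ground-truth region around a class-k keypoint with IoU > 0.5, and 0
    (background) otherwise.
    """
    Yp = np.zeros((W, H), dtype=int)
    for w in range(W):
        for h in range(H):
            cx = w * stride - (offset - 1) + rf / 2.0
            cy = h * stride - (offset - 1) + rf / 2.0
            for (i, j, k) in keypoints:
                # Overlap of two axis-aligned rf x rf squares.
                ix = max(0.0, rf - abs(cx - i))
                iy = max(0.0, rf - abs(cy - j))
                inter = ix * iy
                iou = inter / (2 * rf * rf - inter)
                if iou > 0.5:
                    Yp[w, h] = k
    return Yp
```

Each keypoint thus labels a small neighbourhood of feature-map locations rather than a single cell, which gives the part classifier denser supervision.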
We now define the loss function for the part classifier. Let σ = φ_f(x, θ_f)
denote the output feature maps of the feature extractor given input image x and
parameters θ_f, where φ_f is the mapping function, and let φ_p(σ, θ_p) denote the
part classifier mapping with parameters θ_p. The prediction score
Pro^p_(w,h,k)(x, θ_f, θ_p) for the kth class at each position (w, h) is then
computed as follows:

    Pro^p_(w,h,k)(x, θ_f, θ_p) = exp(φ_p(σ_(w,h,k), θ_p)) / Σ_{k′ ∈ {0} ∪ K_c} exp(φ_p(σ_(w,h,k′), θ_p)).    (3.1)
Therefore, the loss function for the part classifier is defined as below:

    L_p(x, θ_p, θ_f, Y_p) = −(1 / (|X_S| × W × H)) Σ_{x ∈ X_S} Σ_{w=0}^{W−1} Σ_{h=0}^{H−1} Σ_{k=0}^{|K_c|} 1{Y_p(w, h) = k} log(Pro^p_(w,h,k)(x, θ_f, θ_p)).    (3.2)
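Equations (3.1) and (3.2) amount to a per-location softmax cross-entropy. A single-image NumPy sketch, with a hypothetical helper name and channel 0 treated as the background class:

```python
import numpy as np

def part_loss(logits, label_map):
    """Per-location softmax cross-entropy over part classes, Eqs. (3.1)-(3.2).

    logits:    (W, H, K+1) raw part-classifier outputs (channel 0 = background).
    label_map: (W, H) integer part labels Y_p.
    """
    # Eq. (3.1): softmax over the class channel (shifted for numerical stability).
    z = logits - logits.max(axis=-1, keepdims=True)
    prob = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    W, H, _ = logits.shape
    # Eq. (3.2): average negative log-likelihood of the labelled class.
    nll = -np.log(prob[np.arange(W)[:, None], np.arange(H)[None, :], label_map])
    return nll.mean()
```

Averaging over W × H locations (and, in the full loss, over |X_S| source images) reproduces the normalization factor in Eq. (3.2).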
Similarly, we define the loss function for the domain classifier. Let φ_d(σ, θ_d)
be the domain classifier mapping with parameters θ_d. Then the prediction score is
Pro^d(x, θ_f, θ_d) = φ_d(σ, θ_d), and the loss function is given by:

    L_d(x, θ_d, θ_f, Y_d) = −(1 / (|X_S| + |X_T|)) Σ_{x ∈ X_S ∪ X_T} [Y_d log Pro^d(x, θ_f, θ_d) + (1 − Y_d) log(1 − Pro^d(x, θ_f, θ_d))].    (3.3)
Meanwhile, we expect the features learned by the part classifier to be
domain-invariant. That is, we want the feature distribution of the source domain,
{φ_f(x, θ_f) | x ∈ X_S}, to be similar to that of the target domain,
{φ_f(x, θ_f) | x ∈ X_T}. This can be achieved by learning feature-extractor
parameters θ_f that maximize the loss of the domain classifier and parameters θ_d
that minimize the loss of the domain classifier [27]. Thus, we formulate the
proposed model as below:

    E(x, θ_p, θ_f, θ_d, Y_p, Y_d) = L_p(x, θ_p, θ_f, Y_p) − λ L_d(x, θ_d, θ_f, Y_d),    (3.4)

where λ is a positive parameter that controls the trade-off between the
discriminative ability and the transferability of the learned representation.
Higher values of λ lead to closer feature distributions between the source and
target domains, but may harm the performance of the part detector. In this
chapter, we set λ = 0.95 by empirical tuning.
3.3.2 Optimization with Backpropagation
Here, L_p is the loss function that measures the part classification error, while
L_d measures the classification error for the domain label. We adopt the method
used in [27] to optimize the objective function in Equation (3.4). The saddle
point (θ̂_f, θ̂_p, θ̂_d) is defined by the following equations:

    (θ̂_f, θ̂_p) = argmin_{θ_f, θ_p} E(θ_p, θ_f, θ̂_d),    (3.5)

    θ̂_d = argmax_{θ_d} E(θ̂_p, θ̂_f, θ_d).    (3.6)

We can then use the gradient descent algorithm to optimize the objective function
in Equation (3.4) using the saddle-point definition from Equations (3.5)-(3.6):

    θ_f ← θ_f − μ (∂L_p/∂θ_f − λ ∂L_d/∂θ_f),    (3.7)

    θ_p ← θ_p − μ ∂L_p/∂θ_p,    (3.8)

    θ_d ← θ_d − μ ∂L_d/∂θ_d,    (3.9)

where μ is the learning rate.
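One step of the updates (3.7)-(3.9) can be sketched as below, given precomputed gradients; in practice [27] realizes the negated, scaled term in (3.7) with a gradient reversal layer during backpropagation. The function name and scalar-parameter setup are purely illustrative:

```python
def adversarial_updates(theta_f, theta_p, theta_d,
                        dLp_dtheta_f, dLp_dtheta_p,
                        dLd_dtheta_f, dLd_dtheta_d,
                        mu=0.01, lam=0.95):
    """One SGD step of Eqs. (3.7)-(3.9) on scalar (toy) parameters.

    The feature extractor descends the part loss but ASCENDS the domain
    loss (the -lam * dLd term), making features domain-confusing, while
    the domain classifier itself descends its own loss as usual.
    """
    theta_f = theta_f - mu * (dLp_dtheta_f - lam * dLd_dtheta_f)  # Eq. (3.7)
    theta_p = theta_p - mu * dLp_dtheta_p                          # Eq. (3.8)
    theta_d = theta_d - mu * dLd_dtheta_d                          # Eq. (3.9)
    return theta_f, theta_p, theta_d
```

The opposing signs on the domain-loss gradient are what drive the saddle-point behaviour of Eqs. (3.5)-(3.6): θ_d improves at telling the domains apart while θ_f improves at fooling it.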
Table 3.1: Part transferring results for different splits of the CUB200-2011 dataset. Per-part PCKs (%) and mean PCK (%) are given. The abbreviated part names from left to right are: Back, Beak, Belly, Breast, Crown, Forehead, Left Eye, Left Leg, Left Wing, Nape, Right Eye, Right Leg, Right Wing, Tail, and Throat.

Methods       Ba   Bk   Be   Br   Cr   Fh   Le   Ll   Lw   Na   Re   Rl   Rw   Ta   Th   Mean

Testing on the source domain
S(10)         65.4 81.8 59.1 64.4 87.4 81.4 81.7 43.0 45.6 82.6 82.3 43.9 45.5 54.3 85.5 66.9
S(10)+Adap    61.5 83.1 54.7 67.6 85.3 86.1 77.5 53.7 37.3 81.3 86.2 46.3 47.4 60.4 87.1 67.7
S(20)         70.9 85.2 69.9 75.9 85.7 82.4 91.8 50.7 50.0 84.9 81.7 57.6 61.0 58.3 90.0 73.1
S(20)+Adap    71.1 84.4 75.3 78.0 85.7 81.2 82.0 55.3 52.3 83.9 85.9 56.1 59.9 61.4 88.9 73.4
S(50)         73.3 85.1 76.7 79.7 85.3 83.6 89.4 59.9 63.1 86.0 87.8 60.9 65.4 66.9 89.7 76.9
S(50)+Adap    75.4 84.6 79.8 81.9 84.0 86.7 91.2 59.4 66.7 84.6 89.1 53.2 62.1 70.6 92.8 77.5
S(100)        80.8 86.2 80.4 85.2 88.7 86.5 91.0 71.6 68.0 89.3 88.5 65.2 69.9 73.9 93.1 81.2
S(100)+Adap   81.5 86.5 82.2 86.1 88.2 88.5 87.9 69.2 70.6 89.9 88.5 63.9 70.0 72.1 93.4 81.2

Testing on the target domain
S(10)         40.0 72.0 50.9 53.5 75.5 68.7 70.3 31.7 32.2 58.7 72.7 30.2 28.2 22.9 71.1 51.9
S(10)+Adap    54.3 81.3 54.2 61.0 84.2 80.6 71.1 43.6 40.6 75.0 82.8 33.3 37.3 37.2 83.7 61.4
S(20)         62.4 81.5 66.9 73.2 81.5 79.4 79.7 44.4 48.9 75.5 79.6 51.2 51.3 40.2 85.6 66.8
S(20)+Adap    73.1 84.6 73.3 78.5 84.7 81.5 83.7 53.0 57.4 82.5 83.9 58.2 59.6 54.1 89.6 73.2
S(50)         67.3 84.2 75.5 79.7 84.0 83.5 87.1 68.7 60.7 83.5 87.8 57.0 62.5 55.4 89.9 74.5
S(50)+Adap    74.6 85.1 78.3 81.5 84.7 86.2 90.8 56.8 65.5 83.8 87.9 52.5 63.0 60.8 92.7 76.3
S(100)        75.0 85.9 76.7 84.7 86.6 86.2 91.0 66.8 71.8 87.4 88.4 60.6 73.8 66.0 92.3 79.5
S(100)+Adap   77.7 85.4 77.4 85.2 87.8 87.4 88.0 62.6 69.5 89.9 89.0 59.0 69.9 64.3 93.5 79.0
3.4 Experiments
3.4.1 Datasets and Setting
Datasets. We evaluate our method on two datasets for fine-grained part localization.
(a) CUB200-2011 [105] is a widely used dataset for bird part localization. It
contains 200 bird categories and 11,788 images, with roughly 30 training images
per category. Each image has a bounding box and 15 keypoint annotations. (b)
BirdSnap is a larger bird dataset containing 500 bird species and 49,829 images in
total. This dataset also has an object bounding box and 11 body-part annotations
for each image. To evaluate the localization performance, we use the percentage of
correctly localized keypoints (PCK) as the evaluation metric. In PCK, given a
ground-truth bounding box of size (w, h), a prediction is counted as a true
positive if it lies within α · max(w, h) of the nearest ground-truth keypoint,
where α ∈ (0, 1) controls the error tolerance. In this work, we set α = 0.1 in
all the experiments.
Settings. To evaluate the part transferability, we first perform a quantitative
evaluation on the CUB200-2011 dataset. The training set is split into source and
target domains in four ways: 10, 20, 50, or 100 species are randomly selected for
the source domain, with the remaining 190, 180, 150, or 100 species used for the
target domain. The testing set is split in the same way for performance
evaluation. Then we evaluate part transferability across datasets.

Table 3.2: Part transferring from CUB200-2011 (source) to BirdSnap (target). Per-part PCKs (%) and mean PCK (%) are given.

Methods       Bk   Cr   Na   Le   Re   Be   Br   Ba   Ta   Lw   Rw   Mean

Testing on the source domain
S(CUB)        87.7 88.1 92.0 92.2 92.1 85.8 88.0 83.7 77.8 78.7 76.2 85.6
S(CUB)+Adap   88.5 88.3 92.5 92.2 92.4 86.6 89.0 84.2 79.7 79.2 77.0 86.3

Testing on the target domain
S(CUB)        78.8 81.2 83.5 85.1 84.9 53.0 76.3 69.8 45.1 60.2 61.7 71.0
S(CUB)+Adap   78.0 83.0 84.9 85.5 86.1 56.8 77.2 73.4 47.4 62.0 62.3 72.4
Here, the CUB200-2011 dataset with 200 species is used as source domain, and
the Birdsnap with 500 species is used for the target domain.
3.4.2 Results and Analysis
We use the detector trained on the source training set only as the baseline. We
then compare the performance of our method in different dataset settings against
the baseline detector. Several observations can be made from Table 3.1. First,
performing domain adaptation with target data does not yield a substantial
performance improvement on the source testing set. However, there is a significant
performance gain when performing domain adaptation in settings with a small number
of species in the source domain. For example, our method achieves 61.4% PCK, an
18.3% improvement over the baseline detector, in the setting with ten species used
for the source domain. It is also worth noting that training on 10 species with
part labels and 190 species without part annotations obtains a modest accuracy for
part localization on the 190 unseen species. This demonstrates that semantic parts
can be learned from a sufficiently diverse set of classes even with insufficient
part annotations. However, the improvement diminishes when the number of species
in the source domain is sufficiently large, because the feature distributions of
the source and target domains are already relatively close in this case.
3.5 Conclusions
In this chapter, we focus on the problem of transferring semantic parts across
fine-grained species. We have proposed a deep domain adaptation method for
part detection. Our method combines part detection and domain adaptation in
the same learning pipeline. We have then examined the question of how many
species of images are sufficient to learn a part detector. To answer this question,
we perform a quantitative evaluation on CUB200-2011 and Birdsnap datasets.
Experimental results suggest that a small number of species can be used to learn
a modest detector when training with domain adaptation techniques.
Chapter 4
Fine-grained Categorization with
Part Localization
In this chapter, we start to explore how to incorporate the technique of key-
point localization into the fine-grained categorization system. A well-designed
system for fine-grained categorization usually has three contradictory require-
ments: accuracy (the ability to identify objects among subordinate categories);
interpretability (the ability to provide the human-understandable explanation of
recognition system behavior); and efficiency (the speed of the system). To handle
the trade-off between accuracy and interpretability, we propose the ”Part-Stacked
CNN” and ”Deeper Part-Stacked CNN” architectures, both armed with interpretability.
To obtain information at the part level, we need to know the location of each
part. Hence, we utilize the technique of keypoint localization to obtain part
locations. Next, we crop the part features and fuse them with the object feature
for fine-grained categorization. Our method can therefore simultaneously encode
object-level and part-level cues, thereby outperforming state-of-the-art
approaches on Caltech-UCSD Birds-200-2011.
4.1 Introduction
Fine-grained visual categorization (FGVC) refers to the task of identifying ob-
jects from subordinate categories and is now an important subfield in object
Figure 4.1: Overview of the proposed approach. We propose to classify fine-grained categories by modeling the subtle differences in specific object parts (for example, a California Gull has a beak that is noticeably different from that of a Ring-billed Gull). Beyond classification results, the proposed DPS-CNN architecture also offers human-understandable instructions on how to classify highly similar object categories explicitly.
recognition. FGVC applications include, for example, recognizing species of
birds [8, 105, 110], pets [44, 71], flowers [5, 68], and cars [61, 89]. Lay individu-
als tend to find it easy to quickly distinguish basic-level categories (e.g., cars or
dogs), but identifying subordinate classes like ”Ring-billed gull” or ”California
gull” can be difficult, even for bird experts. Tools that aid in this regard would
be of high practical value.
This task is made challenging by the small inter-class variance caused by subtle
differences between subordinate categories and the large intra-class variance
caused by nuisance factors such as differing pose, multiple views, and occlusions.
However, impressive progress [8,46,103,104,112] has been made over the last few
years, and fine-grained recognition techniques are now close to practical use in
various applications such as for wildlife observation and in surveillance systems.
While numerous attempts have been made to boost the classification accuracy
of FGVC [11,16,21,52,107], an important aspect of the problem has yet to be ad-
dressed, namely the ability to generate a human-understandable ”manual” on how
to distinguish fine-grained categories in detail. For example, ecological protection
volunteers would benefit from an algorithm that could not only accurately classify
bird species but also provide brief instructions on how to distinguish very similar
subspecies (a ”Ring-billed” and a ”California gull”, for instance, differ only in
their beak pattern, see Figure 4.1), aided by some intuitive illustrative exam-
ples. Existing fine-grained recognition methods that aim to provide a visual field
guide mostly follow a ”part-based one-vs.-one features” (POOFs) [6–8] routine
or employ human-in-the-loop methods [12,48,102]. However, since the amount of
available data requiring interpretation is increasing drastically, a method that si-
multaneously implements and interprets FGVC using deep learning methods [47]
is now both possible and advocated.
It is widely acknowledged that the subtle differences between fine-grained cate-
gories mostly reside in the unique properties of object parts [6,16,62,78,120,126].
Therefore, a practical solution to interpreting classification results as human-
understandable manuals is to discover classification criteria from object parts.
Some existing fine-grained datasets provide detailed part annotations including
part landmarks and attributes [61, 105]. However, they are usually associated
with a large number of object parts, which incur a heavy computational bur-
den for both part detection and classification. From this perspective, a method
that follows an object part-aware strategy to provide interpretable prediction cri-
teria at minimal computational effort but deals with large numbers of parts is
desirable. In this scenario, independently training a large convolutional neural
network (CNN) for each part and then combining them in a unified framework is
impractical [120].
Here we address the fine-grained categorization problem not only regarding
accuracy and efficiency when performing subordinate-level object recognition but
also about the interpretable characteristics of the resulting model. We do this by
learning a new part-based CNN for FGVC that models multiple object parts in a
unified framework with high efficiency. Similar to previous fine-grained recogni-
tion approaches, the proposed method consists of a localization module to detect
object parts (where pathway) and a classification module to classify fine-grained
categories at the subordinate level (what pathway). In particular, our keypoint
localization network structure is composed of a sub-network used in contempo-
rary classification networks (AlexNet [47] and BN-GoogleNet [38]) and a 1x1
convolutional layer followed by a softmax layer to predict evidence of part loca-
tions. The inferred part locations are then fed into the classification network, in
which a two-stream architecture is proposed to analyze images at both the ob-
ject level (global information) and part level (local information). Multiple parts
are then computed via a shared feature extraction route, separated directly on
feature maps using a part cropping layer, concatenated, and then fed into a shal-
lower network for object classification. Except for categorical predictions, our
method also generates interpretable classification instructions based on object
parts. Since the proposed deeper network architecture-based framework employs
a sharing strategy that stacks the computation of multiple parts, we call the
proposed architecture based on Alexnet Part-Stacked CNN (PS-CNN), and the
other one used deeper structure Deeper Part-Stacked CNN (DPS-CNN).
This chapter makes the following contributions:
1. DPS-CNN is the first efficient framework that not only achieves state-of-
the-art performance on Caltech-UCSD Birds-200-2011 but also allows in-
terpretation;
2. We explore a new paradigm for keypoint localization that exceeds state-
of-the-art performance on the Birds-200-2011 dataset;
3. The classification network in DPS-CNN follows a two-stream structure that
captures both object level (global) and part level (local) information, in
which a new share-and-divide strategy is presented to compute multiple
object parts. As a result, the proposed architecture is very efficient, running
at 32 frames/sec¹ without sacrificing fine-grained categorization accuracy. We
also propose a new feature fusion strategy called scale mean-max (SMM).
This work is not a direct extension of state-of-the-art fine-grained classification
models [52, 119, 124, 125] but a significant development regarding the following
¹For reference, a single CaffeNet runs at 82 frames/sec under the same experimental setting.
aspects: Different from [124], which adapts an FCN for part localization, we
propose a new paradigm for keypoint localization that first samples a small
number of representative pixels and then determines their labels via a
convolutional layer followed by a softmax layer. We also propose a new network
architecture and enrich the methodology used in [37]. Further, we introduce a
simple but effective part feature encoding method (named Scale Average Max),
in contrast to the Bilinear pooling in [52], the Spatially Weighted Fisher Vector
in [125], and the Part-based Fully Connected layers in [125].
The remainder of this chapter is organized as follows. Related works are sum-
marized in Section 4.2, and the proposed architectures including Part-Stacked
CNN (PS-CNN) and Deeper Part-Stacked CNN (DPS-CNN) are described in Sec-
tion 4.3 and Section 4.4. Detailed performance studies and analysis are presented
in Section 4.5, and in Section 4.6 we conclude and propose various applications
of the proposed DPS-CNN architecture.
4.2 Related Work
4.2.1 Keypoint Localization
Subordinate categories share a fixed number of semantic components defined as
'parts' or 'key points' but with subtle differences in these components. Intuitively,
when distinguishing between two subordinate categories, the widely accepted
approach is to align the components containing these fine differences. Therefore,
localizing parts or key points plays a crucial role in fine-grained recognition, as
demonstrated in recent works [6, 32, 62, 120, 123, 129].
Seminal works in this area have relied on prior knowledge about the global
shape [18,64,65,81]. For example, the active shape model (ASM) uses a mixture
of Gaussian distributions to model the shape. Although these techniques provide
an effective way to locate facial landmarks, they cannot usually handle a wide
range of differences such as those seen in bird species recognition. The other group
of methods [11,50,54,56,84,118–120] trains a set of keypoint detectors to model
local appearance and then uses a spatial model to capture their dependencies and
has become more popular in recent years. Among them, the part localization
method proposed in [50, 84, 119] is most similar to ours. In [84], a convolutional
sub-network is used to predict the bounding box coordinates without using a
region candidate. Although its performance is acceptable because the network is
learned by jointly optimizing the part regression, classification, and alignment,
all parts of the model need to be trained separately. To tackle this problem, [50]
and [119] adopt a pipeline similar to that of Fast R-CNN [31], in which part region
candidates are generated to learn the part detector. In this work, we discard the
common proposal-generating process and regard all receptive field centers¹ of a
certain intermediate layer as potential candidate key points. This strategy results
in a highly efficient localization network since we take advantage of the natural
properties of CNNs to avoid the process of proposal generation.
Our work is also inspired by fully convolutional networks (FCNs) [57], which
produce dense predictions with convolutional networks. However, our network
structure is best regarded as a fast and effective approach to predicting sparse
pixels, since we only need to determine the class labels of the centers of the
receptive fields of interest. Thus, FCN is more suited to segmentation, while
our framework is designed for sparse keypoint detection. FCN predicts
intermediate feature maps and then upsamples them to match the input image
size for pixel-wise prediction. Recent works [109, 124] borrow this idea directly
for keypoint localization. During training, both of these works resize the ground
truths to the size of the output feature maps and then use them to supervise the
network learning, while, during testing, the predicted feature maps are resized to
match the input size to generate the final key point prediction. However, these
methods cannot guarantee accurate position prediction due to the upsampling
process.
4.2.2 Fine-Grained Visual Categorization
Many methods have been developed to classify object categories at the subordi-
nate level. The best-performing methods have gained performance improvements
by exploiting the following three aspects: more discriminative features (including
¹Here the receptive field means the area of the input image to which a location in a higher-layer feature map corresponds.
Figure 4.2: Illustration of the localization network. (a) Suppose a certain layer outputs feature maps of size 3 × 3; the corresponding receptive fields are shown by dashed boxes. We represent the center of each receptive field with the feature vector at the corresponding position. (b) The first column is the input image. In the second image, each black dot is a candidate point indicating the center of a receptive field. The final stage is to determine whether a candidate point corresponds to a particular part.
deep CNNs) for better visual representation [9, 47, 80, 87, 92]; explicit alignment
approaches to eliminate pose displacements [11, 29]; and part-based methods to
examine the impact of object parts [6, 32, 62, 120, 123, 129]. Another approach
has been used to explore human-in-the-loop methods [13, 21, 106] to identify the
most discriminative regions for classifying fine-grained categories. Although such
methods provide direct and important information about how humans perform
fine-grained recognition, they are not scalable due to the need for human inter-
actions during testing. Of these, part-based methods are thought to be most
relevant to fine-grained recognition, since the subtle differences between fine-
grained categories mostly relate to unique object part properties.
Some part-based methods [6,120] employ strong annotations including bound-
ing boxes, part landmarks, or attributes from existing fine-grained recognition
datasets [61, 71, 103, 105]. While strong supervision significantly boosts per-
formance, the expensive human labeling process motivates the use of weakly-
supervised fine-grained recognition without manually labeled part annotations,
i.e., discovering object parts in an unsupervised fashion [46, 52, 86]. Current state-
of-the-art methods for fine-grained recognition include [124] and [52], both of
which employ deep feature encoding methods, whereas our methods are largely
inherited from [120], which first detects the locations of two object parts and
then trains an individual CNN based on the unique properties of each part.
Compared to part-based R-CNN, the proposed methods are far more efficient for
both detection and classification. As a result, we can use many more object parts
than [120] while still maintaining speed during testing.
Lin et al. [52] argued that manually defined parts were sub-optimal for object
recognition and thus proposed a bilinear model consisting of two streams whose
roles were interchangeable as detectors or features. Although this design exploited
a data-driven approach that possibly improves classification performance, it also
made the resulting model difficult to interpret. In contrast, our methods attempt
to balance the need for classification accuracy and model interpretability in fine-
grained recognition systems.
Figure 4.3: The network architecture of the proposed Part-Stacked CNN model. The model consists of 1) a fully convolutional network for part landmark localization; 2) a part stream where multiple parts share the same feature extraction procedure, while being separated by a novel part crop layer given detected part locations; 3) an object stream with lower spatial resolution input images to capture bounding-box-level supervision; and 4) three fully connected layers to achieve the final classification results based on a concatenated feature map containing information from all parts and the bounding box.
4.3 Part-Stacked CNN
We present the model architecture of the proposed Part-Stacked CNN in this
section. In accordance with the common framework for fine-grained recognition,
the proposed architecture is decomposed into a Localization Network (Section
4.3.1) and a Classification Network (Section 4.3.2). We adopt CaffeNet [40], a
slightly modified version of the standard seven-layer AlexNet [47] architecture,
as the basic structure of the network; deeper networks could potentially lead to
better recognition accuracy, but may also result in lower efficiency.
A unique design in our architecture is that the message-transferring operation
from the localization network to the classification network, i.e., using
detected part locations to perform part-based classification, is conducted directly
on the conv5 output feature maps within the process of data forwarding. This is a
significant difference compared to the standard two-stage pipeline of part-based
R-CNN [120] that consecutively localizes object parts and then trains part-specific
CNNs on the detected regions. Based on this design, a set of sharing schemes are
performed to make the proposed PS-CNN fairly efficient for both learning and
inference. Figure 4.3 illustrates the overall network architecture.
4.3.1 Localization Network
The first stage of the proposed architecture is a localization network that aims to
detect the location of object parts. We employ the simplest form of part landmark
annotations, i.e. a 2D key point is annotated at the center of each object part.
Assume that M, the number of object parts labeled in the dataset, is sufficiently
large to offer a complete set of object parts on which fine-grained categories
usually differ from each other. Motivated by recent progress in human pose
estimation [57] and semantic segmentation [95], we adopt a fully convolutional
network (FCN) [63] to generate dense output feature maps for locating object
parts.
We model the part localization process as a multi-class classification problem
on dense output spatial positions. In particular, suppose the output of the last
convolutional layer in the FCN is of size h × w × d, where h and w are
spatial dimensions and d is the number of channels. We set d = M + 1, where M
is the number of object parts and 1 denotes an additional channel to model
the background. To generate corresponding ground-truth labels in the form of
feature maps, units indexed by h×w spatial positions are labeled by their nearest
object part; units that are not close to any of the labeled parts (with an overlap
< 0.5 on receptive field) are labeled as background.
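As a rough sketch of this labeling scheme (the grid size, image size, and the distance threshold standing in for the receptive-field overlap test are illustrative assumptions, not values from the text):

```python
import numpy as np

def make_label_map(keypoints, h=27, w=27, img_size=454, radius=40.0):
    """Build an (h, w) ground-truth label map from 2D part keypoints.

    keypoints: dict mapping part id (1..M) -> (y, x) in input-image
    coordinates; cells with no nearby part get the background label 0.
    """
    labels = np.zeros((h, w), dtype=np.int64)
    sy, sx = img_size / h, img_size / w            # grid cell size in pixels
    for i in range(h):
        for j in range(w):
            cy, cx = (i + 0.5) * sy, (j + 0.5) * sx  # cell centre
            best, best_d = 0, radius
            for part, (py, px) in keypoints.items():
                d = float(np.hypot(cy - py, cx - px))
                if d < best_d:                       # nearest sufficiently close part wins
                    best, best_d = part, d
            labels[i, j] = best
    return labels

labels = make_label_map({1: (227.0, 227.0)})  # one part at the image centre
print(labels[13, 13], labels[0, 0])           # centre cell labelled 1, corner 0
```

Each of the h × w units thus receives the label of its nearest annotated part when one is sufficiently close, and the background label otherwise, mirroring the scheme described above.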
A practical problem here is to determine the model depth and the size of
input images for training the FCN. Generally speaking, layers at later stages
carry more discriminative power and thus are more likely to generate promising
localization results; however, their receptive fields are also much larger than those
of previous layers. For example, the receptive field of conv5 layer in CaffeNet has
a size of 163 × 163 compared to the 227 × 227 input image, which is too large
to model an object part. We propose a simple trick to deal with this problem,
Figure 4.4: Demonstration of the localization network. The training process is denoted inside the dashed box. For inference, a Gaussian kernel is introduced to remove noise. The results are M 2D part locations in the 27 × 27 conv5 feature map.
i.e., upsampling the input images so that the fixed-size receptive fields denoting
object parts become relatively smaller compared to the whole object, while still
being able to use layers at later stages to guarantee enough discriminative power.
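The receptive-field size quoted above can be checked with a short calculation. The layer list below encodes (kernel, stride) pairs for CaffeNet up to conv5 and is an assumption of this sketch, based on the standard AlexNet configuration:

```python
def receptive_field(layers):
    """Compose the receptive-field size of the last layer in `layers`,
    where each entry is a (kernel, stride) pair ordered from the input."""
    rf, jump = 1, 1              # field size and effective stride so far
    for k, s in layers:
        rf += (k - 1) * jump     # each layer widens the field by (k-1) input strides
        jump *= s
    return rf

# conv1, pool1, conv2, pool2, conv3, conv4, conv5
caffenet_to_conv5 = [(11, 4), (3, 2), (5, 1), (3, 2), (3, 1), (3, 1), (3, 1)]
print(receptive_field(caffenet_to_conv5))  # -> 163
```

This reproduces the 163 × 163 conv5 receptive field mentioned above, and makes the upsampling trick concrete: doubling the input resolution halves the receptive field's size relative to the object.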
The localization network in the proposed PS-CNN is illustrated in Figure 4.4.
The input of the FCN is a bounding-box-cropped RGB image, warped and resized
into a fixed size of 454 × 454. The structure of the first five layers is identical
to those in CaffeNet, which leads to a 27 × 27 × 256 output after conv5 layer.
Afterwards, we further introduce a 1 × 1 convolutional layer with 512 output
channels as conv6, and another 1 × 1 convolutional layer with M + 1 outputs
termed conv7 to perform classification. By adopting a spatially preserving softmax
that normalizes predictions at each spatial location of the feature map, the final
loss function is a sum of softmax losses over all 27 × 27 positions.
4.3.2 Classification network
The second stage of the proposed PS-CNN is a classification network that takes
the inferred part locations as input. It follows a two-stream architecture with
a Part Stream and an Object Stream to capture semantics from multiple levels.
A sub-network consisting of three fully connected layers then serves as the
object classifier, as shown in Figure 4.3.
58
Part stream. The part stream acts as the core of the proposed PS-CNN ar-
chitecture. To capture object-part-dependent differences between fine-grained
categories, one could train a set of part CNNs, each of which conducts classi-
fication on one part separately, as proposed by Zhang et al. [120]. Although such
a method worked well in [120], which only employed two object parts, we argue
that it is not applicable when the number of object parts is much larger, as in
our case, because of the high time and space complexity.
In PS-CNN, we introduce two strategies to improve the efficiency of the part
stream. The first one is model parameter sharing. Specifically, model parameters
of the first five convolutional layers are shared among all object parts, which can
be regarded as a generic part-level feature extractor. This strategy leads to fewer
parameters in the proposed architecture and thus reduces the risk of overfitting.
Other than model parameter sharing, we also conduct a computational sharing
strategy. The goal is to make sure that the feature extraction procedure of all
parts only requires one pass through the convolutional layers. Analogous to the
localization network, the input images of the part stream are at the doubled
resolution of 454 × 454 so that the respective receptive fields are not too large to
model object parts; forwarding the network to the conv5 layer generates output
feature maps of size 27 × 27. Up to this point, the computation of all object
parts is completely shared.
After performing the shared feature extraction procedure, the computation
of each object part is then partitioned through a part crop layer to model part-
specific classification cues. For each part, the part crop layer extracts a local
neighborhood region centered at the detected part location. Features outside the
cropped region are simply dropped. In practice, we crop 6 × 6 neighborhood
regions out of the 27 × 27 conv5 feature maps to match the output size of the
object stream. The resultant receptive fields of the cropped feature maps have
a width of 243, given the receptive field size of the conv5 layer and the respective
stride.
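A minimal sketch of the part crop layer's forward pass; clamping the window at the map borders is our assumption for parts detected near the edge:

```python
import numpy as np

def part_crop(fmap, center, size=6):
    """Extract a size x size neighbourhood centred at `center` from a
    (C, H, W) feature map, clamping the window to the map borders.
    Returns the crop and its top-left offset (useful for the backward pass)."""
    _, H, W = fmap.shape
    h, w = center
    top = min(max(h - size // 2, 0), H - size)
    left = min(max(w - size // 2, 0), W - size)
    return fmap[:, top:top + size, left:left + size], (top, left)

fmap = np.zeros((256, 27, 27))              # conv5-like feature maps
crop, offset = part_crop(fmap, (13, 13))    # crop around a detected part
print(crop.shape, offset)                   # (256, 6, 6) (10, 10)
```

Features outside the window are simply dropped, exactly as described above; only the 6 × 6 slice per part flows on to the classifier.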
Object stream. The object stream utilizes bounding-box-level supervision to
capture object-level semantics for fine-grained recognition. It follows the general
architecture of CaffeNet, in which the input of the network is a 227 × 227 RGB
image and the output of the pool5 layer is a set of 6 × 6 feature maps.
We find the design of the two-stream architecture in PS-CNN analogous to
the famous Deformable Part-based Models [24], in which object-level features are
captured by a root filter at a coarser scale, while detailed part-level infor-
mation is modeled by several part filters at a finer scale. It is critical to
measure visual cues from multiple semantic levels in an object recognition algo-
rithm.
Dimensionality reduction and fully connected layers. The aforemen-
tioned two-stream architecture generates an individual feature map for each ob-
ject part and bounding box. When conducting classification, they serve as an
over-complete set of CNN features from multiple scales. Following the standard
CaffeNet architecture, we employ a DNN including three fully connected layers
as object classifiers. The first fully connected layer fc6 now becomes a part con-
catenation layer whose input is generated by stacking the output feature maps of
the part stream and the object stream together. However, such a concatenating
process requires M + 1 times more model parameters than the original fc6 layer
in CaffeNet, which leads to a huge memory cost.
To reduce model parameters, we introduce a 1 × 1 convolutional layer termed
conv5_1 in the part stream that projects the 256-dimensional conv5 output to
32-d. This is equivalent to a low-rank projection of the model output and thus can
be initialized through standard PCA. Nevertheless, in our experiments, we find that
directly initializing the weights of the additional convolution by PCA in practice
worsens the performance. To enable domain-specific fine-tuning from pre-trained
CNN model weights, we train an auxiliary CNN to initialize the weights for the
additional convolutional layer.
Let X^c ∈ R^{N×M×6×6} be the c-th 6 × 6 region cropped around the center point
(h*_c, w*_c) from the conv5_1 feature maps X ∈ R^{N×M×27×27}, where (h*_c, w*_c) is the
predicted location for part c and N is the number of output feature maps. The
output of the part concatenation layer fc6 can be formulated as:

    f_out(X) = σ(∑_{c=1}^{M} (W^c)^T X^c),    (4.1)

where W^c are the model parameters for part c in the fc6 layer, and σ is an
activation function.
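Eq. (4.1) says that applying fc6 to the concatenated part features is equivalent to summing per-part projections. A small numpy sketch follows; the dimensions (32 × 6 × 6 = 1152 per part, 4096 outputs) and the choice of ReLU for σ are illustrative assumptions:

```python
import numpy as np

def fc6_output(part_feats, weights):
    """part_feats: list of M flattened crop features x^c (d-vectors);
    weights: list of M (d, out) matrices W^c.
    Computes sigma(sum_c (W^c)^T x^c) with sigma = ReLU, as in Eq. (4.1)."""
    z = sum(W.T @ x for x, W in zip(part_feats, weights))
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
parts = [rng.normal(size=1152) for _ in range(3)]       # 3 parts, 32*6*6 each
Ws = [rng.normal(size=(1152, 4096)) for _ in range(3)]  # per-part fc6 weights
out = fc6_output(parts, Ws)
print(out.shape)  # (4096,)
```

Summing the projections rather than materializing one huge concatenated vector is what makes the per-part weight blocks W^c explicit.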
We use standard gradient descent to train the classification network. The
most complicated part of the gradient computation lies in the dimension-
reduction layer, due to the effect of part cropping. Specifically, the gradient of
each cropped part feature map (at 6 × 6 spatial resolution) is projected back to
the original size of conv5 (27 × 27 feature maps) according to the respective part
location and then summed up. Note that the proposed PS-CNN is implemented
as a two-stage framework, i.e., after training the FCN, the weights of the
localization network are fixed while training the classification network.
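The backward pass of the part crop layer described above can be sketched as a scatter-add; treating part locations as the top-left crop offsets recorded during the forward pass is an assumption of this sketch:

```python
import numpy as np

def crop_backward(grad_crops, offsets, H=27, W=27, size=6):
    """Project each part's (C, size, size) gradient back onto the
    (C, H, W) conv5 grid at its crop offset and sum where crops overlap."""
    C = grad_crops[0].shape[0]
    grad = np.zeros((C, H, W))
    for g, (top, left) in zip(grad_crops, offsets):
        grad[:, top:top + size, left:left + size] += g
    return grad

g1 = np.ones((8, 6, 6))
g2 = np.ones((8, 6, 6))
grad = crop_backward([g1, g2], [(0, 0), (3, 3)])
print(grad[0, 4, 4])  # 2.0 -- the two crops overlap at this position
```

Positions outside every crop receive zero gradient, matching the fact that features outside the cropped regions were dropped in the forward pass.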
4.4 Deeper Part-Stacked CNN
A key motivation of our proposed method is to produce a fine-grained recogni-
tion system that not only considers recognition accuracy but also addresses effi-
ciency and interpretability. To ensure that the resulting model is interpretable,
we employ strong part-level annotations with the potential to provide human-
understandable classification criteria. We also adopt several strategies such as
sparse prediction instead of dense prediction to eliminate part proposal genera-
tion and to share computation for all part features. For the sake of classification
accuracy, we learn a comprehensive representation by incorporating both global
(object-level) and local (part-level) features. Based on these, in this section, we
present the model architecture of the proposed Deeper Part-Stacked CNN (DPS-
CNN).
According to the common framework for fine-grained recognition, the pro-
posed architecture is decomposed into a localization network (Section 4.4.1) and
a classification network (Section 4.4.2). In our previous work [37], we adopted
CaffeNet [40], a slightly modified version of the standard seven-layer AlexNet
architecture [47], as the basic network structure. In this paper, we use a deeper
and more powerful network (BN-GoogleNet) [38] as a substitute. A unique
feature of our architecture is that the message-transferring operation from the
localization network to the classification network, which uses the detected part
locations to perform part-based classification, is conducted directly on the inception-4a
Figure 4.5: Demonstration of the localization network. The training process is denoted inside the dashed box. For inference, a Gaussian kernel is introduced to remove noise. The results are M 2D part locations in the 28 × 28 conv feature map.
output feature maps within the data forwarding process. This is a significant
departure from the standard two-stage pipeline of part-based R-CNN, which
consecutively localizes object parts and then trains part-specific CNNs on the
detected regions. Based on this design, sharing schemes are performed to make
the proposed DPS-CNN fairly efficient for both learning and inference. Figure
4.6 illustrates the overall network architecture.
4.4.1 Localization Network
The first stage in our proposed architecture is a localization network that aims
to detect the location of object parts. We employ the simplest form of part
landmark annotation, where a 2D key point is annotated at the center of each
object part. Assume that M, the number of object parts labeled in the dataset,
is sufficiently large to offer a complete set of object parts on which fine-grained
categories usually differ. A naive approach to predicting these key points
is to apply FCN architecture [?] for dense pixel-wise prediction. However, this
method biases the learned predictor because, in this task and unlike semantic
segmentation, the number of keypoint annotations is extremely small compared
to the number of irrelevant pixels.
Motivated by the recent progress in object detection [77] and semantic seg-
mentation [57], we propose to use the centers of receptive fields as key point can-
didates and use a fully convolutional network to perform sparse pixel prediction
to locate the key points of object parts (see Figure 4.2(b)). In the field of object
detection, box candidates expected to be likely objects are first extracted using
proposal-generating methods such as selective search [100] and region proposal
networks [77]. Then, CNN features are learned to represent these box candidates
and finally used to determine their class label. We adapt this pipeline to key
point localization but omit the candidate generation process and simply treat the
centers of receptive fields corresponding to a certain layer as candidate points.
As shown in Figure 4.2(a), the advantage of using this method is that each candi-
date point can be represented by a 1D cross-channel feature vector in the output
feature maps. Also, in our candidate point evaluation experiments in Table 4.9,
we find that, given an input image of size 448 × 448, using the receptive fields
of the inception-4a layer in BN-GoogleNet generates 28 × 28 candidate points and
achieves 100% recall at [email protected].
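Under a uniform-grid approximation (ignoring padding offsets at the borders, which is an assumption of this sketch), the candidate points for a 448 × 448 input and a 28 × 28 feature map can be enumerated directly:

```python
def candidate_points(img_size=448, grid=28):
    """Approximate the receptive-field centres of a grid x grid feature map
    as a uniform lattice over the input image (effective stride 16 here)."""
    step = img_size / grid
    coords = [(i + 0.5) * step for i in range(grid)]
    return [(y, x) for y in coords for x in coords]

pts = candidate_points()
print(len(pts), pts[0])  # 784 candidates; first centre at (8.0, 8.0)
```

Each of these 784 grid positions is backed by one cross-channel feature vector in the inception-4a output, which is precisely what makes proposal generation unnecessary.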
Fully convolutional network. An FCN is obtained by replacing the parameter-
rich fully connected layers in standard CNN architectures with convolutional
layers. Given an input RGB image, the output of an FCN is a feature map of
reduced dimension compared to the input. The computation of each unit in the
feature map corresponds only to pixels inside a fixed-size region of the input
image, which is called its receptive field. We prefer FCNs for the following
reasons: (1) feature maps generated by FCNs can be directly utilized as part-
locating results in the classification network, as detailed in Section 4.4.2; (2) the
results for multiple object parts can be obtained simultaneously; (3) FCNs are
very efficient for both learning and inference.
Learning. We model the part localization process as a multi-class classification
problem on sparse output spatial positions. Specifically, suppose the output of
the last FCN convolutional layer is of size h× w × d, where h and w are spatial
dimensions and d is the number of channels. We set d = M + 1. Here, M
is the number of object parts and 1 denotes an additional channel to model the
background. To generate corresponding ground-truth labels in the form of feature
maps, units indexed by h×w spatial positions are labeled with their nearest object
part; units that are not close to any of the labeled parts (with an overlap of less
than 0.5 with respect to the receptive field) are labeled as background. In this
way, ground-truth part annotations are transformed into the form of corresponding feature
maps, while in recent works that directly apply FCNs [109, 124], the supervision
information is generated by directly resizing the part ground-truth image.
Another practical problem here is determining the model depth and the input
image size for training the FCN. Generally, layers at later stages carry more
discriminative power and, therefore, are more likely to generate good localization
results; however, their receptive fields are also much larger than those of previous
layers. For example, the receptive field of the inception-4a layer in BN-GoogleNet
has a size of 107×107 compared to the 224×224 input image, which is too large to
model an object part. We propose a simple trick to deal with this problem, namely
upsampling the input images so that the fixed size receptive fields denoting object
parts become relatively smaller compared to the whole object, while still using
later stage layers to guarantee discriminative power. In the proposed architecture,
the input image is upsampled to double the resolution and the inception-4a layer
is adopted to guarantee discrimination.
The localization network is illustrated in Figure 4.5. The input images are
warped and resized into a fixed size of 448 × 448. All layers from the beginning
to the inception-4a layer are cut from the BN-GoogleNet architecture, so the
output size of the inception-4a layer is 28 × 28 × 576. Then, we further introduce
a 1 × 1 convolutional layer with M + 1 outputs, termed conv, for classification.
By adopting a location-preserving softmax that normalizes predictions at each
spatial location of the feature map, the final loss function is a sum of softmax
loss at all 28× 28 positions:
    L = −∑_{h=1}^{28} ∑_{w=1}^{28} log σ(h, w, c),    (4.2)

where

    σ(h, w, c) = exp(f_conv(h, w, c)) / ∑_{c'=0}^{M} exp(f_conv(h, w, c')).

Here, c ∈ {0, 1, ..., M} is the part label of the patch at location (h, w), where
label 0 denotes background, and f_conv(h, w, c) stands for the output of the conv
layer at spatial position (h, w) and channel c.
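A numerically stable numpy sketch of this location-preserving softmax loss, where `scores` plays the role of the conv outputs f_conv (the 28 × 28 × 16 shape in the demo is illustrative):

```python
import numpy as np

def spatial_softmax_loss(scores, labels):
    """scores: (H, W, M+1) conv-layer outputs; labels: (H, W) integer
    part labels (0 = background).  Sums the per-position softmax losses,
    as in Eq. (4.2)."""
    s = scores - scores.max(axis=-1, keepdims=True)      # stability shift
    log_p = s - np.log(np.exp(s).sum(axis=-1, keepdims=True))
    H, W = labels.shape
    rows = np.arange(H)[:, None]
    cols = np.arange(W)[None, :]
    return float(-log_p[rows, cols, labels].sum())       # pick each cell's true-class log-prob

# With uniform (all-zero) scores, every position contributes log(M+1):
loss = spatial_softmax_loss(np.zeros((28, 28, 16)), np.zeros((28, 28), dtype=int))
print(np.isclose(loss, 28 * 28 * np.log(16)))  # True
```

The advanced indexing selects, at every spatial position, the log-probability of that position's ground-truth label, so the sum over the 28 × 28 grid is exactly the loss above.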
Figure 4.6: Network architecture of the proposed Deeper Part-Stacked CNN. The model consists of: (1) a fully convolutional network for part landmark localization; (2) a part stream where multiple parts share the same feature extraction procedure, while being separated by a novel part crop layer given detected part locations; (3) an object stream to capture global information; and (4) a feature fusion layer that takes feature vectors from the part stream and the object stream to produce the final feature representation.
Inference. Inference starts from the output of the learned FCN, i.e., (M + 1)
part-specific heat maps of size 28 × 28, to which we apply a Gaussian kernel G
to remove isolated noise. The final output of the localization network is M
locations in the 28 × 28 conv feature map, each computed as the location with
the maximum response for one object part.
Meanwhile, considering that object parts may be missing in some images due
to varied poses and occlusion, we set a threshold μ such that if the maximum
response of a part is below μ, we simply discard that part's channel in the
classification network for this image. Let g(h, w, c) = σ(h, w, c) ∗ G; the inferred
part locations are given as:

    (h*_c, w*_c) = argmax_{h,w} g(h, w, c)   if g(h*_c, w*_c, c) > μ,
    (h*_c, w*_c) = (−1, −1)                  otherwise.    (4.3)
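A sketch of the inference rule in Eq. (4.3); the 5 × 5 kernel size, σ = 1, and the threshold μ = 0.1 are illustrative choices, not values from the text:

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def smooth(heat, kernel):
    """'Same'-size 2D convolution with zero padding."""
    p = kernel.shape[0] // 2
    padded = np.pad(heat, p)
    out = np.zeros_like(heat)
    for i in range(heat.shape[0]):
        for j in range(heat.shape[1]):
            out[i, j] = (padded[i:i + kernel.shape[0],
                                j:j + kernel.shape[1]] * kernel).sum()
    return out

def infer_parts(softmax_maps, mu=0.1):
    """softmax_maps: (M+1, 28, 28) with channel 0 = background.
    Returns M (h, w) locations, or (-1, -1) for suppressed parts."""
    kernel = gaussian_kernel()
    locs = []
    for c in range(1, softmax_maps.shape[0]):
        g = smooth(softmax_maps[c], kernel)          # g = sigma * G
        h, w = np.unravel_index(np.argmax(g), g.shape)
        locs.append((int(h), int(w)) if g[h, w] > mu else (-1, -1))
    return locs

maps = np.zeros((3, 28, 28))
maps[1, 10, 20] = 1.0          # confident detection for part 1; part 2 missing
print(infer_parts(maps))       # [(10, 20), (-1, -1)]
```

Smoothing before the argmax is what suppresses isolated spurious responses, while the threshold μ implements the missing-part case of Eq. (4.3).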
Figure 4.7: Different strategies for feature fusion: (a) Fully Connected (FC), (b) Scale Sum (SS), (c) Scale Max (SM), and (d) Scale Average Max (SAM).
4.4.2 Classification network
The second stage of the proposed DPS-CNN is a classification network that takes
the inferred part locations as input. As shown in Figure 4.6, it follows a
two-stream architecture with a Part Stream and an Object Stream to capture
semantics from different angles. The outputs of the two streams are fed into a
feature fusion layer followed by a fully connected layer and a softmax layer.
Part stream. The part stream is the core of the proposed DPS-CNN archi-
tecture. To capture object-part-dependent differences between fine-grained cate-
gories, one could train a set of part CNNs, each of which conducts classification
on one part separately, as proposed by Zhang et al. [120]. Although such a method
works well for situations employing two object parts [120], we argue that this
approach is not applicable when the number of object parts is much larger, as in
our case, because of the high time and space complexity.
We introduce two strategies to improve part stream efficiency, the first being
model parameter sharing. Specifically, model parameters of layers before the part
crop layer and inception-4e are shared among all object parts and can be regarded
as a generic part-level feature extractor. This strategy reduces the number of
parameters in the proposed architecture and thus reduces the risk of overfitting.
We also introduce a part crop layer as a computational sharing strategy. The
layer ensures that the feature extraction procedure of all parts only requires one
pass through the convolutional layers.
After performing the shared feature extraction procedure, the computation
of each object part is then partitioned through a part crop layer to model part-
specific classification cues. As shown in Figure 4.6, the input for the part crop
layer is a set of feature maps (the output of inception-4a layer in our architec-
ture) and the predicted part locations from the previous localization network,
which also reside in inception-4a feature maps. For each part, the part crop layer
extracts a local neighborhood centered on the detected part location. Features
outside the cropped region are simply discarded. In practice, we crop l×h neigh-
borhood regions from the 28×28 inception-4a feature maps. The cropped size of
feature regions may have an impact on recognition performance, because larger
crops will result in redundancy when extracting multiple part features, while
smaller crops cannot guarantee rich enough information. For simplicity, we use
l = h = 7 in this chapter to ensure that the resulting receptive field is large
enough to cover the entire part.
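A minimal sketch of the part crop layer's forward pass follows. Array layout, zero-filling for absent parts, and the border-clamping convention are assumptions for illustration.

```python
import numpy as np

def crop_part_features(feature_maps, part_locations, l=7, h=7):
    """Crop an l x h neighborhood around each detected part location.

    feature_maps: (C, H, W) output of the shared convolutional trunk
    (e.g. the 28 x 28 inception-4a maps); part_locations: list of
    (row, col) in feature-map coordinates, with (-1, -1) marking
    absent parts. Returns one (C, l, h) crop per part (zeros if absent).
    """
    C, H, W = feature_maps.shape
    crops = []
    for (r, c) in part_locations:
        crop = np.zeros((C, l, h), dtype=feature_maps.dtype)
        if (r, c) != (-1, -1):
            # Clamp the window so the crop stays inside the feature maps
            r0 = min(max(r - l // 2, 0), H - l)
            c0 = min(max(c - h // 2, 0), W - h)
            crop = feature_maps[:, r0:r0 + l, c0:c0 + h].copy()
        crops.append(crop)
    return crops
```

Because all crops read from the same shared feature maps, feature extraction for every part costs only a single pass through the convolutional trunk.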
Object stream. The object stream captures object-level semantics for fine-
grained recognition. It follows the general architecture of BN-GoogleNet, in
which the input of the network is a 448 × 448 RGB image and the output of the
inception-5b layer is a set of 14 × 14 feature maps. We therefore use 14 × 14
average pooling instead of the 7 × 7 pooling in the original setting.
The design of the two-stream architecture in DPS-CNN is analogous to the
famous Deformable Part-based Models [24], in which object-level features are cap-
tured through a root filter in a coarser scale, while detailed part-level information
is modeled by several part filters at a finer scale. We find it critical to measure
visual cues from multiple semantic levels in an object recognition algorithm.
We conduct standard gradient descent to train the classification network.
It should be noted, however, that the gradient of each element $X_{i,j}$ in the
inception-4a feature maps is calculated by the following equation:

$$\frac{\partial E}{\partial X_{i,j}} = \sum_{c=1}^{M} \phi\!\left(\frac{\partial E}{\partial X^c_{i,j}}\right), \tag{4.4}$$

where E is the loss function, $X^c_{i,j}$ is the feature map cropped by part c, and

$$\phi\!\left(\frac{\partial E}{\partial X^c_{i,j}}\right) = \begin{cases} \dfrac{\partial E}{\partial X^c_{i,j}} & \text{if } X_{i,j} \text{ corresponds to } X^c_{i,j},\\[4pt] 0 & \text{otherwise.} \end{cases} \tag{4.5}$$
Specifically, the gradient of each cropped part feature map (at 7 × 7 spatial
resolution) is projected back to the original size of the inception-4a feature
maps (28 × 28) according to the respective part location, and the contributions
are then summed. The computation of all other layers simply follows the
standard gradient rules. Note that the proposed DPS-CNN is implemented as a
two-stage framework, i.e., after training the FCN, the weights of the
localization network are fixed while training the classification network.
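The backward pass of the part crop layer (Eqs. 4.4 and 4.5) can be sketched as a scatter-add. This is a numpy illustration under assumed shapes and the same clamping convention as the forward crop, not the Caffe implementation.

```python
import numpy as np

def part_crop_backward(grad_crops, part_locations, H=28, W=28, l=7, h=7):
    """Backward pass of the part crop layer (Eqs. 4.4 and 4.5 sketch).

    Each cropped gradient (C, l, h) is projected back to its location in
    the full-size (C, H, W) feature maps; positions outside a crop get
    zero gradient, and contributions from all parts are summed.
    """
    C = grad_crops[0].shape[0]
    grad_full = np.zeros((C, H, W), dtype=grad_crops[0].dtype)
    for g, (r, c) in zip(grad_crops, part_locations):
        if (r, c) == (-1, -1):
            continue  # absent parts contribute no gradient
        r0 = min(max(r - l // 2, 0), H - l)
        c0 = min(max(c - h // 2, 0), W - h)
        grad_full[:, r0:r0 + l, c0:c0 + h] += g
    return grad_full
```

Overlapping crops simply accumulate, which is exactly the sum over c in Eq. 4.4.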
Feature Fusion
The most common method [50, 120] for combining all part-level and object-level
features is to simply concatenate all the feature vectors, as illustrated in
Figure 4.7(a). However, this approach may cause feature redundancy and suffers
from high dimensionality when the number of parts becomes large. To effectively
utilize all part- and object-level features, we present three options for
learning fusion features: scale sum (SS), scale max (SM), and scale mean-max
(SMM), as illustrated in Figure 4.7(b), Figure 4.7(c), and Figure 4.7(d),
respectively. All
three methods include the shared process of placing a scale layer on top of each
branch. Nevertheless, as indicated by their names, the scale sum feature is the
element-wise sum of all output branches, the scale max feature is generated by
an element-wise maximum operation, while the scale average-max feature is the
concatenation of element-wise mean and max features. In our previous work [37],
based on the standard CaffeNet architecture, each branch from the part stream
and the object stream was connected to an independent fc6 layer to encourage
feature diversity, and the final fusion feature was the sum of all the outputs
of these fc6 layers. As this fusion process requires M + 1 times more model
parameters than the original fc6 layer in CaffeNet and consequently incurs a
huge memory cost, a 1 × 1 convolutional layer is used for dimensionality
reduction. Here
we redesign this component for simplicity and to improve performance. First, a
shared inception module is placed on top of the cropped part region to generate
higher level features. Also, a scale layer follows each branch feature to encour-
age diversity between parts. Furthermore, the scale layer has fewer parameters
than the fully connected layer and, therefore, reduces the risk of overfitting and
decreases the model storage requirements.
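The three fusion options can be sketched in a few lines. Feature dimensions and the learned per-branch scale vectors are assumptions; in the actual network the scales are trainable parameters of the scale layers.

```python
import numpy as np

def fuse_features(branches, scales, mode="smm"):
    """Fuse per-branch features after per-branch scale layers.

    branches: list of (D,) feature vectors (part and object branches);
    scales: one scale vector per branch. 'ss' = element-wise sum,
    'sm' = element-wise max, 'smm' = concat of mean and max (2D-d).
    """
    scaled = np.stack([s * b for s, b in zip(scales, branches)])  # (N, D)
    if mode == "ss":
        return scaled.sum(axis=0)
    if mode == "sm":
        return scaled.max(axis=0)
    if mode == "smm":
        return np.concatenate([scaled.mean(axis=0), scaled.max(axis=0)])
    raise ValueError(mode)
```

Note that SS and SM keep the fused feature at D dimensions regardless of the number of parts, while SMM doubles it to 2D; all three avoid the (N+1)·D growth of plain concatenation.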
4.5 Experiments
In this section we present experimental results and a thorough analysis of the
proposed methods. Specifically, we evaluate the performance from four different
aspects: localization accuracy, classification accuracy, inference efficiency, and
model interpretation.
4.5.1 Dataset and implementation details
Experiments are conducted on the widely used fine-grained classification bench-
mark, the Caltech-UCSD Birds dataset (CUB-200-2011) [105]. The dataset contains
200 bird categories with roughly 30 training images per category. In the
training phase we adopt the strong supervision available in the dataset, i.e.,
we employ 2D keypoint annotations of altogether M = 15 object parts together
with image-level labels and object bounding boxes.
The labeled parts¹ indicate the places people usually focus on when asked to
classify fine-grained categories; they thus provide valuable information for
building human-understandable systems.
Both the Part-Stacked CNN and Deeper Part-Stacked CNN architectures are
implemented using the open-source package Caffe [40]. Specifically, input
images are warped to a fixed size of 512 × 512, randomly cropped to 448 × 448,
and then fed as input into the localization network and the part stream of the
classification network.
4.5.2 Localization results for PSCNN
As the localization results in our method are delivered directly to the
classification network at the feature-map level, we do not aim for accurate
pixel-level keypoint localization but instead focus on a coarser correctness
measure. Localization accuracy is quantitatively assessed using APK (Average
Precision of Keypoints) [117]. Following [58], we consider a keypoint to be
correctly predicted if the prediction lies within a Euclidean distance of α
times the maximum of the bounding box width and height from the ground truth.
We set α = 0.1 in all the analysis below.
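The correctness criterion is a one-liner; the function name and argument layout are illustrative only.

```python
import math

def is_correct_pck(pred, gt, box_w, box_h, alpha=0.1):
    """A keypoint prediction is correct if it lies within
    alpha * max(bounding-box width, height) of the ground truth."""
    tol = alpha * max(box_w, box_h)
    return math.dist(pred, gt) <= tol
```

APK is then the average precision computed over all predictions ranked by confidence, with this test deciding which detections count as true positives.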
The adopted FCN architecture in PS-CNN achieves 86.6% APK on the test set of
CUB-200-2011 for the 15 object parts. Specifically, the additional 1 × 1
convolutional layer and the employed Gaussian smoothing
¹The 15 object parts are: back, beak, belly, breast, crown, forehead, left eye, left leg, left wing, nape, right eye, right leg, right wing, tail, and throat.
Figure 4.8: Typical localization results on the CUB-200-2011 test set. We show 6 of the 15 detected parts: beak (red), belly (green), crown (blue), right eye (yellow), right leg (magenta), and tail (cyan). Best viewed in color.
part  throat  beak   crown  forehead  right eye  nape   left eye  back
APK   0.908   0.894  0.894  0.885     0.861      0.857  0.850     0.807

part  breast  belly  right leg  tail   left leg  right wing  left wing  overall
APK   0.799   0.794  0.775      0.760  0.750     0.678       0.670      0.866

Table 4.1: APK for each object part in the CUB-200-2011 test set, in descending order.
kernel deliver 1.5% and 2% improvements, respectively, over the results
obtained using the standard five convolutional layers of AlexNet. To further
understand the performance gains from our network designs, we also show
experimental comparisons of different model architectures in Table 4.2 using
the following evaluation metrics.
a) Mean Precision of Key points over images (MPK).
b) Mean Recall of Key points over images (MRK).
c) Average Precision of Key points (APK).
Model architecture               MPK   MRK   APK

conv5+cls                        70.0  80.6  83.5
conv5+conv6(256)+cls             71.3  81.8  84.7
conv5+conv6(512)+cls             71.5  81.9  84.8
conv5+conv6(512)+cls+gaussian    80.0  83.8  86.6

Table 4.2: Comparison of different model architectures on localization results. "conv5" stands for the first 5 convolutional layers in CaffeNet; "conv6(256)" stands for the additional 1 × 1 convolutional layer with 256 output channels; "cls" denotes the classification layer with M + 1 output channels; "gaussian" represents a Gaussian kernel for smoothing.
Furthermore, we present per-part APKs in Table 4.1. An interesting phenomenon
is that parts residing near the head of the bird tend to be located more
accurately, since the bird's head has a relatively stable structure with fewer
deformations and a lower probability of being occluded. In contrast, highly
deformable parts such as wings and legs receive lower APK values. Figure 4.8
shows typical localization results of the proposed method.
4.5.3 Classification results for PSCNN
We begin the analysis of classification results with a study of the
discriminative power of each object part. Each time, we select one object part
as the input and discard the computation of all other parts. Different parts
reveal significantly different classification results: the most discriminative
part, crown, by itself achieves a quite impressive accuracy of 57%, while the
lowest accuracy is only 10% for the part beak. Therefore, to obtain better
classification results, it may be beneficial to find a rational combination or
ordering of object parts instead of directly running the experiments on all
parts together.
We, therefore, introduce a strategy that incrementally adds object parts to
BBox only +2 part +4 part +8 part +15 part
69.08 73.72 74.84 76.63 76.41
Table 4.4: The effect of increasing the number of object parts on the classificationaccuracy.
the whole framework and iteratively trains the model. Specifically, starting
from a model trained with bounding-box supervision only, which is also the
baseline of the proposed method, we iteratively insert object parts into the
framework and re-finetune the PS-CNN model. The number of parts used grows
exponentially, i.e., in the i-th iteration, 2^i parts are selected and
inserted. When starting from an initialized model with relatively high
performance, introducing a new object part into the framework does not require
running a brand new classification procedure based on this particular part
alone; ideally, only the classification of highly confusing categories that
can be distinguished by the new part will be impacted and amended. As a result,
this procedure overcomes the drawback caused by object parts with low
discriminative power. In our implementation, the order of part inclusion is
determined by discriminative power, measured by the classification accuracy
obtained using each part alone (see Supplementary for details). Table 4.4
reveals that as the number of object parts increases from 0 to 8, the
classification accuracy improves gradually and then saturates. Further
increasing the number of parts does not lead to better accuracy; however, it
does provide more resources for performing explicit model interpretation.
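The exponential insertion schedule can be made concrete. The function name and the cap at the total number of annotated parts are assumptions matching the cumulative counts in Table 4.4 (2, 4, 8, 15).

```python
def part_counts(total_parts=15, max_iters=4):
    """Cumulative number of parts used at iteration i: 2**i, capped at
    the total number of annotated parts (2, 4, 8, 15 for M = 15)."""
    return [min(2 ** i, total_parts) for i in range(1, max_iters + 1)]
```

At each iteration, the parts added are the next-most discriminative ones according to their single-part classification accuracy.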
Table 4.5 shows the performance comparison between PS-CNN and existing
fine-grained recognition methods. Since the CNN architecture has a large
impact on recognition performance, for a fair comparison we only compare
results reported on the standard seven-layer architecture. Deeper models would
undoubtedly lead to better accuracy but also to lower efficiency. The complete
PS-CNN model with a bounding box and 15 object parts achieves 76% accuracy,
which is comparable with part-based R-CNN [120], while being slightly lower
than several recent state-of-the-art methods [50, 52, 84] due to the
effectiveness-efficiency tradeoff. In particular, our model is over two orders of
magnitude faster than [120], requiring only 0.05 seconds to perform end-to-end
classification of a test image. This number is quite encouraging, especially
considering the number of parts used in the proposed method. This efficiency
makes it possible to run the proposed method in real time, leading to potential
applications in the video domain.
Method                  Train Anno.  Test Anno.  Acc.

Constellation [86]      n/a          n/a         68.5
Attention [111]         n/a          n/a         69.7
Bilinear-CNN [52]       n/a          n/a         74.2
Weak FGVC [127]         n/a          n/a         75.0
CNNaug [76]             BBox         BBox        61.8
Alignment [28]          BBox         BBox        67.0
No parts [46]           BBox         BBox        74.9
Bilinear-CNN [52]       BBox         BBox        80.4
Part R-CNN [120]        BBox+Parts   n/a         73.9
PoseNorm CNN [11]       BBox+Parts   n/a         75.7
POOF [6]                BBox+Parts   BBox        56.8
DPD+DeCAF [22]          BBox+Parts   BBox        65.0
Deep LAC [50]           BBox+Parts   BBox        80.2
Multi-proposal [84]     BBox+Parts   BBox        80.3
Part R-CNN [120]        BBox+Parts   BBox        76.4
PS-CNN                  BBox+Parts   BBox        76.6

Table 4.5: Comparison with state-of-the-art methods on the CUB-200-2011 dataset. To conduct fair comparisons, for all methods using deep features we report their results on the standard seven-layer architecture (mostly AlexNet, except VGG-M for [52]) where possible. Note that our method achieves results comparable with the state of the art while running in real time.
4.5.4 Localization Results for DPSCNN
Following [58], we consider a key point to be correctly predicted if the predic-
tion lies within a Euclidean distance of α times the maximum of the input width
and height compared to the ground truth. Localization results are reported on
multiple values of α ∈ {0.1, 0.05, 0.02} in the analysis below. The value α in the
PCK metric is introduced to measure the error tolerance in keypoint localization.
To investigate the effect of the selected layer for keypoint localization, we
perform experiments using the inception-4a, inception-4b, inception-4c, and
inception-4d layers as part detector layers. As shown in Table 4.7, with
α = 0.1, a higher layer with a larger receptive field tends to achieve better
localization performance than a lower layer. This is mainly because larger
receptive fields are crucial for capturing spatial relationships between parts,
which improves performance (see Table 4.6). In contrast, for α = 0.05 or 0.02,
performance decreases at deeper layers. One possible explanation is that
although higher layers capture better semantic information about the object,
they lose more detailed spatial information. To evaluate the effectiveness of
our keypoint localization approach, we also compare it with recently published
works [37, 118, 124] providing PCK evaluation results on CUB-200-2011, along
with experimental results using a more consistent evaluation metric, the
average precision of keypoints (APK), which correctly penalizes both missed and
false-positive detections [117]. As can be seen from Table 4.7, our method
outperforms existing techniques at various α settings in terms of PCK. Most
strikingly, our approach outperforms the compared methods by large margins when
using small α values.
For the keypoint localization task, we follow the pipeline of proposal-based
object detection methods: centers of receptive fields corresponding to a
certain layer are first regarded as candidate points and then forwarded to a
fully convolutional network for further classification. As in object detection
with proposals, whether the selected candidate points provide good coverage of
the pixels of interest in the test image plays a crucial role in keypoint
localization, since missed keypoints cannot be recovered in subsequent
classification. Thus, we first evaluate the candidate point sampling method.
The evaluation is based on the PCK metric [117], in which the error tolerance
is normalized by the input image size. For consistency with the evaluation of
keypoint localization, a ground-truth point is considered recalled if there
exists a candidate point matching it under the PCK metric. Table 4.9 shows the
localization recall of candidate points selected by inception-4a with different
α values (0.05, 0.02, and 0.01). As expected, candidate points sampled from the
inception-4a layer have good coverage of the ground truth under the PCK metric
with α > 0.02. However, the recall drops dramatically when using α = 0.01.
This is mainly because of the large stride (16) of the inception-4a layer,
which results in the
Table 4.6: Receptive field size of different layers.

Layer          Rec. Field
Inception-4a   107 × 107
Inception-4b   139 × 139
Inception-4c   171 × 171
Inception-4d   204 × 204
Table 4.7: Comparison of per-part PCK (%) and overall APK (%) on CUB-200-2011. The abbreviated part names from left to right are: Back, Beak, Belly, Breast, Crown, Forehead, Left Eye, Left Leg, Left Wing, Nape, Right Eye, Right Leg, Right Wing, Tail, and Throat.

α     Method    Ba   Bk   Be   Br   Cr   Fh   Le   Ll   Lw   Na   Re   Rl   Rw   Ta   Th   Avg  APK

0.1   [37]      80.7 89.4 79.4 79.9 89.4 88.5 85.0 75.0 67.0 85.7 86.1 77.5 67.8 76.0 90.8 81.2 86.6
      [124]     85.6 94.9 81.9 84.5 94.8 96.0 95.7 64.6 67.8 90.7 93.8 64.9 69.3 74.7 94.5 83.6 -
      [118]     94.0 82.5 92.2 93.0 92.2 91.5 93.3 69.7 68.1 86.0 93.8 74.2 68.9 77.4 93.4 84.7 -
      Ours(4a)  82.7 94.1 85.3 87.8 95.2 93.3 88.6 75.5 75.9 92.0 89.5 76.6 75.9 67.4 94.7 84.9 89.1
      Ours(4b)  87.4 93.6 87.4 88.9 95.2 93.7 88.3 73.3 77.6 93.4 88.9 76.3 79.0 70.5 94.5 85.9 88.9
      Ours(4c)  89.0 95.1 91.5 92.6 95.7 94.7 90.3 78.5 82.3 94.4 91.0 73.2 81.9 78.4 95.7 88.3 90.9
      Ours(4d)  89.0 95.0 92.2 93.2 95.2 94.2 90.5 73.2 81.5 94.4 91.6 75.5 82.3 83.2 95.8 88.5 91.2

0.05  [37]      48.8 63.7 44.5 50.3 50.2 43.7 80.0 44.8 42.7 60.1 59.4 46.5 39.8 46.8 71.9 52.9 62.7
      [124]     46.8 62.5 40.7 45.1 59.8 63.7 66.3 33.7 31.7 54.3 63.8 36.2 33.3 39.6 56.9 49.0 -
      [118]     66.4 49.2 56.4 60.4 61.0 60.0 66.9 32.3 35.8 53.1 66.3 35.0 37.1 40.9 65.9 52.4 -
      Ours(4a)  70.6 89.5 69.5 75.0 89.0 87.8 87.1 58.5 57.6 84.6 87.8 59.6 60.2 56.3 90.0 74.9 80.4
      Ours(4b)  69.2 79.4 69.0 74.5 73.2 72.3 85.7 53.3 58.3 83.7 86.0 55.5 60.1 59.0 86.5 74.5 71.1
      Ours(4c)  62.3 57.1 67.6 72.2 49.1 47.0 84.6 49.7 57.6 79.3 84.9 44.1 56.9 63.7 82.6 63.0 67.9
      Ours(4d)  42.3 27.5 59.7 60.6 21.3 23.3 82.2 33.1 49.6 65.6 82.4 37.4 47.5 66.7 69.4 51.3 54.5

0.02  [37]      11.1 16.9 9.1  11.2 5.2  4.1  40.4 9.4  10.8 14.6 9.9  11.9 9.6  11.2 22.3 13.2 13.3
      [124]     9.4  12.7 8.2  12.2 13.2 11.3 7.8  6.7  11.5 12.5 7.3  6.2  8.2  11.8 56.9 13.1 -
      [118]     18.8 12.8 14.2 15.9 15.9 16.2 20.3 7.1  8.3  13.8 19.7 7.8  9.6  9.6  18.3 13.8 -
      Ours(4a)  24.9 31.0 23.0 28.3 25.1 26.6 44.8 19.6 17.4 38.4 46.9 20.9 20.7 22.0 37.5 28.5 17.2
      Ours(4b)  19.7 15.8 21.6 24.0 9.1  8.1  40.7 16.0 16.8 32.6 43.1 16.7 17.7 23.6 29.8 22.4 13.5
      Ours(4c)  12.5 5.9  17.9 17.9 2.6  3.0  41.4 12.0 15.0 22.2 41.4 8.9  14.9 24.0 23.1 17.5 11.8
      Ours(4d)  6.4  1.9  14.1 11.8 1.0  2.1  36.7 4.9  10.9 15.5 38.5 5.9  10.4 24.0 17.0 13.4 9.3
distance between two closest candidate points being 16 pixels, while an input
size of 448 with α = 0.01 requires a candidate point to lie within 4.48 pixels
of the ground truth.
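The stride arithmetic above can be checked directly. The grid offset of half a stride is an assumption about where receptive-field centers fall; the conclusion only depends on the 16-pixel spacing.

```python
def candidate_centers(input_size=448, stride=16, offset=8):
    """1-D grid of receptive-field centers for a stride-16 layer,
    used as candidate keypoint coordinates along one axis."""
    return [offset + i * stride for i in range(input_size // stride)]

# Worst case: a ground-truth point midway between two grid lines is
# stride / 2 = 8 px from the nearest candidate along each axis, which
# exceeds the alpha = 0.01 tolerance of 0.01 * 448 = 4.48 px.
```

This is why recall collapses at α = 0.01 but stays high at α ≥ 0.02 (tolerance 8.96 px).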
The part localization architecture adopted in DPS-CNN achieves a best average
[email protected] of 88.5% on the CUB-200-2011 test set for the 15 object parts.
Specifically, the employed Gaussian smoothing kernel delivers a 2% improvement
over methods that use standard convolutional layers in BN-GoogleNet. Figure 4.9
shows typical localization results of the proposed method.
Table 4.8: Classification accuracy (%) on CUB-200-2011 using each single object part as input. The abbreviated part names from left to right are: Back, Beak, Belly, Breast, Crown, Forehead, Left Eye, Left Leg, Left Wing, Nape, Right Eye, Right Leg, Right Wing, Tail, and Throat.

Part          Ba   Bk   Be   Br   Cr   Fh   Le   Ll   Lw   Na   Re   Rl   Rw   Ta   Th
Accuracy(%)   47.9 63.7 43.9 56.8 66.8 66.1 36.6 30.8 30.4 64.8 36.1 29.2 29.7 20.0 68.7
Figure 4.9: Typical localization results on the CUB-200-2011 test set, showing ground-truth and predicted locations for all 15 parts (back, beak, belly, breast, crown, forehead, left eye, left leg, left wing, nape, right eye, right leg, right wing, tail, and throat). Best viewed in color.
Figure 4.10: Feature map visualization of the Inception-4a layer. Each example image is followed by three rows of the top six scoring feature maps, which are from the part stream, object stream, and baseline BN-Inception network, respectively. The red dashed box indicates a failure case of visualization using the model learned by our approach.
4.5.5 Classification results for DPSCNN
We begin our classification analysis by studying the discriminative power of
each object part. Each time, we select one object part as the input and discard
the computation of all other parts. As shown in Table 4.8, different parts
produce significantly different classification results. The most discriminative
part, "Throat", achieves a quite impressive accuracy of 68.7%, while the lowest
accuracy is 20.0% for the part "Tail". Therefore, to improve classification, it
may be beneficial to find a rational combination or ordering of object parts
instead of directly running the experiment on all parts together. More
interestingly, comparing the results in Table 4.7 and Table 4.8, it can be seen
that parts located more accurately, such as Throat, Nape, Forehead, and Beak,
tend to achieve better performance in the recognition task, while parts with
poor localization accuracy, such as Tail and Left Leg, perform worse. This
observation supports the hypothesis that a more discriminative part is easier
to locate in the context of fine-grained categorization, and vice versa.
To evaluate our framework's overall performance, we first train a baseline
model with 81.56% accuracy using a BN-Inception architecture [38] pre-trained
on ImageNet [79]. By stacking certain part features and applying our proposed
fusion method, our framework improves this to 85.12%. To evaluate our proposed
feature fusion method, we then train four DPS-CNN models with the same
experimental settings (maximum iterations and learning rate) but different
feature fusion methods. The results shown in Table 4.10 (Rows 2-5) demonstrate
that SMM fusion achieves the best performance, outperforming the FC method by
1.69%.
To investigate which parts should be selected in our learning framework, we
conduct the following experiments guided by two principles: feature
discrimination and feature diversity. Here we consider parts with higher
accuracy in Table 4.8 to be more discriminative, and combinations of parts with
distant locations to be more diverse. We first select the top 6 parts with the
highest accuracy from Table 4.8 by applying only the discriminative principle,
then choose 3, 5, 9, and 15 parts respectively by taking both principles into
account. Experimental results are shown in Table 4.10 (Rows 6-10), where we
observe that
Table 4.9: Localization recall of candidate points selected by the inception-4a layer with different α values. The abbreviated part names from left to right are: Back, Beak, Belly, Breast, Crown, Forehead, Left Eye, Left Leg, Left Wing, Nape, Right Eye, Right Leg, Right Wing, Tail, and Throat.

α     Ba   Bk   Be   Br   Cr   Fh   Le   Ll   Lw   Na   Re   Rl   Rw   Ta   Th   Avg
0.05  100  100  100  100  100  100  100  100  100  100  100  100  100  100  100  100
0.02  90.8 89.8 90.8 90.4 90.9 91.4 90.4 90.4 90.0 90.7 90.3 89.9 90.3 90.5 90.3 90.5
0.01  26.8 26.3 9.1  11.2 5.2  4.1  40.4 9.4  10.8 14.6 9.9  11.9 9.6  11.2 22.3 13.2
Table 4.10: Comparison of different settings of our approach on CUB-200-2011.

Row  Setting                  Acc(%)
1    Object Only (Baseline)   81.56
2    5-parts + FC             81.86
3    5-parts + SS             83.06
4    5-parts + SM             83.41
5    5-parts + SMM            83.55
6    6-parts + SMM            84.12
7    3-parts + SMM            84.29
8    5-parts + SMM            84.91
9    9-parts + SMM            85.12
10   15-parts + SMM           84.45
increasing the number of parts brings slight improvements. However, all of
these settings perform better than using the six most discriminative parts.
This is mainly because most of those parts are adjacent to each other and
therefore fail to produce diverse features in our framework. It should also be
noted that using all part features does not guarantee the best performance; on
the contrary, it results in lower accuracy. This finding shows that the feature
redundancy caused by appending an excessive number of parts may degrade
accuracy, and suggests that an appropriate strategy for integrating multiple
parts is critical.
We also present a performance comparison between DPS-CNN and existing
fine-grained recognition methods. As can be seen in Table 4.11, our approach,
using only keypoint annotations during training, achieves 85.12% accuracy,
which is comparable with the state-of-the-art method [52] that achieves 85.10%
using bounding boxes in both training and testing. Moreover, our method is
interpretable and faster
Method                             Train Anno.  Test Anno.  Pre-trained Model  FPS  Acc(%)

Part-Stacked CNN [37]              BBox+Parts   BBox        AlexNet            20   76.62
Deep LAC [50]                      BBox+Parts   BBox        AlexNet            -    80.26
Part R-CNN [120]                   BBox+Parts   BBox        AlexNet            -    76.37
SPDA-CNN [119]                     BBox+Parts   BBox        VGG16              -    84.55
SPDA-CNN [119] + ensemble          BBox+Parts   BBox        VGG16              -    85.14
Part R-CNN [120] without BBox      BBox+Parts   n/a         AlexNet            -    73.89
PoseNorm CNN [11]                  BBox+Parts   n/a         AlexNet            -    75.70
Bilinear-CNN (M+D+BBox) [52]       BBox         BBox        VGG16+VGGM         8    85.10
Bilinear-CNN (M+D) [52]            n/a          n/a         VGG16+VGGM         8    84.10
Constellation-CNN [86]             n/a          n/a         VGG19              -    84.10
Spatial Transformer CNN [39]       n/a          n/a         Inception+BN       -    84.10
Two-Level [111]                    n/a          n/a         VGG16              -    77.90
Co-Segmentation [46]               BBox         BBox        VGG19              -    82.80
DPS-CNN with 9 parts               Parts        n/a         Inception+BN       32   85.12
DPS-CNN ensemble with 4 models     Parts        n/a         Inception+BN       8    86.56

Table 4.11: Comparison with state-of-the-art methods on the CUB-200-2011 dataset.
- the entire forward pass of DPS-CNN runs at 32 frames/sec (NVIDIA TitanX),
while B-CNN [D,M] [52] runs at 8 frames/sec (NVIDIA K40)¹. In particular, our
method is much faster than proposal-based methods such as [120] and [119],
which require multiple network forward propagations for proposal evaluation,
whereas part detection and feature extraction are accomplished efficiently by a
single forward pass in our approach. In addition, we combine four models
stemming from integrating different parts (listed in Table 4.10, Rows 7-10) to
form an ensemble, which achieves 86.56% accuracy on CUB-200-2011.
To understand what features are learned in DPS-CNN, we use the aforementioned
five-part model and compare its feature map visualization with that of a
BN-Inception model fine-tuned on CUB-200-2011. Specifically, we pick the top
six scoring feature maps of the Inception-4a layer for visualization, where the
score is the sum over each feature map. As shown in Figure 4.10, each example
image from the test set is followed by three rows of feature maps, which from
top to bottom are selected from the part stream, object stream, and
BN-Inception baseline network, respectively. Interestingly, our part stream has
learned feature maps that appear more intuitive than those learned by the other
two methods. Specifically, it yields more focused and cleaner patterns
¹Note that the computational power of a TitanX is around 1.5 times that of a K40.
which tend to be highly activated by the network. Moreover, we observe that the
object stream and baseline network are more likely to activate filters with
extremely high-frequency details at the expense of extra noise, while the part
stream tends to capture a mixture of low- and mid-frequency information. The
red dashed box in Figure 4.10 indicates a failure example, in which both our
part stream and object stream fail to learn useful features. This may be caused
by our part localization network failing to locate the Crown and Left Leg
parts, because the branch in this image looks similar to bird legs and another,
occluded bird also affects locating the Crown part.
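The feature-map ranking used for the visualization above (score = sum over each map, keep the top six) can be sketched as follows; the function name and array layout are illustrative.

```python
import numpy as np

def top_scoring_maps(feature_maps, k=6):
    """Pick the k feature maps with the largest total activation
    (score = sum over each map), as used for visualization.

    feature_maps: (C, H, W) activations of one layer for one image.
    Returns channel indices, highest-scoring first.
    """
    scores = feature_maps.sum(axis=(1, 2))   # one score per channel
    return np.argsort(scores)[::-1][:k]
```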
4.5.6 Model interpretation
One of the most prominent features of the DPS-CNN method is that it can produce
human-understandable interpretation manuals for fine-grained recognition. Here
we directly borrow the idea from [37] for interpretation using the proposed
method.

Different from [6], which directly conducted one-on-one classification on
object parts, the interpretation process of the proposed method is conducted
more indirectly. Since using each object part alone does not produce convincing
classification results, we perform the interpretation analysis on a combination
of bounding-box supervision and each single object part. The analysis is
performed in two ways: a "one-versus-rest" comparison to find the most
discriminative part for separating a subcategory from all other classes, and a
"one-versus-one" comparison to obtain the classification criteria
distinguishing a subcategory from its most similar classes.
• The "one-versus-rest" manual for an object category k. For every part p,
we compute the sum of prediction scores over the category's positive samples.
The most discriminative part is then the one with the largest accumulated
score:

$$p_k^* = \arg\max_p \sum_{i:\, y_i = k} S^{(p)}_{ip}. \tag{4.6}$$
• The "one-versus-one" manual is obtained by finding the part that yields the
largest difference in prediction scores between two categories k and l. We
first take the two corresponding rows of the score matrix S and re-normalize
them using the binary classification criterion to obtain S′. The most
discriminative part is then given as:

$$p_{k \to l}^* = \arg\max_p \left( \sum_{i:\, y_i = k} S'^{(p)}_{ip} + \sum_{j:\, y_j = l} S'^{(p)}_{jp} \right). \tag{4.7}$$
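The one-versus-rest rule of Eq. 4.6 reduces to a masked column sum; a minimal sketch, assuming a (samples × parts) layout for the score matrix S:

```python
import numpy as np

def most_discriminative_part(S, labels, k):
    """One-versus-rest (Eq. 4.6 sketch): for category k, return the part
    whose prediction scores, summed over k's positive samples, are largest.

    S: (num_samples, num_parts) score matrix; labels: (num_samples,).
    """
    mask = labels == k
    return int(np.argmax(S[mask].sum(axis=0)))
```

The one-versus-one rule of Eq. 4.7 is analogous, applied to the re-normalized two-row matrix S′ with the two category masks summed together.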
The model interpretation routine is demonstrated in Figure 4.11. When a test
image is presented, the proposed method first conducts object classification
using the DPS-CNN architecture. The predicted category is presented together
with a set of images in the dataset that are closest to the test image
according to the feature vector of each part. Besides the classification
result, the proposed method also presents the classification criteria that
distinguish the predicted category from its most similar neighboring classes
based on object parts. Again, we use part features after part cropping to
retrieve the nearest-neighbor part patches of the input test image. This
procedure provides an intuitive visual guide for distinguishing fine-grained
categories.
4.6 Conclusion
In this chapter, we propose two CNN structures for fine-grained recognition:
Part-Stacked CNN (PS-CNN) and Deeper Part-Stacked CNN (DPS-CNN). PS-CNN uses a
simple structure for efficient inference, while DPS-CNN uses deeper layers for
higher accuracy. Both methods exploit detailed part-level supervision, in which
object parts are first located by a localization network and the resulting part
features are then processed by a two-stream classification system that
explicitly captures object- and part-level information. We also present a new
feature fusion strategy that effectively combines part and object stream
features. Experiments on CUB-200-2011 demonstrate the effectiveness and
efficiency of our systems. We also present human-understandable interpretations
of the proposed methods, which can be used as a visual field guide for studying
fine-grained categorization.
It is also worth noting that our methods apply to fine-grained visual
categorization with strong supervision and can be easily generalized to various
applications, including:
a) Discarding the requirement for strong supervision. Instead of introducing
manually labeled part annotations to generate human-understandable vi-
sual guides, one can also exploit unsupervised part discovery methods [46]
to define object parts automatically, which requires far less human labeling
effort.
b) Attribute learning. The application of our approaches is not restricted to
   FGVC. For instance, online shopping [60] could benefit from clothing
   attribute analysis of local parts provided by our methods.
c) Context-based CNN. The role of local parts in our method is interchange-
able with global contexts, in particular for objects that are small and have
no apparent object parts such as volleyballs or tennis balls.
Figure 4.11: Example of the prediction manual generated by the proposed approach. Given a test image, the system reports its predicted class label with some typical exemplar images. Part-based comparison criteria between the predicted class and its most similar classes are shown in the right part of the image. The number in brackets shows the confidence of classifying the two categories by introducing a specific part. We present the top three object parts for each pair of comparison. For each of the parts, three part-center-cropped patches are shown for the predicted class (upper rows) and the compared class (lower rows), respectively.
Chapter 5
Conclusions
Keypoint localization is considered a fundamental step in image understanding.
Many important tasks, such as object detection, object recognition, and pose
estimation, can greatly benefit from this technique. The major challenges in
keypoint localization include highly variable appearance, occlusion, high
computational complexity, and insufficient annotated data. To improve
localization accuracy and reduce computational cost, Chapter 2 proposes
hierarchically supervised nets (HSNs), a method that imposes hierarchical
supervision within deep convolutional neural networks (CNNs). We also explore
the problem of insufficient data annotation for keypoint localization in
Chapter 3. Finally, Chapter 4 explores the effectiveness of part localization
techniques in addressing the problem of fine-grained visual categorization.
Existing works mainly perform object detection and keypoint localization in
two stages. However, these two tasks can complement each other, so learning
bounding box regression and keypoint localization jointly is a valuable
direction for future work. Another future direction is training a semantic part
detector in a semi-supervised or unsupervised setting, which has not yet been
well explored, although there has been increasing interest in discovering
discriminative parts in recent years.
References
[1] A. Agarwal and B. Triggs, “Recovering 3d human pose from monocular
images,” IEEE transactions on pattern analysis and machine intelligence,
vol. 28, no. 1, pp. 44–58, 2006. 4
[2] Y. Amit and A. Trouve, “Pop: Patchwork of parts models for object recog-
nition,” International Journal of Computer Vision, vol. 75, no. 2, pp. 267–
282, 2007. 2, 8
[3] M. Andriluka, S. Roth, and B. Schiele, “Pictorial structures revisited: Peo-
ple detection and articulated pose estimation,” in CVPR, 2009. 3, 19
[4] ——, “Monocular 3d pose estimation and tracking by detection,” in Com-
puter Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on.
IEEE, 2010, pp. 623–630. 3
[5] A. Angelova, S. Zhu, and Y. Lin, “Image segmentation for large-scale sub-
category flower recognition,” in WACV. IEEE, 2013, pp. 39–45. 10, 49
[6] T. Berg and P. Belhumeur, “Poof: Part-based one-vs.-one features for
fine-grained categorization, face verification, and attribute estimation,” in
CVPR, 2013. 3, 10, 50, 52, 55, 74, 82
[7] T. Berg and P. N. Belhumeur, “How do you tell a blackbird from a crow?”
in ICCV, 2013. 10, 50
[8] T. Berg, J. Liu, S. W. Lee, M. L. Alexander, D. W. Jacobs, and P. N. Bel-
humeur, “Birdsnap: Large-scale fine-grained visual categorization of birds,”
in CVPR, 2014. 10, 49, 50
[9] L. Bo, X. Ren, and D. Fox, “Kernel descriptors for visual recognition,” in
NIPS, 2010. 55
[10] L. Bo, C. Sminchisescu, A. Kanaujia, and D. Metaxas, “Fast algorithms
for large scale conditional 3d prediction,” in Computer Vision and Pattern
Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008, pp.
1–8. 4
[11] S. Branson, G. Van Horn, S. Belongie, and P. Perona, “Bird species cate-
gorization using pose normalized deep convolutional nets,” arXiv preprint
arXiv:1406.2952, 2014. 10, 50, 52, 55, 74, 81
[12] S. Branson, G. Van Horn, C. Wah, P. Perona, and S. Belongie, “The ig-
norant led by the blind: A hybrid human–machine vision system for fine-
grained categorization,” IJCV, vol. 108, no. 1-2, pp. 3–29, 2014. 10, 50
[13] S. Branson, C. Wah, F. Schroff, B. Babenko, P. Welinder, P. Perona, and
S. Belongie, “Visual recognition with humans in the loop,” in ECCV, 2010.
55
[14] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2d
pose estimation using part affinity fields,” in CVPR, 2017. 16, 19, 30
[15] J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik, “Human pose esti-
mation with iterative error feedback,” in CVPR, 2016. 19
[16] Y. Chai, V. Lempitsky, and A. Zisserman, “Symbiotic segmentation and
part localization for fine-grained categorization,” in ICCV, 2013. 3, 10, 18,
38, 50
[17] X. Chu, W. Ouyang, H. Li, and X. Wang, “Structured feature learning
for pose estimation,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2016. 19
[18] T. F. Cootes, G. J. Edwards, C. J. Taylor et al., “Active appearance mod-
els,” TPAMI, vol. 23, no. 6, pp. 681–685, 2001. 52
[19] N. Dalal and B. Triggs, “Histograms of oriented gradients for human de-
tection,” in CVPR, 2005. 4, 18
[20] M. Dantone, J. Gall, C. Leistner, and L. Van Gool, “Human pose estimation
using body parts dependent joint regressors,” in CVPR, 2013. 19
[21] J. Deng, J. Krause, and L. Fei-Fei, “Fine-grained crowdsourcing for fine-
grained recognition,” in CVPR, 2013. 10, 50, 55
[22] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and
T. Darrell, “Decaf: A deep convolutional activation feature for generic vi-
sual recognition,” arXiv preprint arXiv:1310.1531, 2013. 74
[23] R. Farrell, O. Oza, N. Zhang, V. I. Morariu, T. Darrell, and L. S. Davis,
“Birdlets: Subordinate categorization using volumetric primitives and pose-
normalized appearance,” in ICCV. IEEE, 2011, pp. 161–168. 8
[24] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Ob-
ject detection with discriminatively trained part-based models,” TPAMI,
vol. 32, no. 9, pp. 1627–1645, 2010. 2, 8, 18, 38, 60, 68
[25] P. F. Felzenszwalb and D. P. Huttenlocher, “Pictorial structures for object
recognition,” IJCV, vol. 61, no. 1, pp. 55–79, 2005. 18
[26] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars, “Unsupervised
visual domain adaptation using subspace alignment,” in Proceedings of the
IEEE International Conference on Computer Vision, 2013, pp. 2960–2967.
39
[27] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by back-
propagation,” in Proceedings of the 32nd International Conference on Ma-
chine Learning (ICML-15), 2015, pp. 1180–1189. 36, 39, 42, 43
[28] E. Gavves, B. Fernando, C. G. Snoek, A. W. Smeulders, and T. Tuytelaars,
“Fine-grained categorization by alignments,” in ICCV, 2013. 74
[29] ——, “Local alignments for fine-grained categorization,” IJCV, vol. 111,
no. 2, pp. 191–212, 2015. 55
[30] M. Ghifary, W. Bastiaan Kleijn, M. Zhang, and D. Balduzzi, “Domain gen-
eralization for object recognition with multi-task autoencoders,” in Pro-
ceedings of the IEEE International Conference on Computer Vision, 2015,
pp. 2551–2559. 39
[31] R. Girshick, “Fast r-cnn,” in ICCV, 2015. 53
[32] G. Gkioxari, R. Girshick, and J. Malik, “Actions and attributes from wholes
and parts,” in CVPR, 2015. 52, 55
[33] B. Gong, Y. Shi, F. Sha, and K. Grauman, “Geodesic flow kernel for unsu-
pervised domain adaptation,” in Computer Vision and Pattern Recognition
(CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2066–2073. 39
[34] C. Gu, J. J. Lim, P. Arbelaez, and J. Malik, “Recognition using regions,”
in CVPR, 2009. 20
[35] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in CVPR, 2016. 1, 9, 16, 19
[36] A. Holub, P. Perona, and M. C. Burl, “Entropy-based active learning for
object recognition,” in Computer Vision and Pattern Recognition Work-
shops, 2008. CVPRW’08. IEEE Computer Society Conference on. IEEE,
2008, pp. 1–8. 39
[37] S. Huang, Z. Xu, D. Tao, and Y. Zhang, “Part-stacked cnn for fine-grained
visual categorization,” in CVPR, 2016. 8, 16, 52, 61, 69, 75, 76, 81, 82
[38] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network
training by reducing internal covariate shift,” in ICML, 2015. 9, 51, 61, 79
[39] M. Jaderberg, K. Simonyan, A. Zisserman et al., “Spatial transformer net-
works,” in NIPS, 2015. 81
[40] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick,
S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast
feature embedding,” in ACM MM, 2014. 34, 56, 61, 70
[41] A. J. Joshi, F. Porikli, and N. Papanikolopoulos, “Multi-class active learn-
ing for image classification,” in Computer Vision and Pattern Recognition,
2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 2372–2379. 39
[42] A. Kanaujia, C. Sminchisescu, and D. Metaxas, “Semi-supervised hierar-
chical models for 3d human pose reconstruction,” in Computer Vision and
Pattern Recognition, 2007. CVPR’07. IEEE Conference on. IEEE, 2007,
pp. 1–8. 4
[43] L. Karlinsky and S. Ullman, “Using linking features in learning non-
parametric part models,” in ECCV, 2012. 19
[44] A. Khosla, N. Jayadevaprakash, B. Yao, and F.-F. Li, “Novel dataset for
fine-grained image categorization: Stanford dogs,” in Proc. CVPR Work-
shop on Fine-Grained Visual Categorization (FGVC), 2011. 10, 49
[45] M. Kostinger, P. Wohlhart, P. M. Roth, and H. Bischof, “Annotated facial
landmarks in the wild: A large-scale, real-world database for facial land-
mark localization,” in Computer Vision Workshops (ICCV Workshops),
2011 IEEE International Conference on. IEEE, 2011, pp. 2144–2151. 1
[46] J. Krause, H. Jin, J. Yang, and L. Fei-Fei, “Fine-grained recognition without
part annotations,” in CVPR, 2015. 49, 55, 74, 81, 84
[47] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with
deep convolutional neural networks,” in NIPS, 2012, pp. 1097–1105. 9, 10,
50, 51, 55, 56, 61
[48] N. Kumar, P. N. Belhumeur, A. Biswas, D. W. Jacobs, W. J. Kress, I. C.
Lopez, and J. V. Soares, “Leafsnap: A computer vision system for auto-
matic plant species identification,” in ECCV, 2012. 10, 50
[49] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning
applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11,
pp. 2278–2324, 1998. 9
[50] D. Lin, X. Shen, C. Lu, and J. Jia, “Deep lac: Deep localization, alignment
and classification for fine-grained recognition,” in CVPR, 2015. 52, 53, 69,
73, 74, 81
[51] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan,
P. Dollar, and C. L. Zitnick, “Microsoft coco: Common objects in con-
text,” in ECCV, 2014. 29, 35
[52] T.-Y. Lin, A. RoyChowdhury, and S. Maji, “Bilinear cnn models for fine-
grained visual recognition,” in ICCV, 2015. x, 10, 50, 51, 52, 55, 73, 74,
80, 81
[53] Z. Lin, G. Hua, and L. S. Davis, “Multiple instance feature for robust
part-based object detection,” in Computer Vision and Pattern Recognition,
2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 405–412. 2, 8
[54] J. Liu and P. N. Belhumeur, “Bird part localization using exemplar-based
models with enforced pose and subcategory consistency,” in ICCV, 2013.
18, 25, 31, 52
[55] J. Liu, A. Kanazawa, D. Jacobs, and P. Belhumeur, “Dog breed classifica-
tion using part localization,” in European Conference on Computer Vision.
Springer, 2012, pp. 172–185. 8
[56] J. Liu, Y. Li, and P. N. Belhumeur, “Part-pair representation for part
localization,” in ECCV, 2014. 1, 16, 18, 25, 31, 52
[57] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for
semantic segmentation,” in ECCV, 2015. 19, 36, 53, 57, 62
[58] J. L. Long, N. Zhang, and T. Darrell, “Do convnets learn correspondence?”
in NIPS, 2014. 70, 74
[59] D. G. Lowe, “Object recognition from local scale-invariant features,” in
ICCV, 1999. 4
[60] M. H. Kiapour, X. Han, S. Lazebnik, A. C. Berg, and T. L. Berg, “Where
to buy it: Matching street clothing photos in online shops,” in ICCV, 2015.
84
[61] S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi, “Fine-grained
visual classification of aircraft,” arXiv preprint arXiv:1306.5151, 2013. 10,
11, 49, 50, 55
[62] S. Maji and G. Shakhnarovich, “Part and attribute discovery from relative
annotations,” IJCV, vol. 108, no. 1-2, pp. 82–96, 2014. 3, 10, 50, 52, 55
[63] O. Matan, C. J. Burges, Y. Le Cun, and J. S. Denker, “Multi-digit recog-
nition using a space displacement neural network,” 1995. 57
[64] I. Matthews and S. Baker, “Active appearance models revisited,” IJCV,
vol. 60, no. 2, pp. 135–164, 2004. 52
[65] S. Milborrow and F. Nicolls, “Locating facial features with an extended
active shape model,” in ECCV, 2008. 52
[66] R. Navaratnam, A. W. Fitzgibbon, and R. Cipolla, “The joint manifold
model for semi-supervised multi-valued regression,” in Computer Vision,
2007. ICCV 2007. IEEE 11th International Conference on. IEEE, 2007,
pp. 1–8. 4
[67] A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for human
pose estimation,” in CVPR, 2016. 4, 16, 19, 38
[68] M.-E. Nilsback and A. Zisserman, “Automated flower classification over
a large number of classes,” in Computer Vision, Graphics & Image Pro-
cessing, 2008. ICVGIP’08. Sixth Indian Conference on. IEEE, 2008, pp.
722–729. 10, 49
[69] D. Novotny, D. Larlus, and A. Vedaldi, “I have seen enough: Transfer-
ring parts across categories,” in Proceedings of the British Machine Vision
Conference (BMVC), 2016. 36
[70] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler,
and K. Murphy, “Towards accurate multi-person pose estimation in the
wild,” in CVPR, 2017. 16, 19, 30, 34
[71] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. Jawahar, “Cats and dogs,”
in CVPR. IEEE, 2012, pp. 3498–3505. 8, 10, 49, 55
[72] L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele, “Poselet conditioned
pictorial structures,” in CVPR, 2013. 3, 19
[73] ——, “Strong appearance and expressive spatial models for human pose
estimation,” in ICCV, 2013. 1, 3, 16, 19
[74] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler,
and B. Schiele, “Deepcut: Joint subset partition and labeling for multi
person pose estimation,” in CVPR, 2016. 16, 19
[75] V. Ramakrishna, D. Munoz, M. Hebert, J. A. Bagnell, and Y. Sheikh, “Pose
machines: Articulated pose estimation via inference machines,” in ECCV,
2014. 1, 16, 19
[76] A. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “Cnn features off-
the-shelf: an astounding baseline for recognition,” in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition Workshops,
2014, pp. 806–813. 74
[77] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time
object detection with region proposal networks,” in NIPS, 2015. 20, 33, 62,
63
[78] E. Rosch, C. B. Mervis, W. D. Gray, D. M. Johnson, and P. Boyes-Braem,
“Basic objects in natural categories,” Cognitive psychology, vol. 8, no. 3,
pp. 382–439, 1976. 3, 10, 50
[79] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang,
A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual
recognition challenge,” IJCV, vol. 115, no. 3, pp. 211–252, 2015. 35, 79
[80] J. Sanchez, F. Perronnin, and Z. Akata, “Fisher vectors for fine-grained
visual categorization,” in CVPR, 2011. 55
[81] J. M. Saragih, S. Lucey, and J. F. Cohn, “Face alignment through subspace
constrained mean-shifts,” in ICCV, 2009, pp. 1034–1041. 52
[82] P. Schnitzspan, S. Roth, and B. Schiele, “Automatic discovery of meaningful
object parts with latent crfs,” in Computer Vision and Pattern Recognition
(CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 121–128. 2, 8
[83] G. Shakhnarovich, P. Viola, and T. Darrell, “Fast pose estimation with
parameter-sensitive hashing,” in ICCV, 2003. 4
[84] K. J. Shih, A. Mallya, S. Singh, and D. Hoiem, “Part localization using
multi-proposal consensus for fine-grained categorization,” in BMVC, 2015.
16, 19, 25, 31, 38, 52, 53, 73, 74
[85] L. Sigal, R. Memisevic, and D. J. Fleet, “Shared kernel information embed-
ding for discriminative inference,” in Computer Vision and Pattern Recogni-
tion, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 2852–2859.
4
[86] M. Simon and E. Rodner, “Neural activation constellations: Unsupervised
part model discovery with convolutional networks,” in ICCV, 2015. 39, 55,
74, 81
[87] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” in ICLR, 2015. 1, 9, 16, 55
[88] S. Singh, D. Hoiem, and D. Forsyth, “Learning a sequential search for
landmarks,” in CVPR, 2015. 16
[89] M. Stark, J. Krause, B. Pepik, D. Meger, J. J. Little, B. Schiele, and
D. Koller, “Fine-grained categorization for 3d scene understanding,” Inter-
national Journal of Robotics Research, vol. 30, no. 13, pp. 1543–1552, 2011.
10, 49
[90] M. Sun and S. Savarese, “Articulated part-based model for joint object
detection and pose estimation,” in ICCV, 2011. 19
[91] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, “Inception-v4, inception-
resnet and the impact of residual connections on learning,” arXiv preprint
arXiv:1602.07261, 2016. 9
[92] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in
CVPR, 2015. 1, 2, 9, 16, 17, 20, 23, 55
[93] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking
the inception architecture for computer vision,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–
2826. 9
[94] Y. Tian, C. L. Zitnick, and S. G. Narasimhan, “Exploring the spatial hi-
erarchy of mixture models for human pose estimation,” in ECCV, 2012.
19
[95] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler, “Joint training of a
convolutional network and a graphical model for human pose estimation,”
in NIPS, 2014. 4, 16, 19, 57
[96] S. Tong and D. Koller, “Support vector machine active learning with appli-
cations to text classification,” Journal of machine learning research, vol. 2,
no. Nov, pp. 45–66, 2001. 39
[97] A. Toshev and C. Szegedy, “Deeppose: Human pose estimation via deep
neural networks,” in CVPR, 2014. 4, 19
[98] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko, “Simultaneous deep trans-
fer across domains and tasks,” in Proceedings of the IEEE International
Conference on Computer Vision, 2015, pp. 4068–4076. 39
[99] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell, “Deep
domain confusion: Maximizing for domain invariance,” arXiv preprint
arXiv:1412.3474, 2014. 39
[100] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders, “Se-
lective search for object recognition,” IJCV, vol. 104, no. 2, pp. 154–171,
2013. 63
[101] R. Urtasun and T. Darrell, “Sparse probabilistic regression for activity-
independent human pose inference,” in Computer Vision and Pattern
Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008, pp.
1–8. 4
[102] G. Van Horn, S. Branson, R. Farrell, S. Haber, J. Barry, P. Ipeirotis, P. Per-
ona, and S. Belongie, “Building a bird recognition app and large scale
dataset with citizen scientists: The fine print in fine-grained dataset collec-
tion,” in CVPR, 2015. 10, 50
[103] A. Vedaldi, S. Mahendran, S. Tsogkas, S. Maji, R. Girshick, J. Kannala,
E. Rahtu, I. Kokkinos, M. B. Blaschko, D. Weiss et al., “Understanding
objects in detail with fine-grained attributes,” in CVPR, 2014. 49, 55
[104] C. Wah, S. Branson, P. Perona, and S. Belongie, “Multiclass recognition
and part localization with humans in the loop,” in ICCV, 2011. 8, 49
[105] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The caltech-
ucsd birds-200-2011 dataset,” 2011. 10, 11, 29, 31, 44, 49, 50, 55, 70
[106] C. Wah, G. Van Horn, S. Branson, S. Maji, P. Perona, and S. Be-
longie, “Similarity comparisons for interactive fine-grained categorization,”
in CVPR, 2014. 55
[107] D. Wang, Z. Shen, J. Shao, W. Zhang, X. Xue, and Z. Zhang, “Multiple
granularity descriptors for fine-grained categorization,” in ICCV, 2015. 10,
50
[108] J. Wang, K. Markert, and M. Everingham, “Learning models for object
recognition from natural language descriptions.” in BMVC, vol. 1, 2009,
p. 2. 8
[109] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, “Convolutional
pose machines,” in CVPR, 2016. 16, 19, 38, 53, 64
[110] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and
P. Perona, “Caltech-ucsd birds 200,” 2010. 10, 49
[111] T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang, “The application
of two-level attention models in deep convolutional neural network for fine-
grained image classification,” in CVPR, 2015. 38, 74, 81
[112] Z. Xu, S. Huang, Y. Zhang, and D. Tao, “Augmenting strong supervision
using web data for fine-grained categorization,” in ICCV, 2015. 49
[113] ——, “Webly-supervised fine-grained visual categorization via deep domain
adaptation,” IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 2016. 38
[114] J. Yang, R. Yan, and A. G. Hauptmann, “Cross-domain video concept de-
tection using adaptive svms,” in Proceedings of the 15th ACM international
conference on Multimedia. ACM, 2007, pp. 188–197. 39
[115] W. Yang, W. Ouyang, H. Li, and X. Wang, “End-to-end learning of de-
formable mixture of parts and deep convolutional neural networks for hu-
man pose estimation,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2016. 19
[116] Y. Yang and D. Ramanan, “Articulated pose estimation with flexible
mixtures-of-parts,” in CVPR, 2011. 1, 16, 19
[117] ——, “Articulated human detection with flexible mixtures of parts,”
TPAMI, vol. 35, no. 12, pp. 2878–2890, 2013. 70, 75
[118] X. Yu, F. Zhou, and M. Chandraker, “Deep deformation network for object
landmark localization,” arXiv preprint arXiv:1605.01014, 2016. 25, 31, 52,
75, 76
[119] H. Zhang, T. Xu, M. Elhoseiny, X. Huang, S. Zhang, A. Elgammal, and
D. Metaxas, “Spda-cnn: Unifying semantic part detection and abstraction
for fine-grained recognition,” in CVPR, 2016. 8, 16, 19, 38, 51, 52, 53, 81
[120] N. Zhang, J. Donahue, R. Girshick, and T. Darrell, “Part-based r-cnns for
fine-grained category detection,” in ECCV, 2014. 3, 10, 11, 18, 50, 52, 55,
57, 59, 67, 69, 73, 74, 81
[121] N. Zhang, R. Farrell, and T. Darrell, “Pose pooling kernels for sub-category
recognition,” in Computer Vision and Pattern Recognition (CVPR), 2012
IEEE Conference on. IEEE, 2012, pp. 3665–3672. 8
[122] N. Zhang, R. Farrell, F. Iandola, and T. Darrell, “Deformable part descrip-
tors for fine-grained recognition and attribute prediction,” in CVPR, 2013.
18, 38
[123] N. Zhang, M. Paluri, M. Ranzato, T. Darrell, and L. Bourdev, “Panda:
Pose aligned networks for deep attribute modeling,” in CVPR, 2014. 52,
55
[124] N. Zhang, E. Shelhamer, Y. Gao, and T. Darrell, “Fine-grained pose pre-
diction, normalization, and recognition,” arXiv preprint arXiv:1511.07063,
2015. 8, 16, 25, 31, 51, 52, 53, 55, 64, 75, 76
[125] X. Zhang, H. Xiong, W. Zhou, W. Lin, and Q. Tian, “Picking deep filter
responses for fine-grained image recognition,” in CVPR, 2016, pp. 1134–
1142. 51, 52
[126] X. Zhang, H. Xiong, W. Zhou, and Q. Tian, “Fused one-vs-all mid-level
features for fine-grained visual categorization,” in Proceedings of the ACM
International Conference on Multimedia. ACM, 2014, pp. 287–296. 3, 10,
50
[127] Y. Zhang, X.-s. Wei, J. Wu, J. Cai, J. Lu, V.-A. Nguyen, and M. N.
Do, “Weakly supervised fine-grained image categorization,” arXiv preprint
arXiv:1504.04943, 2015. 74
[128] F. Zhou, J. Brandt, and Z. Lin, “Exemplar-based graph matching for ro-
bust facial landmark localization,” in Proceedings of the IEEE International
Conference on Computer Vision, 2013, pp. 1025–1032. 1
[129] J. Zhu, X. Chen, and A. L. Yuille, “Deepm: A deep part-based model
for object detection and semantic part localization,” arXiv preprint
arXiv:1511.07131, 2015. 52, 55
[130] L. Zhu, Y. Chen, A. Yuille, and W. Freeman, “Latent hierarchical structural
learning for object detection,” in Computer Vision and Pattern Recognition
(CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 1062–1069. 2, 8
[131] X. Zhu and D. Ramanan, “Face detection, pose estimation, and land-
mark localization in the wild,” in Computer Vision and Pattern Recognition
(CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2879–2886. 1
[132] C. L. Zitnick and P. Dollar, “Edge boxes: Locating object proposals from
edges,” in ECCV, 2014. 19, 38