Deep Representation Learning
for Keypoint localization
Shaoli Huang
Faculty of Engineering and Information Technology
University of Technology Sydney
A thesis submitted for the degree of
Doctor of Philosophy
2017
To my family
Mingjiang Liang and Jingyi Huang
Certificate of Original Authorship
I certify that the work in this thesis has not previously been submitted
for a degree nor has it been submitted as part of requirements for a
degree except as fully acknowledged within the text.
I also certify that the thesis has been written by me. Any help that I
have received in my research work and the preparation of the thesis
itself has been acknowledged. In addition, I certify that all informa-
tion sources and literature used are indicated in the thesis.
Shaoli Huang
Acknowledgements
First and foremost, I would like to thank my supervisor, Prof. Dacheng
Tao, who not only guided me into the field of computer vision but also
provided me with advice on life and careers.
I would also like to thank my parents, my brother and my sisters
for giving me love and support. I am very thankful to my dear wife
Mingjiang Liang, who has been with me these years. She takes care of
the family and allows me to spend more time on my research. I
am also thankful for the unwavering love and general happiness that
she has brought into my life. Along with her, I want to thank my
daughter, Jingyi Huang. She has been a pure joy and has made my
life much more fun. I am also thankful to my mother-in-law Fengying
Lei, who took care of my family while I was writing the thesis.
I also would like to give special thanks to Mingming Gong for numer-
ous discussions that have played a significant role in bringing clarity
to my ideas. I also would like to thank Dr. Jun Li and Dr. Zhe Xu
who spent much time discussing my work with me.
Finally, I would like to thank the colleagues and friends I met in
Sydney: Shirui Pan, Ruxin Wang, Tongliang Liu, Chang Xu, Haishuai
Wang, Huan Fu and so many others.
Abstract
Keypoint localization aims to locate points of interest in the input
image. This technique has become an important tool for many
computer vision tasks such as fine-grained visual categorization, ob-
ject detection, and pose estimation. Tremendous effort, therefore, has
been devoted to improving the performance of keypoint localization.
However, most of the proposed methods supervise keypoint detectors
using a confidence map generated from ground-truth keypoint loca-
tions. Furthermore, the maximum achievable localization accuracy
differs from keypoint to keypoint, because it is determined by the un-
derlying keypoint structures. Thus, the keypoint detector often fails
to detect ambiguous keypoints if trained with strict supervision, that
is, permitting only a small localization error. Training with looser su-
pervision could help detect the ambiguous keypoints, but this comes
at a cost to localization accuracy for those keypoints with distinctive
appearances. In this thesis, we propose hierarchically supervised nets
(HSNs), a method that imposes hierarchical supervision within deep
convolutional neural networks (CNNs) for keypoint localization. To
achieve this, we first propose a fully convolutional Inception network
with several branches of varying depths to obtain hierarchical feature
representations. Then, we build a coarse part detector on top of each
branch of features and a fine part detector which takes features from
all the branches as the input.
Collecting image data with keypoint annotations is harder than collecting
image-level labels. One can build a classification dataset by retrieving
images from Flickr or Google Images with keyword searches followed by a
refinement process, whereas keypoint annotation requires a human to click
the rough location of each keypoint in every image. To address the
problem of insufficient part annotations, we propose a part detection
framework that combines deep representation learning and domain
adaptation within the same training process. We adopt one of the
coarse detectors from HSNs as the baseline and perform a quantita-
tive evaluation on the CUB200-2011 and BirdSnap datasets. Interestingly,
our method trained on images of only 10 species achieves 61.4% PCK
accuracy on the testing set of 190 unseen species.
Finally, we explore the application of keypoint localization in the
task of fine-grained visual categorization. We propose a new part-
based model that consists of a localization module to detect object
parts (where pathway) and a classification module to classify fine-
grained categories at the subordinate level (what pathway). Exper-
imental results reveal that our method with keypoint localization
achieves state-of-the-art performance on the Caltech-UCSD Birds-
200-2011 dataset.
Contents
Contents i
List of Figures v
List of Tables ix
1 Introduction 1
1.1 Objectives and Motivation . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problems and Challenges . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Keypoints Localization . . . . . . . . . . . . . . . . . . . . 3
1.2.2 Human Pose Estimation . . . . . . . . . . . . . . . . . . . 5
1.2.3 Bird Part Localization . . . . . . . . . . . . . . . . . . . . 8
1.3 Convolutional Neural Network . . . . . . . . . . . . . . . . . . . . 9
1.4 Fine-grained Visual Categorization . . . . . . . . . . . . . . . . . 10
1.5 Contributions and Thesis Outline . . . . . . . . . . . . . . . . . . 11
1.5.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5.2 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2 Hierarchically Supervised Nets for Keypoint Localization 14
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.1 Bird part detection . . . . . . . . . . . . . . . . . . . . . . 18
2.2.2 Human pose estimation . . . . . . . . . . . . . . . . . . . . 19
2.3 Hierarchically Supervised Nets . . . . . . . . . . . . . . . . . . . . 20
2.3.1 Network Architecture . . . . . . . . . . . . . . . . . . . . . 20
2.3.2 Learning and Inference . . . . . . . . . . . . . . . . . . . . 25
2.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4.1 Bird Part Localization . . . . . . . . . . . . . . . . . . . . 31
2.4.2 Human Pose Estimation . . . . . . . . . . . . . . . . . . . 33
2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3 Transferring Part Locations Across Fine-grained Categories 35
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2.1 Part Detection . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2.2 Domain Adaptation and Active Learning . . . . . . . . . . 39
3.3 Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3.1 Model Formulation . . . . . . . . . . . . . . . . . . . . . . 39
3.3.2 Optimization with Backpropagation . . . . . . . . . . . . . 43
3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4.1 Datasets and Setting . . . . . . . . . . . . . . . . . . . . . 44
3.4.2 Results and Analysis . . . . . . . . . . . . . . . . . . . . . 45
3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4 Fine-grained Categorization with Part Localization 48
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2.1 Keypoint Localization . . . . . . . . . . . . . . . . . . . . 52
4.2.2 Fine-Grained Visual Categorization . . . . . . . . . . . . . 53
4.3 Part-Stacked CNN . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3.1 Localization Network . . . . . . . . . . . . . . . . . . . . . 57
4.3.2 Classification network . . . . . . . . . . . . . . . . . . . . 58
4.4 Deeper Part-Stacked CNN . . . . . . . . . . . . . . . . . . . . . . 61
4.4.1 Localization Network . . . . . . . . . . . . . . . . . . . . . 62
4.4.2 Classification network . . . . . . . . . . . . . . . . . . . . 67
4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.5.1 Dataset and implementation details . . . . . . . . . . . . . 70
4.5.2 Localization results for PSCNN . . . . . . . . . . . . . . . 70
4.5.3 Classification results for PSCNN . . . . . . . . . . . . . . . 72
4.5.4 Localization Results for DPSCNN . . . . . . . . . . . . . . 74
4.5.5 Classification results for DPSCNN . . . . . . . . . . . . . . 79
4.5.6 Model interpretation . . . . . . . . . . . . . . . . . . . . . 82
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5 Conclusions 87
References 89
List of Figures
1.1 Illustrating the pose estimation problem. . . . . . . . . . . . . . . 6
1.2 Illustrating the challenges of human pose estimation. . . . . . . . 7
1.3 Illustrating the bird part localization problem. . . . . . . . . . . 8
2.1 An illustration of the predicted keypoints from our HSN architec-
ture. The left image contains highly accurate keypoints detected
by the fine detector with strict supervision, the middle image con-
tains keypoints from coarse detectors with loose supervisions, and
the right image shows the final predictions by unifying the fine and
coarse detectors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Network architecture of the hierarchically supervised nets. The
coarse stream learns three coarse detectors using hierarchical super-
visions, while the fine stream learns a fine detector via strict
supervision. Then the coarse predictions and fine predictions are
unified for the final prediction at the inference stage. . . . . . . . 21
2.3 Different methods for obtaining multi-scale features. (a) Inputting
multiple resolutions of an image. (b) Using different sizes of convolutional
filters. (c) Concatenating feature maps of different resolutions. (d) Con-
catenating feature maps from different layers, each of which has
multiple convolutional filters. . . . . . . . . . . . . . . . . . . . . 24
2.4 An illustration of . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.5 Bird part detection results with occlusion, viewpoint variation, clut-
tered background, and pose variation from the test set. . . . . . . 28
2.6 Pose estimation results with occlusion, crowding, deformation, and
low resolution from the COCO test set. . . . . . . . . . . . . . . . 32
3.1 Illustration of the research problem. The source domain contains
part annotations, while parts are not annotated in the target do-
main. Also, the target domain contains species which do not exist
in the source domain. . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2 The proposed architecture consists of three components: a feature
extractor (yellow), a part classifier, and a domain classifier (blue).
All these components share computation in a feed-forward pass.
The feature extractor outputs feature representation as the input
of the other components. The part classifier is designed to find
the part location, while the domain classifier is added to handle the
domain shift between the source and target domains. Note that the
backpropagation gradients that pass from the domain classifier to the
feature extractor are multiplied by a negative constant during the
backpropagation. . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.1 Overview of the proposed approach. We propose to classify fine-
grained categories by modeling the subtle difference from specific
object parts. Beyond classification results, the proposed DPS-CNN
architecture also offers human-understandable instructions on how
to classify highly similar object categories explicitly. . . . . . . . . 49
4.2 Illustration of the localization network. (a). Suppose a certain
layer outputs feature maps with size 3x3, and the corresponding
receptive fields are shown by dashed box. In this paper, we rep-
resent the center of each receptive field with a feature vector at
the corresponding position. (b). The first column is the input
image. In the second image, each black dot is a candidate point
which indicates the center of a receptive field. The final stage is to
determine if a candidate point is a particular part or not. . . . . 54
4.3 The network architecture of the proposed Part-Stacked CNN model.
The model consists of 1) a fully convolutional network for part
landmark localization; 2) a part stream where multiple parts share
the same feature extraction procedure, while being separated by
a novel part crop layer given detected part locations; 3) an ob-
ject stream with lower spatial resolution input images to capture
bounding-box level supervision; and 4) three fully connected layers
to achieve the final classification results based on a concatenated
feature map containing information from all parts and the bound-
ing box. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.4 Demonstration of the localization network. The training process
is denoted inside the dashed box. For inference, a Gaussian kernel
is then introduced to remove noise. The results are M 2D part
locations in the 27× 27 conv5 feature map. . . . . . . . . . . . . 58
4.5 Demonstration of the localization network. Training process is
denoted inside the dashed box. For inference, a Gaussian kernel
is then introduced to remove noise. The results are M 2D part
locations in the 27× 27 conv5 feature map. . . . . . . . . . . . . 62
4.6 Network architecture of the proposed Deeper Part-Stacked CNN.
The model consists of: (1) a fully convolutional network for part
landmark localization; (2) a part stream where multiple parts share
the same feature extraction procedure, while being separated by a
novel part crop layer given detected part locations; (3) an object
stream to capture global information; and (4) Feature fusion layer
with input feature vectors from part stream and object stream to
achieve the final feature representation. . . . . . . . . . . . . . . . 65
4.7 Different strategies for feature fusion which are illustrated in (a)
Fully connected,(b) Scale Sum, (c) Scale Max and (d) Scale Aver-
age Max respectively. . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.8 Typical localization results on CUB-200-2011 test set. We show 6
of the 15 detected parts here. They are: beak (red), belly (green),
crown (blue), right eye (yellow), right leg (magenta), tail (cyan).
Better viewed in color. . . . . . . . . . . . . . . . . . . . . . . . . 71
4.9 Typical localization results on CUB-200-2011 test set. Better viewed
in color. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.10 Feature maps visualization of Inception-4a layer. Each example
image is followed by three rows of top six scoring feature maps,
which are from the part stream, the object stream, and the baseline
BN-Inception network respectively. A red dashed box indicates a fail-
ure case of visualization using the model learned by our approach. 78
4.11 Example of the prediction manual generated by the proposed ap-
proach. Given a test image, the system reports its predicted class
label with some typical exemplar images. Part-based comparison
criteria between the predicted class and its most similar classes
are shown in the right part of the image. The number in brackets
shows the confidence of classifying two categories by introducing
a specific part. We present top three object parts for each pair
of comparison. For each of the parts, three part-center-cropped
patches are shown for the predicted class (upper rows) and the
compared class (lower rows) respectively. . . . . . . . . . . . . . . 86
List of Tables
2.1 Comparison with methods that report per-part PCK(%) and aver-
age PCK(%) on CUB200-2011. The abbreviated part names from
left to right are: Back, Beak, Belly, Breast, Crown, Forehead, Left
Eye, Left Leg, Left Wing, Nape, Right Eye, Right Leg, Right Wing,
Tail, and Throat . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2 Comparison of PCP(%) and over-all PCP(%) on CUB200-2011.
The abbreviated part names from left to right are: Back, Beak,
Belly, Breast, Crown, Forehead, Eye, Leg, Wing, Nape, Tail, and
Throat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3 Performance comparison between using strict supervision only and
hierarchical supervision. . . . . . . . . . . . . . . . . . . . . . . . 30
2.4 Results on COCO keypoint on test-dev and test-standard split . . 30
3.1 Part transferring results for different splits of CUB200-2011 dataset.
Per-part PCKs(%) and mean PCK(%) are given. The abbreviated
part names from left to right are: Back, Beak, Belly, Breast,
Crown, Forehead, Left Eye, Left Leg, Left Wing, Nape, Right Eye,
Right Leg, Right Wing, Tail, and Throat . . . . . . . . . . . . . . 44
3.2 Part transferring from CUB200-2011(Source) to BirdSnap(Target).
Per-part PCKs(%) and mean PCK(%) are given. . . . . . . . . . 45
4.1 APK for each object part in the CUB-200-2011 test set in descend-
ing order. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2 Comparison of different model architectures on localization results.
“conv5” stands for the first 5 convolutional layers in CaffeNet;
“conv6(256)” stands for the additional 1 × 1 convolutional layer
with 256 output channels; “cls” denotes the classification layer
with M + 1 output channels; “gaussian” represents a Gaussian
kernel for smoothing. . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.3 The effect of increasing the number of object parts on the classifi-
cation accuracy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.4 The effect of increasing the number of object parts on the classifi-
cation accuracy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.5 Comparison with state-of-the-art methods on the CUB-200-2011
dataset. To conduct fair comparisons, for all the methods using
deep features, we report their results on the standard seven-layer
architecture (mostly AlexNet except VGG-m for [52]) if possible.
Note that our method achieves comparable results with state-of-
the-art while running in real-time. . . . . . . . . . . . . . . . . . . 74
4.6 Receptive field size of different layers. . . . . . . . . . . . . . . . . 76
4.7 Comparison of per-part PCK(%) and over-all APK(%) on CUB200-
2011. The abbreviated part names from left to right are: Back,
Beak, Belly, Breast, Crown, Forehead, Left Eye, Left Leg, Left Wing,
Nape, Right Eye, Right Leg, Right Wing, Tail, and Throat . . . . 76
4.8 Localization recall of candidate points selected by inception-4a
layer with different α values. The abbreviated part names from
left to right are: Back, Beak, Belly, Breast, Crown, Forehead, Left
Eye, Left Leg, Left Wing, Nape, Right Eye, Right Leg, Right Wing,
Tail, and Throat . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.9 Localization recall of candidate points selected by inception-4a
layer with different α values. The abbreviated part names from
left to right are: Back, Beak, Belly, Breast, Crown, Forehead, Left
Eye, Left Leg, Left Wing, Nape, Right Eye, Right Leg, Right Wing,
Tail, and Throat . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.10 Comparison of different settings of our approach on CUB200-2011 . 80
4.11 Comparison with state-of-the-art methods on the CUB-200-2011
dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Chapter 1
Introduction
1.1 Objectives and Motivation
Keypoint localization refers to the task of finding points of interest in an image.
These points can be divided into feature keypoints and semantic keypoints
according to their intended use in visual applications. Feature keypoints are
mainly used as reference points to outline an object. A typical example is
facial landmark localization [45,128,131], where landmarks are used to represent
facial features and geometry, such as points on the contours of eyebrows, eyes,
nose, and lips. While a single feature keypoint is not required to be semantically
meaningful, each semantic keypoint has a particular meaning for the observed
object. For example, keypoints are defined as human body joints (e.g., wrist,
ankle, hip) or bird parts (e.g., belly, wing, tail). This kind of keypoint carries
valuable information for object recognition, object detection, and pose estimation.
In this thesis, we focus on the problem of localizing semantic keypoints.
Considerable efforts have been devoted to developing a strong part detector
together with a spatial model for keypoint localization. While early methods
focused on designing handcrafted features or developing graphical models of
spatial constraints [56,73,75,116], recent deep-learning-based methods have re-
placed handcrafted features and explicit spatial models with learned represen-
tations [35, 87, 92]. However, these methods usually supervise keypoint
detectors using a confidence map generated from ground-truth keypoint locations.
Furthermore, the maximum achievable localization accuracy differs from keypoint
to keypoint, because it is determined by the underlying keypoint structures. For
example, the keypoints with distinctive appearances, such as the shoulders and
head, can be easily detected with high accuracy, while the keypoints with am-
biguous appearance such as an occluded ankle, have much lower localization ac-
curacies. Thus, the keypoint detector often fails to detect ambiguous keypoints if
trained with strict supervision, that is, permitting only a small localization error.
Training with looser supervision could help detect the ambiguous keypoints, but
this comes at a cost to localization accuracy for those keypoints with distinctive
appearances. In this thesis, we propose hierarchically supervised nets (HSNs), a
method that imposes hierarchical supervision within deep convolutional neural
networks (CNNs) for keypoint localization. To achieve this, we first propose a
fully convolutional Inception network [92] with several branches of varying depths
to obtain hierarchical feature representations. Then, we build a coarse part de-
tector on top of each branch of features and a fine part detector which takes
features from all the branches as the input.
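The contrast between strict and loose supervision can be made concrete. Below is a minimal NumPy sketch of Gaussian confidence-map targets; the map size and sigma values are illustrative placeholders, not the settings used in this thesis:

```python
import numpy as np

def confidence_map(center, shape, sigma):
    """Gaussian confidence map centered on a ground-truth keypoint.

    A small sigma gives a strict target (only near-exact predictions
    score highly); a larger sigma tolerates larger localization errors.
    """
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    d2 = (xs - center[0]) ** 2 + (ys - center[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Strict target for a fine detector vs. loose target for a coarse detector.
strict = confidence_map((32, 20), (64, 64), sigma=1.5)
loose = confidence_map((32, 20), (64, 64), sigma=6.0)
```

Supervising different branches with targets of different widths is one plausible way to realize hierarchical supervision; the thesis's exact formulation is given in Chapter 2.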
Also, for the task of keypoint localization, collecting image data with key-
point annotations is harder than collecting image-level labels. One can build
a classification dataset by retrieving images from Flickr or Google Images with
keyword searches followed by a refinement process, whereas keypoint annotation
requires a human to click the rough location of each keypoint in every image.
Considering the prob-
lem of insufficient part annotations, we aim to design a part detector which can
be trained on data without part annotation. We achieve this by combining deep
representation learning and domain adaptation within the same training process.
To learn feature representations that are discriminative to object parts but in-
variant to the domain shift, we train the network by minimizing the loss of the
part classifier and maximizing the loss of domain classifier. The former enforces
the network to learn discriminative features, while the latter encourages learning
features invariant to the change of domain.
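This minimize–maximize objective is commonly implemented by reversing the sign of the domain-classifier gradient before it reaches the feature extractor. The toy NumPy sketch below illustrates one such update on a hypothetical linear feature extractor and logistic domain classifier; it is a simplified stand-in for the actual network, not the thesis implementation:

```python
import numpy as np

def adversarial_step(W_feat, W_dom, x, d, lam=1.0, lr=0.1):
    """One SGD step with a reversed domain gradient.

    W_feat: linear feature extractor, W_dom: logistic domain classifier,
    x: input vector, d: domain label in {0, 1}.  The domain classifier
    descends its loss, while the reversed (negated, scaled by lam)
    gradient makes the feature extractor ascend that same loss, pushing
    the learned features toward domain invariance.
    """
    f = W_feat @ x                              # feature vector
    p = 1.0 / (1.0 + np.exp(-(W_dom @ f)))      # predicted domain prob.
    g_logit = p - d                             # d(logistic loss)/d(logit)
    g_dom = g_logit * f                         # gradient for W_dom
    g_feat = np.outer(g_logit * W_dom, x)       # gradient reaching W_feat
    W_dom = W_dom - lr * g_dom                  # minimize the domain loss
    W_feat = W_feat - lr * (-lam * g_feat)      # reversed: maximize it
    return W_feat, W_dom
```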
It is also worth noting that the technique of part localization has been used
to boost the performance in many tasks including object detection [24, 53, 130]
and recognition [2, 82], especially for fine-grained categorization, where subtle
differences between fine-grained categories mostly reside in the unique properties
of object parts [6, 16, 62, 78, 120, 126]. Therefore, we explore the application of
keypoint localization in the task of fine-grained visual categorization. We do this
by learning a new part-based CNN that models multiple object parts in a unified
framework. The proposed method consists of a localization module to detect
object parts (where pathway) and a classification module to classify fine-grained
categories at the subordinate level (what pathway).
1.2 Problems and Challenges
1.2.1 Keypoints Localization
Keypoints localization is generally formulated as a probabilistic problem of esti-
mating the posterior distribution p(x|z), where x is the representation of the
keypoints and z is the image features. Therefore, the primary research in key-
points localization can be divided into three categories:
• The models for the representation of the keypoints - x
• The methods for feature extraction and encoding from images - z
• The inference approaches to estimate the posterior - p(x|z)
Keypoint Representation. There are many ways to represent the keypoints
by considering the structural dependencies among them. The simplest way
is to parameterize the keypoints by their spatial locations. For example, x =
{p1, p2, ..., pN}. However, this representation varies with the morphology of a
given individual. To obtain an invariant representation, many methods [3, 4, 72, 73]
encode the keypoints as a kinematic tree, x = {τ, θ_τ, θ_1, θ_2, ..., θ_N}, where τ is the
root node, θ_τ is the orientation of the root node, and {θ_i}_{i=1}^N represents the
orientations of the other keypoints with respect to the root node. Alternatively,
non-tree models have been introduced to model the keypoints as a set of parts,
x = {x_1, x_2, ..., x_N}, where each part encodes information including spatial
position, orientation, and scaling, i.e., x_i = {τ_i, θ_i, s_i}.
Image Features. Image feature extraction is an indispensable component of the
keypoint localization system. Over the years, many hand-crafted features such
as SIFT [59] or HoG [19] have been used to model the salient parts of the image.
In recent years, deep feature representation has been widely used to boost the
performance of parts/joints detection to a new level. Toshev et al. [97] propose
a cascade CNN for keypoint regression. In [95], multiple sizes of filter kernels
are used to simultaneously capture features across scales. Similar to this, [67]
upsamples the feature maps of lower layers and stacks them with that of higher
layers.
Inference. There are many methods proposed to characterize the posterior
distribution at the inference stage. These methods can be divided into three
groups: discriminative models, generative models, and part-based models. Dis-
criminative methods have been demonstrated to be very effective for pose estima-
tion [1, 10, 42, 66, 83, 85, 101]. This class of methods learns the parameters of the
conditional distribution p(x|z) from the given training data. For example, the
simplest method, linear regression [1], first assumes that the body configuration
x is represented by a linear combination of the image features, z, with additive
Gaussian noise, that is,
x = A[z − μ_z] + μ_x + υ,    (1.1)
where υ ∼ N(0, Σ), μ_x = (1/N) ∑_{i=1}^N x_i, and μ_z = (1/N) ∑_{i=1}^N z_i. Then the
conditional distribution is obtained by:
p(x|z) = N(A[z − μ_z] + μ_x, Σ).    (1.2)
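Equations (1.1) and (1.2) amount to a least-squares fit of A on centered data. A small NumPy sketch on synthetic data (dimensions and noise level are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(100, 5))                         # image features z
A_true = rng.normal(size=(3, 5))
X = Z @ A_true.T + 0.01 * rng.normal(size=(100, 3))   # pose parameters x

mu_z, mu_x = Z.mean(axis=0), X.mean(axis=0)
# Least-squares estimate of A in x = A (z - mu_z) + mu_x + noise
A = np.linalg.lstsq(Z - mu_z, X - mu_x, rcond=None)[0].T

def predict(z):
    # Mean of the conditional p(x | z) = N(A (z - mu_z) + mu_x, Sigma)
    return A @ (z - mu_z) + mu_x
```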
Alternatively, the posterior distribution is usually expressed as a product of
a likelihood and a prior in the category of generative models, that is:
p(x|z) ∝ p(z|x)p(x). (1.3)
Most methods in this group adopt the maximum a posteriori probability (MAP)
method to search for the most probable configurations with high prior probability
and likelihood:
x_MAP = argmax_x p(x|z)    (1.4)
This class of methods has not been widely used for pose estimation because of
the high inference complexity. Therefore, part-based models have been introduced
to reduce the search space by representing a pose as a set of parts with connection
constraints. For instance, the body configuration can be represented as a Markov
Random Field (MRF), in which body parts are considered as nodes and potential
functions are used to encode the spatial dependencies between parts. Thus, the
posterior, p(x|z) is given as:
p(x|z) ∝ p(z|x) p(x)
       = p(z|x_1, x_2, ..., x_M) p(x_1, x_2, ..., x_M)
       = ∏_{i=1}^M p(z|x_i) p(x_i) ∏_{(i,j)∈E} p(x_i, x_j)    (1.5)
In such cases, many message-passing methods, such as Belief Propagation (BP),
are used to solve the inference problem efficiently.
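For a chain of parts, max-product message passing reduces to a simple dynamic program (the Viterbi algorithm); the sketch below is a simplified stand-in for general belief propagation on tree-structured models:

```python
import numpy as np

def chain_map(unary, pairwise):
    """MAP assignment for a chain of M parts over K candidate locations.

    unary: (M, K) log-scores, playing the role of log p(z|x_i) p(x_i);
    pairwise: (K, K) log-potentials, playing the role of log p(x_i, x_j)
    for consecutive parts.  Forward max-product messages followed by
    backtracking recover the jointly most probable configuration.
    """
    M, K = unary.shape
    msg, back = unary[0], []
    for i in range(1, M):
        scores = msg[:, None] + pairwise + unary[i][None, :]
        back.append(scores.argmax(axis=0))   # best predecessor per state
        msg = scores.max(axis=0)
    path = [int(msg.argmax())]
    for b in reversed(back):
        path.append(int(b[path[-1]]))
    return path[::-1]                        # one location index per part
```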
1.2.2 Human Pose Estimation
The task of human pose estimation aims to recover the body configuration from
image features. As shown in Figure 1.1, the key step for this task is to localize
the body joints, with which we can depict the limbs and understand a person’s
posture in images. Human pose estimation is a very active research topic in
computer vision because many real-world applications can benefit tremendously
from such a technology. For instance, human pose estimation can be used to
analyze human behaviors in smart surveillance systems, to help health-care robots
in detecting fallen people, to develop animation in movie production, and to
interact with computers in gaming; many driver-assistance systems even utilize
this technique to monitor the driver’s pose for safe driving.
Despite the exhaustive research, pose estimation remains a challenging task
in computer vision, mainly due to the following reasons (see Figure 1.2):
• extremely deformable body.
• self-occlusion, where body parts occlude each other.
Figure 1.1: Illustrating the pose estimation problem (keypoints: eyes, ears, nose,
shoulders, elbows, wrists, hips, knees, ankles).
Figure 1.2: Illustrating the challenges of human pose estimation.
Figure 1.3: Illustrating the bird part localizatoin problem.
• highly variable appearance due to clothing, lighting, body size, shape, etc.
• pose ambiguities due to blur, background clutter, apparent similarity of
parts, loose clothing, etc.
• crowding.
1.2.3 Bird Part Localization
Part localization models have achieved tremendous success in object detection
[24,53,130] and recognition [2,82] on many occasions. In particular, part models
play a remarkable role in fine-grained categorization (e.g., birds [23, 37, 104, 119,
121, 124], dogs [55, 71], butterflies [108], etc.), since parts usually contain the subtle
differences that serve as the main clues to distinguish fine-grained objects. In this
thesis, we use birds as the test case with the goal of localizing the parts across
species (see Figure 1.3). Though remarkable progress has been made on bird
part localization, this task remains a challenging problem. Major difficulties in
detecting bird parts include:
• the extreme variations in pose (e.g., walking, perching, flying, swimming,
etc.)
• large variations in appearance across species.
• part ambiguities due to some parts closely resembling each other.
• background clutter.
1.3 Convolutional Neural Network
Convolutional neural networks (CNNs, or ConvNets) are a biologically-inspired
variation of traditional multilayer perceptrons (MLPs). Unlike MLPs, CNNs
share the weights of connections between neurons. This sharing strategy can sig-
nificantly reduce the number of trainable parameters and hence increase learning
efficiency. The canonical CNN architecture, developed by Yann LeCun [49], was
first designed to recognize visual patterns in images in 1997. However, CNNs
were not widespread until 2012, when Krizhevsky et al. [47] achieved remarkable
performance on the ImageNet 2012 classification benchmark with CNNs. Since
then, CNNs have been successfully applied to a variety of applications in computer
vision. Meanwhile, recent works on more advanced and deeper architectures such
as VGG [87], Inception [38, 91–93], and ResNet [35] further foster research on
convolutional neural networks.
A convolutional neural network normally consists of three types of layers:
convolutional, pooling, and fully connected layers. The convolutional layer
aims to detect important patterns in the output of the previous layer, while the
pooling layer filters or merges these patterns to obtain more robust features. The
fully connected layer is generally used to map the convolutional features to clas-
sification scores.
Convolutional layers are the key components of CNNs. Each convolutional
layer consists of a group of learnable filters. These filters are spatially small but
extend through all the channels of the input volume. During the forward pass,
each filter is convolved with the input volume to produce a corresponding
feature map; the feature maps are then stacked along the depth dimension to
form the output volume passed to the next layer.
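To make this forward pass concrete, here is a minimal NumPy sketch of a single convolutional layer (no padding or bias, and a naive loop rather than an optimized implementation); the shapes and variable names are ours, not from the thesis.

```python
import numpy as np

def conv_layer(x, filters, stride=1):
    """Sketch of a convolutional layer's forward pass (no padding, no bias).

    x       : input volume of shape (H, W, C_in)
    filters : learnable filters of shape (K, k, k, C_in) -- K small k x k
              filters, each extending through all C_in input channels
    Returns an output volume of shape (H_out, W_out, K): one feature map
    per filter, stacked along the depth dimension.
    """
    H, W, C_in = x.shape
    K, k, _, _ = filters.shape
    H_out = (H - k) // stride + 1
    W_out = (W - k) // stride + 1
    out = np.zeros((H_out, W_out, K))
    for f in range(K):                      # each filter yields one feature map
        for i in range(H_out):
            for j in range(W_out):
                patch = x[i*stride:i*stride+k, j*stride:j*stride+k, :]
                out[i, j, f] = np.sum(patch * filters[f])
    return out

# A 3-channel 8x8 input convolved with four 3x3 filters gives a 6x6x4 volume.
x = np.random.randn(8, 8, 3)
w = np.random.randn(4, 3, 3, 3)
print(conv_layer(x, w).shape)  # (6, 6, 4)
```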
1.4 Fine-grained Visual Categorization
Fine-grained visual categorization (FGVC) refers to the task of identifying ob-
jects from subordinate categories and is now an important subfield in object
recognition. FGVC applications include, for example, recognizing species of
birds [8, 105, 110], pets [44, 71], flowers [5, 68], and cars [61, 89]. Lay individu-
als tend to find it easy to quickly distinguish basic-level categories (e.g., cars or
dogs), but identifying subordinate classes like “Ringed-billed gull” or “California
gull” can be difficult, even for bird experts. Tools that aid in this regard would
be of high practical value.
While numerous attempts have been made to boost the classification accuracy
of FGVC [11,16,21,52,107], an important aspect of the problem has yet to be ad-
dressed, namely the ability to generate a human-understandable “manual” on how
to distinguish fine-grained categories in detail. For example, ecological protection
volunteers would benefit from an algorithm that could not only accurately classify
bird species but also provide brief instructions on how to distinguish very similar
subspecies (a “Ringed-billed” and “California gull”, for instance, differ only in
their beak pattern), aided by some intuitive illustrative examples. Existing fine-
grained recognition methods that aim to provide a visual field guide mostly follow
a “part-based one-vs.-one features” (POOFs) [6–8] routine or employ human-
in-the-loop methods [12, 48, 102]. However, since the amount of available data
requiring interpretation is increasing drastically, a method that simultaneously
implements and interprets FGVC using deep learning methods [47] is now both
possible and advocated.
It is widely acknowledged that the subtle differences between fine-grained cate-
gories mostly reside in the unique properties of object parts [6,16,62,78,120,126].
Therefore, a practical solution to interpreting classification results as human-
understandable manuals is to discover classification criteria from object parts.
Some existing fine-grained datasets provide detailed part annotations including
part landmarks and attributes [61, 105]. However, they are usually associated
with a large number of object parts, which incur a heavy computational bur-
den for both part detection and classification. From this perspective, a method
that follows an object part-aware strategy to provide interpretable prediction cri-
teria at minimal computational cost while dealing with large numbers of parts is
desirable. In this scenario, independently training a large convolutional neural
network (CNN) for each part and then combining them in a unified framework is
impractical [120].
1.5 Contributions and Thesis Outline
In this thesis, we investigate three questions related to the task of keypoint local-
ization: 1) How to design an accurate and efficient CNN architecture for keypoint
localization? 2) How to utilize data without part annotations in training for
keypoint localization? 3) How to incorporate the technique of keypoint
localization into a fine-grained categorization system?
1.5.1 Contributions
• We propose the hierarchically supervised nets (HSNs) for keypoint local-
ization, a method that imposes hierarchical supervision within deep convo-
lutional neural networks (CNNs) for keypoint localization. The approach
significantly outperforms the state-of-the-art methods on both bird part de-
tection and human pose estimation.
• We present a method that learns deep representation while performing do-
main adaptation to address the problem of insufficient annotated data.
• With the technique of keypoint localization, we propose a part-stacked CNN
architecture which achieves state-of-the-art performance on the CUB200-2011
benchmark dataset.
1.5.2 Outline
The outline of the dissertation is as follows:
Chapter 2 presents the idea of using hierarchical supervisor signals within deep
convolutional neural networks (CNNs) for keypoint localization. We introduce the
HSN architecture and describe the details of each component. We also evaluate
the efficacy and generality of our method by conducting experiments on the CUB-
200-2011 bird dataset and the MSCOCO Keypoint dataset.
Chapter 3 focuses on the problem of transferring semantic parts across fine-
grained species. We propose a method that combines part detection and domain
adaptation in the same learning pipeline for keypoint localization. This chapter
first introduces the detailed design of our method. Then, to investigate how many
species of images are sufficient to learn a part detector, we perform a quantitative
evaluation on CUB200-2011. We also evaluate our method on the setting of
transferring parts across datasets.
Chapter 4 explores the effectiveness of using part localization technique in ad-
dressing the problem of fine-grained visual categorization. This chapter presents
two CNN architectures based on the idea of cropping part features for classi-
fication. We also present experimental results and a thorough analysis of the
proposed methods. Specifically, we evaluate the performance from four different
aspects: localization accuracy, classification accuracy, inference efficiency, and
model interpretation.
Chapter 2
Hierarchically Supervised Nets
for Keypoint Localization
In this chapter, we propose hierarchically supervised nets (HSNs), a method
that imposes hierarchical supervision within deep convolutional neural networks
(CNNs) for keypoint localization. Recent CNN-based keypoint localization meth-
ods supervise detectors using a confidence map generated from ground-truth key-
point locations. However, the maximum achievable localization accuracy varies
from keypoint to keypoint, as it is determined by the underlying keypoint struc-
tures. To account for this kind of diversity, we propose to supervise part detec-
tors built on hierarchical features in CNNs using hierarchical supervisor signals.
Specifically, we develop a fully convolutional Inception network composed of sev-
eral branches of coarse detectors, each of which is built on top of a feature layer in
CNNs, and a fine detector built on top of multiple feature layers. These branches
are supervised by a hierarchy of confidence maps with different levels of strictness.
All the branches of detectors are unified in a principled way to produce the final accurate
keypoint locations. We demonstrate the efficacy, efficiency, and generality of our
method on several benchmarks for multiple tasks including bird part localization
and human body pose estimation. In particular, our method achieves 72.2% AP on
the 2016 COCO Keypoints Challenge dataset, which is an 18% improvement over
the winning entry.
Figure 2.1: An illustration of the predicted keypoints from our HSN architecture. The left image contains highly accurate keypoints detected by the fine detector with strict supervision, the middle image contains keypoints from coarse detectors with loose supervision, and the right image shows the final predictions obtained by unifying the fine and coarse detectors.
2.1 Introduction
Predicting a set of semantic keypoints, such as human body joints or bird parts,
is an essential component of understanding objects in images. For example, key-
points help align objects and reveal subtle differences that are useful for han-
dling problems with small inter-class variations such as fine-grained catego-
rization [37,119,124]. Also, a key component of a human pose estimation system
is localizing the body joints [74,88,95], with which we can depict the limbs and
understand a person’s posture in images.
Despite dramatic progress over recent years, keypoint prediction remains a
significant challenge due to appearance variations, pose changes, and occlusions.
For instance, the local appearances of bird parts may differ vastly across species
or different poses (e.g . perching, flying, and walking). Localizing keypoints on
the human body must be invariant to appearance changes caused by factors like
clothing and lighting, and robust to large layout changes of parts due to articu-
lations [95]. To tackle these difficulties, early works combined handcrafted part
appearance features with an associated spatial model to capture both local
and global information [56, 73, 75, 116]. Recently, convolutional neural networks
(CNNs) [35,87,92] have significantly reshaped the conventional pipeline by replac-
ing handcrafted features and explicit spatial models with more powerful learned
hierarchical representations [67, 84, 95, 109]. The hierarchical representations in
CNNs provide a natural way to implicitly model part appearances and spa-
tial interactions between parts. Thus, considerable effort has been placed into
leveraging hierarchical features in CNNs to build a fine keypoint detector which
is expected to possess high localization accuracy [14,70].
Existing CNN-based keypoint localization methods usually supervise keypoint
detectors using a confidence map generated from ground-truth keypoint locations.
However, the maximum achievable localization accuracy differs from keypoint to
keypoint, because it is determined by the underlying keypoint structures. For
example, the keypoints with distinctive appearances, such as the shoulders and
head, can be easily detected with high accuracy, while the keypoints with am-
biguous appearance such as an occluded ankle, have much lower localization ac-
curacies. Thus, the keypoint detector often fails to detect ambiguous keypoints if
trained with strict supervision, that is, permitting only a small localization error.
Training with looser supervision could help detect the ambiguous keypoints, but
this comes at a cost to localization accuracy for those keypoints with distinctive
appearances.
In this chapter, we propose hierarchically supervised nets (HSNs), a method
that imposes hierarchical supervision within deep convolutional neural networks
(CNNs) for keypoint localization. To achieve this, we first propose a fully
convolutional Inception network [92] with several branches of varying depths to
obtain hierarchical feature representations. Then, we build a coarse part detector
on top of each branch of features and a fine part detector which takes features
from all the branches as the input.
These detectors have different localization abilities and are complementary to
each other. The shallower coarse detectors can produce accurate localizations of
keypoints with distinctive appearances; however, they often fail to detect key-
points with ambiguous appearances. The deeper branches can infer the approx-
imate locations of ambiguous keypoints but at the cost of reduced localization
accuracy for the unambiguous keypoints. Thus, we supervise these branches of
detectors using a hierarchy of confidence maps with strictness levels that are set
according to the localization abilities of the branches. By supervising the part
detectors built on hierarchical features with hierarchical supervisor signals, our
HSN fully explores the diversities of part structures and the diversities of repre-
sentations in CNNs.
Each HSN branch produces keypoints with various localization accuracies,
which are unified to produce the final keypoint locations. As shown in Figure
4.1, the finally detected keypoints include very accurate ones detected by the
fine detector and approximately accurate ones detected by the coarse detectors.
The proposed HSNs outperform state-of-the-art approaches by a large margin
on bird part localization and human pose estimation datasets.
Our main contributions include: (a) we present a strategy of using receptive
fields as candidate boxes to facilitate part localization; (b) we obtain multi-scale
feature representations by concatenating feature maps from multi-level layers with
multiple filter sizes; (c) we design a unified approach to combine the predictions
from multiple detectors; and (d) we introduce a novel framework for generality,
efficiency, and accuracy. We outperform state-of-the-art approaches by a large
margin on the datasets of bird part localization and human pose estimation.
We achieve 88% PCK0.1 and 71.0% PCK0.05, which are 3% and 12% higher
respectively than the previous best methods on the CUB200-2011 dataset. We also
achieve 72.2% mAP on the 2016 COCO Keypoints Challenge dataset which is an
18% improvement over the winning entry.
2.2 Related Works
2.2.1 Bird part detection
Bird parts play a remarkable role in fine-grained categorization, especially in
bird species recognition where parts have subtle differences. Early works focused
on developing handcrafted part appearance features (e.g ., HOG [19]) and spatial
location models (e.g . pictorial models [25]) to capture both local and global infor-
mation. For example, the deformable part model (DPM) [24] has been extended
for bird part localization by incorporating strong supervision or segmentation
masks [16, 122]. Chai et al . [16] demonstrated that DPM-based part detection
and foreground segmentation aid each other if the two tasks are performed to-
gether. However, the constraints enforced by pictorial structures are sometimes
not strong enough to combat noisy detections. To impose stronger geometric
constraints on the part configuration, Liu et al. [54] presented a nonparametric
model called exemplar that enforced pose consistency and subcategory consistency
and transformed the problem of part detection into image matching. Liu et al. [56]
built pair detectors for each part pair from part-pair representations, combining
non-parametric exemplars and parametric regression models.
More recently, methods based on convolutional neural networks (CNNs) have been
used increasingly for this task. Inspired by object proposals in object detection,
part-based R-CNN [120] extracts CNN features from bottom-up proposals and
learns whole-object and part detectors with geometric constraints. Following this
strategy, Shih et al. [84] employed the EdgeBoxes method [132] for proposal generation
and performed keypoint regression with keypoint visibility confidence. To fur-
ther improve the performance of part detection, Zhang et al. [119] introduced
K-nearest neighbors proposals generated around bounding boxes with geometric
constraints from the nearest neighbors. These methods significantly outperform
conventional approaches; however, the proposal generation and feature extrac-
tion are computationally expensive. Our approach avoids proposal generation by
adopting the fully convolutional architecture which was originally proposed for
dense prediction tasks like semantic segmentation [57].
2.2.2 Human pose estimation
Classical approaches to articulated pose estimation adopt graphical models to
explicitly model the correlations and dependencies of the body part locations
[3,20,43,72,94,116]. These models can be classified into tree-structured [3,73,90,
94], and non-tree-structured [20, 43] models. Attempts have also been made to
model complex spatial relationships implicitly based on a sequential prediction
framework which learns the inference procedure directly [73, 75].
Again, the advent of deep CNNs has recently contributed to significant im-
provements in feature representation and, in turn, in human pose
estimation [15, 17, 67, 74, 95, 97, 109, 115]. Toshev et al. [97] directly regressed
x, y joint coordinates with a convolutional network, while more recent work re-
gressed images to confidence maps generated from joint locations [15,67,95,109].
Tompson et al . [95] jointly trained a CNN and a graphical model, incorporat-
ing long-range spatial relations to remove outliers on the regressed confidence
maps. Papandreou et al . [70] proposed to use fully convolutional ResNets [35]
to predict a confidence map and an offset map simultaneously and aggregated
them to obtain accurate predictions. Other works adopted a sequential proce-
dure that refined the predicted confidence maps successively using a series of
convolutional modules [15, 67, 109]. Cao et al . [14] proposed a pose estimation
framework which adopts both explicit spatial modeling and implicit sequential
predictions. In contrast to existing approaches, our approach models the part ap-
pearance and spatial relationships using a single network with several branches to
capture multi-scale information, which is more efficient because it requires no ex-
plicit graphical model-style inference or sequential refinement. Also, we generate
the confidence maps used for supervision according to the localization capability
of each branch.
2.3 Hierarchically Supervised Nets
In this section, we introduce the HSN architecture and describe the details of
each component. We cast keypoint localization as a part detection problem, in
which a subset of parts from a set of candidate regions is selected and labeled
with part classes, e.g . “shoulder,” “ankle,” and “knee.” As illustrated in Figure
3.2, the proposed framework consists of shared base convolutional layers and two
streams of part detectors. The coarse stream consists of three coarse detector
branches, each of which takes as input only features within a specific scale range induced
by the Inception modules. The main difference in these branches is the number
of stacked inception modules, leading to different receptive field sizes. Smaller
receptive fields focus more on capturing local appearances, while larger ones are
more suitable for modeling the spatial dependencies between parts. Therefore we
concatenate feature maps from all the coarse detectors to learn a fine detector
that is expected to provide very accurate localizations. Finally, we learn the
entire network using hierarchical confidence maps, each of which has a strictness
level varying with the localization ability of the corresponding detector.
2.3.1 Network Architecture
Our detection network simultaneously predicts multiple part locations from the
input image. We implement this by following the “recognition using regions”
paradigm [34], which is widely used in object detection [77]. We predefine a set
of square boxes as part candidate regions to perform part localization and feature
extraction concurrently in a network forward pass.
Stride, receptive fields, and depth. We build the detector based on Inception-
v2 [92], a deep neural network architecture that has achieved impressive perfor-
mance in object recognition. In a convolutional network, the stride and receptive
Figure 2.2: Network architecture of the hierarchically supervised nets. The coarse stream learns three coarse detectors using hierarchical supervisions, while the fine stream learns a fine detector via strict supervision. The coarse predictions and fine predictions are then unified into the final prediction in the inference stage.
field sizes increase with depth. Thus, deeper layers encode richer contextual
information to disambiguate different parts at the cost of reduced localization
accuracy. To balance part classification and localization accuracy, we employ the
features in the Inception (4a-4c) layers to train the three coarse detectors. The
stride of the Inception (4a-4c) layers is 16, and the corresponding receptive field
sizes are 107× 107, 139× 139, and 171× 171, respectively. Given an input image
of size 224 × 224, the receptive-field sizes in deeper layers are too large for a part and
may lead to ambiguous detections for closely positioned parts. Thus we increase
the input resolution of the network to 448× 448 so that the receptive field sizes
are appropriate to enclose candidate part regions.
Candidate part regions. To avoid a sliding-window search for possible part
locations, we propose to first identify candidate part regions as done in object
detection. In object detection, the candidate object regions are obtained by
generating region proposals of various sizes and aspect ratios. However, keypoint
localization only aims to infer the central location of parts and so does not require a
bounding box that tightly encloses each part. Thus, we define the part regions
as square regions centered at the ground-truth locations, removing the need
to generate region proposals. Put another way, we assume that all parts have the
same bounding box size and use the regions enclosed by receptive fields (RFs) at
all positions in the feature map as candidate regions. For example, the size of the
Inception (4a) feature map is 28× 28, which means that there are 784 candidate
regions of size 107× 107, which are uniformly spaced on the input image.
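As an illustration, the following sketch enumerates these receptive-field candidate regions for the Inception (4a) branch; the exact padding offset p is an assumption, and the function name is ours.

```python
# Illustrative sketch: enumerating receptive fields as candidate part regions
# for the Inception (4a) branch (28x28 feature map, stride 16, RF 107x107).
# The padding offset p = 1 is an assumption, not a thesis-specified value.
def candidate_regions(fmap_size=28, stride=16, rf=107, p=1):
    regions = []
    for h in range(fmap_size):
        for w in range(fmap_size):
            # centre of the receptive field of position (w, h) on the input
            cw = w * stride - (p - 1) + rf // 2
            ch = h * stride - (p - 1) + rf // 2
            # square candidate region of size rf x rf centred at (cw, ch)
            regions.append((cw - rf // 2, ch - rf // 2, rf, rf))
    return regions

regions = candidate_regions()
print(len(regions))  # 784 candidate regions, uniformly spaced on the input
```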
Feature representation. Using regions enclosed by receptive fields as candidate
part regions simplifies the feature extraction for part detectors. In the proposed
fully convolutional network, the cross-channel vector at a spatial position in the
feature map is used as a feature for the candidate part region associated with
that position. This strategy is efficient as it does not require RoI pooling from
bounding-box features as done in object detection. Also, the fine detector relies
on multi-scale representations obtained by fusing multiple feature layers, each of which is
processed by multiple filter sizes through Inception modules. As shown in
Figure 2.3, there are three popular types of methods for obtaining
multi-scale representations. The first type of
methods (Figure 2.3 (a)) resize the images to multiple resolutions and extract the
pyramid features. The second type of methods, as illustrated in Figure 2.3 (b),
adopt different sizes of convolutional filters. For example, GoogLeNet [92] learns
multiple filters (such as 1 × 1, 3 × 3, and 5 × 5) and concatenates their feature
maps. The last type upsamples the feature maps from higher layers to fit the size
of intermediate feature maps, and then all feature maps from different layers can
be concatenated to form the multi-scale representation. In contrast, our method
(Figure 2.3 (d)) obtains multi-scale representations by fusing multiple layers each
of which is processed by multiple filter sizes. Specifically, we stack feature maps
from consecutive Inception layers with no downsampling, which allows concate-
nating features from different layers without using upsampling techniques such
as the deconvolutional network. Therefore, the fine detector in our network can
model the appearance of the object parts by features from a large number of
scales.
Hierarchical supervisions. To fully explore the diversities of hierarchical rep-
resentations in CNNs, we simultaneously learn all detectors using the hierarchical
supervisions. As shown in Figure 2.3, each detector has its own appropriate su-
pervision generated according to receptive field size. Specifically, we generate con-
fidence maps for a detector by calculating the intersections between the candidate
part regions and the ground truth part regions. Let $K_c = \{1, \ldots, K\}$ be the set of
part classes, and $D$ denote the number of coarse detector branches. Given an out-
put feature map in the $d$-th branch with size $W \times H$, stride $s$, offset padding $p$, and
receptive field size $r$, each location $(w, h)$ in the output feature map corresponds
to a receptive field $rf(w, h)$ centered at position $(w^*, h^*) = (w, h) \cdot s - (p - 1) + r/2$
in the input image. For an annotated keypoint location $(i, j)$ with class $k \in K_c$,
we define a ground truth region $gt_k(i, j)$ with size $r \times r$ centered at $(i, j)$. To con-
struct a target response map $Y^d$ for the $d$-th detector branch, we set $Y^d(w, h) = k$
if the candidate region $rf(w, h)$ has an Intersection-over-Union (IoU) higher than
0.5 with the ground truth region $gt_k(i, j)$, and set $Y^d(w, h) = 0$ to classify it as
background otherwise. For the fine detector, we generate a strict supervision
map by setting $Y^f(w, h) = k$ if $\| (w^*, h^*) - (i_k, j_k) \|_2 \le 0.05 \cdot ref\_length$, and
$Y^f(w, h) = 0$ otherwise, where $ref\_length$ is the longer side of the object bounding
box. The confidence map hierarchy generated for the detector branches enables
Figure 2.3: Different methods for obtaining multi-scale representations. (a) Input images at multiple resolutions. (b) Using different sizes of convolutional filters. (c) Concatenation of different resolutions of feature maps. (d) Concatenation of feature maps from different layers, each of which has multiple convolutional filters.
Table 2.1: Comparison with methods that report per-part PCK (%) and average PCK (%) on CUB200-2011. The abbreviated part names from left to right are: Back, Beak, Belly, Breast, Crown, Forehead, Left Eye, Left Leg, Left Wing, Nape, Right Eye, Right Leg, Right Wing, Tail, and Throat.

α     Method  Ba   Bk   Be   Br   Cr   Fh   Le   Ll   Lw   Na   Re   Rl   Rw   Ta   Th   Mean
0.1   [124]   85.6 94.9 81.9 84.5 94.8 96.0 95.7 64.6 67.8 90.7 93.8 64.9 69.3 74.7 94.5 83.6
0.1   Ours    88.3 94.5 87.3 91.0 93.0 92.7 93.7 76.9 80.5 93.2 94.0 81.2 79.2 79.7 95.1 88.0
0.05  [124]   46.8 62.5 40.7 45.1 59.8 63.7 66.3 33.7 31.7 54.3 63.8 36.2 33.3 39.6 56.9 49.0
0.05  [118]   66.4 49.2 56.4 60.4 61.0 60.0 66.9 32.3 35.8 53.1 66.3 35.0 37.1 40.9 65.9 52.4
0.05  Ours    64.1 87.9 57.9 65.8 80.9 83.9 90.3 58.0 50.9 79.4 89.6 62.6 51.0 57.9 84.9 70.9
0.02  [124]   9.4  12.7 8.2  9.8  12.2 13.2 11.3 7.8  6.7  11.5 12.5 7.3  6.2  8.2  11.8 9.9
0.02  [118]   18.6 11.5 13.4 14.8 15.3 14.1 20.2 6.4  8.5  12.3 18.4 7.2  8.5  8.6  17.9 13.0
0.02  Ours    19.6 40.7 15.7 19.0 33.1 36.0 47.8 20.1 13.1 28.9 47.1 20.9 14.4 18.3 34.1 27.3
Table 2.2: Comparison of per-part PCP (%) and overall PCP (%) on CUB200-2011. The abbreviated part names from left to right are: Back, Beak, Belly, Breast, Crown, Forehead, Eye, Leg, Wing, Nape, Tail, and Throat.

Method       Ba   Bk   Be   Br   Cr   Fh   Ey   Le   Wi   Na   Ta   Th   Total
[54]         62.1 49.0 69.0 67.0 72.9 58.5 55.7 40.7 71.6 70.8 40.2 70.8 59.7
[56]         64.5 61.2 71.7 70.5 76.8 72.0 70.0 45.0 74.4 79.3 46.2 80.0 66.7
[84]         74.9 51.8 81.8 77.8 77.7 67.5 61.3 52.9 81.3 76.1 59.2 78.7 69.1
Ours(final)  82.2 57.4 81.3 80.3 75.6 63.0 62.5 70.8 70.8 81.1 59.7 73.5 72.1
detection of keypoints at various localization accuracy levels.
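The construction of the loose coarse maps and the strict fine map above can be sketched as follows; this is a simplified single-keypoint-per-class version, and the grid size, stride, receptive field, and `ref_length` values are illustrative assumptions.

```python
import numpy as np

def iou(boxA, boxB):
    """IoU of two (x, y, size) square boxes."""
    ax, ay, a = boxA
    bx, by, b = boxB
    ix = max(0, min(ax + a, bx + b) - max(ax, bx))
    iy = max(0, min(ay + a, by + b) - max(ay, by))
    inter = ix * iy
    return inter / (a * a + b * b - inter)

def target_maps(keypoints, fmap=28, stride=16, rf=107, ref_length=300.0):
    """Sketch of the hierarchical supervisor signals (names are ours).

    keypoints : dict {class k: (i, j)} of annotated keypoint locations.
    Returns a loose coarse target map Y^d (IoU > 0.5 with the r x r
    ground-truth region) and a strict fine target map Y^f (receptive-field
    centre within 0.05 * ref_length of the keypoint); 0 means background.
    """
    Yd = np.zeros((fmap, fmap), dtype=int)
    Yf = np.zeros((fmap, fmap), dtype=int)
    for h in range(fmap):
        for w in range(fmap):
            cw, ch = w * stride + rf // 2, h * stride + rf // 2  # RF centre
            cand = (cw - rf // 2, ch - rf // 2, rf)
            for k, (i, j) in keypoints.items():
                if iou(cand, (i - rf // 2, j - rf // 2, rf)) > 0.5:
                    Yd[h, w] = k
                if np.hypot(cw - i, ch - j) <= 0.05 * ref_length:
                    Yf[h, w] = k
    return Yd, Yf
```

The strict map labels far fewer positions as foreground than the loose one, which is exactly the intended strictness hierarchy.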
2.3.2 Learning and Inference
We build diversified part detectors using fully convolutional architectures with
different depths and supervisions. For efficient inference, we simultaneously learn
all the detection networks with shared base convolutional layers by minimizing a
multi-task loss.
Learning. Let $\sigma^d = \varphi(X, W, \Phi^d, \Phi^d_{cls})$ be the last feature maps of size $W \times H \times C$
in the $d$-th detector branch given input image $X$, shared weights $W$, unshared
weights $\Phi^d$ in the feature layers, and unshared weights $\Phi^d_{cls}$ in the classifier layer,
respectively. We add one more channel to model the background class and
thereby $C = |K_c| + 1$. We use the hierarchical confidence maps described in
Figure 2.3 as supervisions. Here, we compute the prediction score at the position
Figure 2.4: An illustration of
(w, h, k) in the last feature maps using the softmax function:
$$\mathrm{Pro}^d_{(w,h,k)} = \frac{\exp\!\left(\sigma^d_{(w,h,k)}\right)}{\sum_{k' \in \{0, \ldots, |K_c|\}} \exp\!\left(\sigma^d_{(w,h,k')}\right)}. \qquad (2.1)$$
Therefore, the loss function on a training image for each branch is defined as
below:

$$\ell(X, W, \Phi^d, \Phi^d_{cls}, Y^d) = -\frac{1}{W \times H} \sum_{w=0}^{W-1} \sum_{h=0}^{H-1} \sum_{k=0}^{|K_c|} \mathbf{1}\{Y^d(w, h) = k\} \log\!\left(\mathrm{Pro}^d_{(w,h,k)}\right). \qquad (2.2)$$
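A minimal NumPy sketch of Eqns. 2.1 and 2.2 for one branch follows (naive loops and a numerically stabilized softmax; function and variable names are ours):

```python
import numpy as np

def branch_loss(sigma, Y):
    """Sketch of Eqns. 2.1-2.2: softmax over channels, then the average
    cross-entropy over all W x H positions of one detector branch.

    sigma : last feature maps of a branch, shape (W, H, C) with
            C = |Kc| + 1 (part classes plus background channel 0)
    Y     : target map of shape (W, H) with entries in {0, ..., |Kc|}
    """
    e = np.exp(sigma - sigma.max(axis=2, keepdims=True))  # stable softmax
    pro = e / e.sum(axis=2, keepdims=True)                # Eqn. 2.1
    W, H, _ = sigma.shape
    loss = 0.0
    for w in range(W):
        for h in range(H):
            loss -= np.log(pro[w, h, Y[w, h]])            # Eqn. 2.2 summand
    return loss / (W * H)
```

The multi-task objective of Eqn. 2.3 is then just the sum of `branch_loss` over the coarse branches plus the fine branch.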
The loss function $\ell(X, W, \Phi^f, \Phi^f_{cls}, Y^f)$ for the fine detector is defined similarly
to Eqn. 2.2. Then we use a multi-task loss to train all the coarse detectors and
the fine detector jointly:

$$\mathcal{L}(\Omega, Y) = \sum_{d=1}^{D} \ell(X, W, \Phi^d, \Phi^d_{cls}, Y^d) + \ell(X, W, \Phi^f, \Phi^f_{cls}, Y^f), \qquad (2.3)$$

where $\Omega = \{W, \{\Phi^d, \Phi^d_{cls}\}_{d=1}^{D}, \Phi^f_{cls}\}$, $\Phi^f = \{\Phi^d\}_{d=1}^{D}$, and $Y = \{\{Y^d\}_{d=1}^{D}, Y^f\}$.

Inference. For each detector in the inference stage, we obtain the prediction
scores for all candidate regions through Eqn. 2.1. Then we compute the prediction
map $O^d$ for each part as follows:
$$O^d(w, h, k^*) = \begin{cases} 1 & \text{if } \arg\max_{k} \mathrm{Pro}^d_{(w,h,k)} = k^* \\ 0 & \text{otherwise.} \end{cases} \qquad (2.4)$$
As we use loose supervision for each detector, the results Od have multiple
predicted locations for each part. According to the overlapping receptive field
mechanisms in CNNs, the most precise prediction is around the center of the
predicted locations. Therefore, we obtain a “blur” prediction by convolving the
prediction maps with a 2D Gaussian kernel G and select the location with the
maximum value in the $k$-th channel as the unique prediction $(w^*_k, h^*_k)$ for the $k$-
Figure 2.5: Bird part detection results with occlusion, viewpoint changes, cluttered background, and pose variations from the test set; rows show ground truth and our predictions.
th part. Meanwhile, considering some object parts may be invisible, we set a
threshold θ that controls if the predicted location is a part or background pixel.
Let $g(:,:,k) = O^d(:,:,k^*) * G$; the inferred part locations are then given as:

$$(w^*_k, h^*_k) = \begin{cases} \arg\max_{w,h}\, g(w, h, k) & \text{if } \mathrm{Pro}^d_{(w^*,h^*,k)} > \theta, \\ (-1, -1) & \text{otherwise,} \end{cases} \qquad (2.5)$$

where $\mathrm{Pro}^d_{(w^*,h^*,k)} = \Pr(Y^d(w^*, h^*) = k \mid \sigma^d_{(w^*,h^*)})$.
Unified detection. Our system learns four detectors simultaneously and uni-
fies their outputs into the final prediction. The detectors vary in their ability
to detect the object parts. The fine detector tends to output accurate and reli-
able predictions since it receives stacked features from multiple layers. However,
we observe that it may miss predictions of some occluded parts, which can be
detected by the coarse detectors. To predict a set of parts precisely and as com-
pletely as possible, we combine the outputs from the coarse and fine detectors by
using the strategy that the former serve as assistant predictors for the latter.
Let $(w^*_k, h^*_k)^d$ be the $k$-th part prediction with score $\mathrm{Pro}^d_{(w^*,h^*,k)}$ from
the $d$-th coarse part detector, and $(w^*_k, h^*_k)^f$ be the $k$-th part prediction with score
$\mathrm{Pro}^f_{(w^*,h^*,k)}$ from the fine part detector. Then we obtain the unified detection
using the equation below:

$$(w^{**}_k, h^{**}_k) = \begin{cases} (w^*_k, h^*_k)^f & \text{if } \mathrm{Pro}^f_{(w^*,h^*,k)} \ge \mu, \\ (w^*_k, h^*_k)^{d^*} & \text{otherwise,} \end{cases} \qquad (2.6)$$

where $d^* = \arg\max_d \mathrm{Pro}^d_{(w^*,h^*,k)}$, and $\mu \in [0, 1]$ is a threshold that controls how much
the coarse and fine detectors contribute to the prediction. If μ = 0, only the fine
detector is used for detection, but when μ = 1, the final output is determined by
the coarse detectors.
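Per part, Eqn. 2.6 reduces to a simple fallback rule, sketched below with an assumed μ = 0.5 (the thesis does not specify the value here):

```python
def unify(coarse_preds, fine_pred, mu=0.5):
    """Sketch of Eqn. 2.6 for a single part k (mu = 0.5 is our assumption).

    coarse_preds : list of ((w, h), score) pairs from the D coarse detectors
    fine_pred    : ((w, h), score) from the fine detector
    """
    (wf, hf), sf = fine_pred
    if sf >= mu:                      # trust the fine detector when confident
        return (wf, hf)
    # otherwise fall back to the most confident coarse detector (d*)
    (wd, hd), _ = max(coarse_preds, key=lambda p: p[1])
    return (wd, hd)

print(unify([((3, 4), 0.7), ((5, 6), 0.9)], ((1, 2), 0.2)))  # (5, 6)
```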
2.4 Experiments
To evaluate the efficacy and generality of our method, we conduct experiments
on the CUB-200-2011 bird dataset [105] and the MSCOCO Keypoint dataset [51]
Table 2.3: Performance comparison between using strict supervision only and hierarchical supervision.

α     Method      4a(%)  4b(%)  4c(%)  Fine(%)  Unified(%)
0.1   Str-super   66.1   59.6   79.9   80.8     83.7
0.1   Hier-super  79.2   84.9   82.0   80.8     88.0
0.05  Str-super   55.6   49.1   66.6   67.4     69.3
0.05  Hier-super  60.6   59.8   52.4   67.6     71.0
0.02  Str-super   22.5   18.8   26.5   26.8     27.3
0.02  Hier-super  20.9   18.3   14.2   26.5     27.3
Table 2.4: Results on the COCO keypoint test-dev and test-standard splits.

Test-Dev
Method                AP     AP.50  AP.75  AP(M)  AP(L)  AR     AR.50  AR.75  AR(M)  AR(L)
CMU-Pose [14]         0.618  0.849  0.675  0.571  0.682  0.665  0.872  0.718  0.606  0.746
G-RMI [70]            0.605  0.822  0.662  0.576  0.666  0.662  0.866  0.714  0.619  0.722
G-RMI(ext & ens) [70] 0.668  0.863  0.734  0.630  0.733  0.716  0.896  0.776  0.669  0.782
DL-61                 0.544  0.753  0.509  0.583  0.543  0.708  0.827  0.692  0.753  0.768
R4D6                  0.514  0.750  0.559  0.474  0.567  0.563  0.770  0.610  0.499  0.649
umich vl              0.460  0.746  0.484  0.388  0.556  0.518  0.771  0.546  0.407  0.669
belagian              0.419  0.617  0.452  0.300  0.580  0.454  0.630  0.489  0.316  0.639
HSNs(ours)            0.726  0.861  0.697  0.783  0.641  0.892  0.944  0.880  0.940  0.872

Test-Std
Method                AP     AP.50  AP.75  AP(M)  AP(L)  AR     AR.50  AR.75  AR(M)  AR(L)
CMU-Pose [14]         0.611  0.844  0.667  0.558  0.684  0.665  0.872  0.718  0.602  0.749
G-RMI [70]            0.603  0.813  0.656  0.565  0.674  0.666  0.866  0.717  0.620  0.729
G-RMI(ext & ens) [70] 0.658  0.851  0.723  0.629  0.713  0.717  0.895  0.778  0.662  0.792
DL-61                 0.536  0.756  0.490  0.561  0.542  0.712  0.832  0.694  0.750  0.774
R4D6                  0.505  0.745  0.554  0.466  0.563  0.563  0.778  0.612  0.499  0.648
umich vl              0.438  0.730  0.453  0.364  0.537  0.503  0.762  0.524  0.390  0.652
belagian1             0.410  0.607  0.446  0.284  0.576  0.447  0.628  0.485  0.304  0.635
HSNs(ours)            0.722  0.857  0.688  0.786  0.637  0.878  0.936  0.865  0.930  0.863
Our approach significantly exceeds the state-of-the-art methods on both
tasks.
2.4.1 Bird Part Localization
The CUB200-2011 [105] is a widely used dataset for bird part localization. It
contains 200 bird categories and 11, 788 images with roughly 30 training images
per category. Each image has a bounding box and 15 key-point annotations.
To evaluate the localization performance, early approaches [54, 56, 84] mainly
used the percentage of correct parts (PCP) measure, in which a correct part location
should be within 1.5 standard deviations of MTurk workers' clicks from the
ground-truth part locations. Recent methods [118, 124] on this task have
used percentage of correctly localized keypoints (PCK) as the evaluation metric.
According to the PCK criterion used in [124], given an annotated bounding box
of size (w, h), a predicted location is correct if it lies within α × max(h, w) of
the ground-truth keypoint. Here we adopt both the PCP and PCK criteria and
compare our results to the reported performance of the state-of-the-art methods.
We present the PCP results for each part as well as the total PCP results in Table
2.2. Compared to the methods that report PCP results, our method improves
the overall PCP over the second best approach by about 4.3%. Notably, although
previous methods perform poorly on the ’leg’ and ’back’ part detection,
our method achieves up to 33.8% and 9.8% improvements for these two parts over
the next best method. We also report per-part PCK and mean PCK results
compared with other methods for α ∈ {0.1, 0.05, 0.02} in Table 2.1. Here, a
smaller α means a stricter error tolerance in the PCK metric. Our method
outperforms existing techniques at various α settings, which demonstrates that
our approach produces more accurate predictions with higher recall for keypoint
localization. Also, the most striking result is that our approach obtains a 35%
improvement over the second best method using the strict PCK metric. Figure
2.5 shows some results on the CUB200-2011 testing set.
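The PCK criterion above can be computed in a few lines of NumPy; the helper below is an illustrative sketch with hypothetical argument names, not the evaluation code used in the thesis:

```python
import numpy as np

def pck(pred, gt, visible, bbox_wh, alpha=0.1):
    """Percentage of correctly localized keypoints (PCK).

    A predicted keypoint counts as correct if it lies within
    alpha * max(w, h) of the ground-truth location, where (w, h)
    is the size of the annotated bounding box.

    pred, gt: (N, K, 2) arrays of (x, y) keypoint locations.
    visible:  (N, K) boolean mask of annotated keypoints.
    bbox_wh:  (N, 2) bounding-box sizes (w, h) per image.
    """
    thresh = alpha * bbox_wh.max(axis=1)           # (N,) per-image tolerance
    dist = np.linalg.norm(pred - gt, axis=-1)      # (N, K) Euclidean errors
    correct = (dist <= thresh[:, None]) & visible
    return correct.sum() / visible.sum()
```

Shrinking alpha from 0.1 toward 0.02 tightens the tolerance, which is why the numbers in Table 2.3 fall as alpha decreases.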
In order to further understand the performance gains provided by our net-
work structure, we also provide intermediate results of using the strict supervi-
sions and the hierarchical supervision.

Figure 2.6: Pose estimation results with occlusion, crowding, deformation, and low resolution from the COCO test set.

As shown in Table 2.3, using hierarchical
supervisions to learn the parallel convolutional network achieves better perfor-
mance than using the strict supervision alone. This is mainly because imposing
appropriate supervision can significantly improve the accuracy of the coarse de-
tectors, thereby enhancing the performance of the unified detection. Moreover, the
performance gain gradually diminishes as α decreases, because coarse detectors
fail to predict very accurate locations and contribute less to the final predictions.
2.4.2 Human Pose Estimation
The MSCOCO Keypoint dataset consists of 100k people with over 1 million total
annotated keypoints for training and 50k people for validation. The testing set is
unreleased and includes three subsets, “test-challenge,” “test-dev,” and
“test-standard,” each containing about 20k images. The MSCOCO evaluation defines
the object keypoint similarity (OKS) and uses AP (averaged across all 10 OKS
thresholds) as the main metric to evaluate keypoint performance.
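For reference, OKS compares a predicted pose against a ground-truth pose via a per-keypoint Gaussian falloff, following the COCO definition. The sketch below uses hypothetical argument names (`area` for the object area s², `kappa` for the per-keypoint falloff constants):

```python
import numpy as np

def oks(pred, gt, visible, area, kappa):
    """Object keypoint similarity (OKS), COCO-style.

    OKS = sum_i exp(-d_i^2 / (2 s^2 k_i^2)) [v_i > 0] / sum_i [v_i > 0],
    where d_i is the Euclidean distance between the i-th predicted and
    ground-truth keypoints, s^2 is the object area, and k_i is a
    per-keypoint constant controlling the falloff.

    pred, gt: (K, 2) keypoint arrays; visible: (K,) boolean mask;
    area: scalar object area; kappa: (K,) falloff constants.
    """
    d2 = ((pred - gt) ** 2).sum(axis=-1)          # squared distances d_i^2
    e = np.exp(-d2 / (2.0 * area * kappa ** 2))   # per-keypoint similarity
    return e[visible].sum() / visible.sum()
```

AP then averages precision over OKS thresholds from 0.50 to 0.95 in steps of 0.05, which is why high overall AP requires keypoints that are accurate at the strictest thresholds, not merely roughly localized.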
Implementation details. To address the problem of multi-person pose estima-
tion, we adopt the Faster R-CNN framework [77] with a pre-trained model1 on
the MSCOCO dataset to obtain the person bounding boxes. We first crop out
all person instances and resize the long side of each image to 512 pixels while
maintaining its aspect ratio. We pad each resized image with zero pixels and
form a training example of size 512×512. Then we randomly crop the image into
448×448 as the input of the hierarchical supervised nets. We train our model for
300k iterations using SGD with a momentum of 0.9, a batch size of 16, and an
initial learning rate of 0.001 with step decay 100k. We initialize network weights
with a pre-trained model on ImageNet which is available online 2.
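The image preprocessing described above (resize the long side to 512 pixels, zero-pad to a square, then randomly crop 448×448) can be sketched as follows. This is an illustrative NumPy version with nearest-neighbour resizing and a hypothetical helper name, not the actual training code:

```python
import numpy as np

def make_training_example(img, resize_to=512, crop_to=448, rng=None):
    """Sketch of the preprocessing pipeline described above.

    1. Resize the long side to `resize_to` pixels, keeping the aspect ratio.
    2. Zero-pad to a `resize_to` x `resize_to` square.
    3. Randomly crop a `crop_to` x `crop_to` patch as network input.
    """
    rng = rng or np.random.default_rng()
    h, w = img.shape[:2]
    scale = resize_to / max(h, w)
    nh, nw = int(round(h * scale)), int(round(w * scale))
    # Nearest-neighbour resize to stay dependency-free.
    ys = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
    xs = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
    resized = img[ys][:, xs]
    # Zero-pad to a square training canvas.
    padded = np.zeros((resize_to, resize_to) + img.shape[2:], img.dtype)
    padded[:nh, :nw] = resized
    # Random crop as the network input.
    y0 = rng.integers(0, resize_to - crop_to + 1)
    x0 = rng.integers(0, resize_to - crop_to + 1)
    return padded[y0:y0 + crop_to, x0:x0 + crop_to]
```

In a real pipeline the keypoint annotations would be scaled and shifted by the same transform; that bookkeeping is omitted here.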
Results. We evaluate our methods on the“test-dev” and “test-standard” and
obtain the evaluation results on 10 metrics 3 from the online server 4. We com-
pare our keypoint performance with the results from top teams at the MSCOCO
Keypoint Challenge 2016. As can be seen from the Table 2.4, our performance sig-
nificantly surpasses other methods for most of the ten metrics. Most remarkably,
1 https://github.com/rbgirshick/py-faster-rcnn
2 https://github.com/lim0606/caffe-googlenet-bn
3 http://mscoco.org/dataset/#keypoints-eval
4 https://competitions.codalab.org/competitions/12061
on the “test-standard” split, we achieve 0.722 AP, an 18% improvement over the
winning team. Furthermore, we achieve comparable results to the method of [70] for
the AP^{OKS=0.50} and AP^{OKS=0.75} metrics. [70] uses extra data and ensemble models,
while our model is trained on provided data only and outperforms this method
by a large margin on the overall AP metric. Notice that the overall AP is the
average AP across all 10 OKS thresholds. Therefore, the significant performance
improvements for the overall AP and AR again demonstrate that our method has
a strong ability to predict accurate keypoint localizations with high recalls. Fig-
ure 2.6 shows some pose estimation results on the MSCOCO testing set. It is also
worth noting that our Caffe [40] implementation of HSN runs at 48 frames/sec
on a TitanX GPU in the inference stage. Our method allows for real-time human
pose estimation together with a fast person detector.
2.5 Conclusion
In this chapter, we have proposed a hierarchical supervised convolutional network
for keypoint localization on birds and humans. Our method fully explores hier-
archical representations in CNNs by constructing a series of part detectors which
are trained using hierarchical supervision. The hierarchical supervision provides
supervision according to the localization ability of the detectors. The outputs of
all the part detectors are unified in a principled manner to deliver promising performance for
both bird part localization and human pose estimation. In the future, we will go
on to investigate how to incorporate features to generate hierarchical supervisions
and extend our framework to other challenging tasks.
Chapter 3
Transferring Part Locations
Across Fine-grained Categories
The previous chapter showed that using hierarchical supervision within a deep
convolutional neural network can significantly improve the performance of keypoint
localization. In this chapter, we focus on the problem of training a part detector
with insufficient annotation data. We address this problem by incorporating domain
adaptation techniques into deep representation learning. We adopt one of the
coarse detectors from HSNs as the baseline and perform a quantitative evaluation
on the CUB200-2011 and BirdSnap datasets. Interestingly, our method trained on
images of only 10 species achieves 61.4% PCK accuracy on the testing set of 190
unseen species.
3.1 Introduction
One of the biggest catalysts for the success of deep learning is the public avail-
ability of massive labeled data. For example, much of the recent progress in
image understanding can be attributed to the presence of large-scale datasets,
such as Imagenet [79] and COCO [51]. Nevertheless, label annotation is a te-
dious and time-consuming process that requires considerable human effort, especially
for the keypoint localization task, which needs pixel-level annotation. For instance,
the COCO training set for human pose estimation consists of over 100k person
instances and over 1 million labeled keypoints (body joints, e.g. eye, shoulder,
and ankle) in total. Recent successes of part-based methods for species recogni-
tion show keypoint annotation have become increasingly important to the task
of fine-grained visual categorization. However, collecting image data with key-
point annotations is harder than with image labels. One may collect images from
Flickr or Google images by searching keywords and then perform refinement pro-
cesses to build a classification dataset, while keypoint annotation requires human
to click the rough location of the keypoint for each image. Also, the local ap-
pearance around the keypoints accounts for the main differences between species.
Therefore, these raise an interesting question: How many species with keypoint
annotation is sufficient?
Recent works address the problem of insufficient annotations using active
learning algorithms that interactively select the most informative samples from
the unlabeled data. These methods have to re-train the model multiple times,
thereby incurring a high computational cost. Departing from the standard domain
adaptation setting, [69] proposes an auto-validation procedure to perform part
transfer learning. This kind of approach first splits the source data into multiple
domains to characterize the domain shift and then trains a part detector on these
subsets for generalizability, but it does not take full advantage of the information
from the target domain.
In this section, we focus on the problem of part transfer across species (as
illustrated in Figure 3.1) and propose a novel method that aims to learn a
”universal” detector with transferability. Unlike previous works on transferring
part locations that extract fixed feature representations for domain adaptation,
we follow the idea of deep domain adaptation [27] and combine deep representation
learning and domain adaptation within the same training process. We implement
this by imposing a part classifier and a domain classifier on top of a fully
convolutional neural network (FCN) [57]. To learn feature representations that
are discriminative to object parts but invariant to the domain shift, we train the
network by minimizing the loss of the part classifier and maximizing the loss of
domain classifier. The former enforces the network to learn discriminative fea-
tures, while the latter encourages learning features invariant to the change of
domain.
Figure 3.1: Illustration of the research problem. The source domain contains part annotations, while parts are not annotated in the target domain. Also, the target domain contains species which do not exist in the source domain.
The main contributions of this section are: 1) we propose a novel method for
transferring part knowledge to unseen species; 2) we conduct a thorough analysis
to investigate the transferability of models trained on varying numbers of
species; and 3) we provide insights into how many annotated species may be needed
to perform well on unseen species.
3.2 Related Work
3.2.1 Part Detection.
Recent methods for part detection can be categorized into three groups: strongly
supervised, semi-supervised, and unsupervised. The first group directly learns a
strong detector by minimizing the localization error on the training set. Shih et
al. [84] employed the EdgeBoxes method [132] for proposal generation and performed
keypoint regression with keypoint visibility confidence. To further improve part
detection performance, Zhang et al. [119] introduced K-nearest-neighbor proposals
generated around bounding boxes with geometric constraints from the nearest
neighbors. These methods significantly outperform conventional DPM-based
approaches [16, 24, 122]. Many methods proposed for pose estimation also belong
to this group. For example, Wei et al. [109] adopted a sequential prediction
procedure that refines belief maps from previous stages by incorporating
larger-scale information through several training stages. Newell et al. [67]
proposed the hourglass network structure, which processes convolutional features
across all layers in a CNN to predict keypoint locations. However, these methods
tend to overfit the training data and may have difficulty generalizing.
The second category explores semi-supervised training regimes to improve the
generalization accuracy of supervised learning approaches. Classic examples
include leveraging both a strongly-supervised deformable part model (DPM) [24]
and a weakly-supervised DPM to facilitate part localization [122], and refining
the part detector using web images [113]. The last group employs the unsupervised
scenario to find object parts without the need for any part annotation. Xiao et
al. [111] cluster the channels of the last convolutional feature maps into groups
whose responses are strongly related to the part locations. Similarly, Simon et
al. [86] learn a part model using the activation patterns of feature maps but
with constellation constraints. Approaches belonging to this group focus on
learning discriminative parts and may fail to address the problem of semantic
part localization.
3.2.2 Domain Adaptation and Active Learning
Our work also relates to domain adaptation (DA), which learns a classifier from
labeled data for unseen data by aligning the feature distributions of the source
and target domains. Typical methods used in visual applications comprise learning
feature transformations [26, 30, 33] and adapting parameters [33, 114]. Recent
methods perform domain adaptation while learning deep representations: [99]
models the domain shift in the last layer of convolutional networks, and more
recently, [27, 98] train the entire network with an auxiliary classifier to learn
feature representations invariant to domain change. This approach performs very
well in image recognition. In this chapter, we adapt a similar idea to transfer
part localization between different domains. The most significant difference
between [27] and our method is that our method focuses on transferring local
knowledge (part locations) by matching feature distributions from different
domains, while [27] addresses the problem of transferring global knowledge
(object labels).
Active-learning-based algorithms aim to interactively select the most informative
samples from the unlabeled data. Research in this area therefore focuses on
designing data selection strategies using entropy [96], diversity [36], and
representativeness [41]. Nevertheless, the model needs to be retrained after each
data selection, which incurs a high computational cost in training.
3.3 Our Approach
3.3.1 Model Formulation
Figure 3.2: The proposed architecture consists of three components: a feature
extractor (yellow), a part classifier, and a domain classifier (blue). All these
components share computation in a feed-forward pass. The feature extractor outputs
a feature representation that serves as the input to the other components. The
part classifier is designed to find the part locations, while the domain
classifier is added to handle the domain shift between the source and target
domains. Note that the backpropagation gradients that pass from the domain
classifier to the feature extractor are multiplied by a negative constant during
backpropagation.

As illustrated in Figure 3.2, we implement the proposed method using a deep
convolutional network architecture. The overall architecture consists of three
sub-networks, which are used for feature extraction, part classification, and domain
classification respectively.
The key idea of our approach is to minimize the localization error in the
source training set while reducing the distribution variance between the source
and target domain.
Let X_S and X_T be the training sets from the source and target domains
respectively, K_c = {1, . . . , K} be the set of part classes, and Y_d ∈ {0, 1} be
the domain label. Given an input image x, we define Y_d = 0 if x ∈ X_S and
Y_d = 1 if x ∈ X_T. Given an output feature map of size W × H × C, stride s,
offset padding o, and receptive field size r, we now generate the part label map
Y_p of size W × H by calculating the intersections between the candidate part
regions and the ground-truth part regions. Here each location (w, h) in the
output feature map corresponds to a receptive field rf(w, h) centered at position
(w∗, h∗) = (w, h) · s − (o − 1) + r/2 in the input image. We add one more channel
to model the background class, so that C = |K_c| + 1. We then define a
ground-truth region gt_k(i, j) of size r × r centered at the annotated keypoint
location (i, j) with class k ∈ K_c. Finally, each part label map Y_p is generated
by setting Y_p(w, h) = k if the candidate region rf(w, h) has an
Intersection-over-Union (IoU) higher than 0.5 with the ground-truth region
gt_k(i, j), and setting Y_p(w, h) = 0 (background) otherwise.
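The label-map construction above can be sketched as follows; the helper name and the closed-form IoU of two axis-aligned r × r squares are our own illustration:

```python
import numpy as np

def part_label_map(keypoints, W, H, stride, offset, rf):
    """Generate a part label map Y_p over a W x H feature map.

    keypoints: list of (i, j, k) annotated keypoint locations with class k >= 1.
    Each feature-map location (w, h) maps to a receptive field of size rf
    centred at (w, h) * stride - (offset - 1) + rf / 2 in the input image.
    A location is labelled k if its receptive field overlaps the rf x rf
    ground-truth region around a class-k keypoint with IoU > 0.5, and 0
    (background) otherwise.
    """
    Yp = np.zeros((W, H), dtype=int)
    for w in range(W):
        for h in range(H):
            cx = w * stride - (offset - 1) + rf / 2.0
            cy = h * stride - (offset - 1) + rf / 2.0
            for (i, j, k) in keypoints:
                # Overlap of two axis-aligned rf x rf squares.
                ix = max(0.0, rf - abs(cx - i))
                iy = max(0.0, rf - abs(cy - j))
                inter = ix * iy
                iou = inter / (2 * rf * rf - inter)
                if iou > 0.5:
                    Yp[w, h] = k
    return Yp
```

Each keypoint thus labels a small neighbourhood of feature-map locations rather than a single cell, which gives the part classifier denser supervision.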
We now define the loss function for the part classifier. Let σ = φ_f(x, θ_f)
denote the output feature maps of the feature extractor given input image x and
parameters θ_f, where φ_f is the mapping function, and let φ_p(σ, θ_p) denote the
part classifier mapping with parameters θ_p. The prediction score
Pro^p_(w,h,k)(x, θ_f, θ_p) for the kth class at each position (w, h) is then
computed as follows:

    Pro^p_(w,h,k)(x, θ_f, θ_p) = exp(φ_p(σ_(w,h,k), θ_p)) / Σ_{k′ ∈ {0} ∪ K_c} exp(φ_p(σ_(w,h,k′), θ_p)).    (3.1)
Therefore, the loss function for the part classifier is defined as below:

    L_p(x, θ_p, θ_f, Y_p) = −(1 / (|X_S| × W × H)) Σ_{x ∈ X_S} Σ_{w=0}^{W−1} Σ_{h=0}^{H−1} Σ_{k=0}^{|K_c|} 1{Y_p(w, h) = k} log(Pro^p_(w,h,k)(x, θ_f, θ_p)).    (3.2)
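Equations (3.1) and (3.2) amount to a per-location softmax cross-entropy. A single-image NumPy sketch, with a hypothetical helper name and channel 0 treated as the background class:

```python
import numpy as np

def part_loss(logits, label_map):
    """Per-location softmax cross-entropy over part classes, Eqs. (3.1)-(3.2).

    logits:    (W, H, K+1) raw part-classifier outputs (channel 0 = background).
    label_map: (W, H) integer part labels Y_p.
    """
    # Eq. (3.1): softmax over the class channel (shifted for numerical stability).
    z = logits - logits.max(axis=-1, keepdims=True)
    prob = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    W, H, _ = logits.shape
    # Eq. (3.2): average negative log-likelihood of the labelled class.
    nll = -np.log(prob[np.arange(W)[:, None], np.arange(H)[None, :], label_map])
    return nll.mean()
```

Averaging over W × H locations (and, in the full loss, over |X_S| source images) reproduces the normalization factor in Eq. (3.2).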
Similarly, we define the loss function for the domain classifier. Let φ_d(σ, θ_d)
be the domain classifier mapping with parameters θ_d. Then the prediction score is
Pro^d(x, θ_f, θ_d) = φ_d(σ, θ_d), and the loss function is given by:

    L_d(x, θ_d, θ_f, Y_d) = −(1 / (|X_S| + |X_T|)) Σ_{x ∈ X_S ∪ X_T} [Y_d log Pro^d(x, θ_f, θ_d) + (1 − Y_d) log(1 − Pro^d(x, θ_f, θ_d))].    (3.3)
Meanwhile, we expect the features learned by the part classifier to be
domain-invariant. That is, we want the feature distribution of the source domain,
{φ_f(x, θ_f) | x ∈ X_S}, to be similar to that of the target domain,
{φ_f(x, θ_f) | x ∈ X_T}. This can be achieved by learning feature-extractor
parameters θ_f that maximize the loss of the domain classifier and parameters θ_d
that minimize the loss of the domain classifier [27]. Thus, we formulate the
proposed model as below:

    E(x, θ_p, θ_f, θ_d, Y_p, Y_d) = L_p(x, θ_p, θ_f, Y_p) − λ L_d(x, θ_d, θ_f, Y_d),    (3.4)

where λ is a positive parameter that controls the trade-off between the
discriminative ability and the transferability of the learned representation.
Higher values of λ lead to closer feature distributions between the source and
target domains, but may harm the performance of the part detector. In this
chapter, we set λ = 0.95 by empirical tuning.
3.3.2 Optimization with Backpropagation
Here, L_p is the loss function that measures the part classification error, while
L_d measures the classification error for the domain label. We adopt the method
used in [27] to optimize the objective function in Equation (3.4). The saddle
point (θ̂_f, θ̂_p, θ̂_d) is defined by the following equations:

    (θ̂_f, θ̂_p) = argmin_{θ_f, θ_p} E(θ_p, θ_f, θ̂_d),    (3.5)

    θ̂_d = argmax_{θ_d} E(θ̂_p, θ̂_f, θ_d).    (3.6)

We can then use the gradient descent algorithm to optimize the objective function
in Equation (3.4) using the saddle-point definition from Equations (3.5)-(3.6):

    θ_f ← θ_f − μ (∂L_p/∂θ_f − λ ∂L_d/∂θ_f),    (3.7)

    θ_p ← θ_p − μ ∂L_p/∂θ_p,    (3.8)

    θ_d ← θ_d − μ ∂L_d/∂θ_d,    (3.9)

where μ is the learning rate.
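One step of the updates (3.7)-(3.9) can be sketched as below, given precomputed gradients; in practice [27] realizes the negated, scaled term in (3.7) with a gradient reversal layer during backpropagation. The function name and scalar-parameter setup are purely illustrative:

```python
def adversarial_updates(theta_f, theta_p, theta_d,
                        dLp_dtheta_f, dLp_dtheta_p,
                        dLd_dtheta_f, dLd_dtheta_d,
                        mu=0.01, lam=0.95):
    """One SGD step of Eqs. (3.7)-(3.9) on scalar (toy) parameters.

    The feature extractor descends the part loss but ASCENDS the domain
    loss (the -lam * dLd term), making features domain-confusing, while
    the domain classifier itself descends its own loss as usual.
    """
    theta_f = theta_f - mu * (dLp_dtheta_f - lam * dLd_dtheta_f)  # Eq. (3.7)
    theta_p = theta_p - mu * dLp_dtheta_p                          # Eq. (3.8)
    theta_d = theta_d - mu * dLd_dtheta_d                          # Eq. (3.9)
    return theta_f, theta_p, theta_d
```

The opposing signs on the domain-loss gradient are what drive the saddle-point behaviour of Eqs. (3.5)-(3.6): θ_d improves at telling the domains apart while θ_f improves at fooling it.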
Table 3.1: Part transferring results for different splits of the CUB200-2011 dataset. Per-part PCKs (%) and mean PCK (%) are given. The abbreviated part names from left to right are: Back, Beak, Belly, Breast, Crown, Forehead, Left Eye, Left Leg, Left Wing, Nape, Right Eye, Right Leg, Right Wing, Tail, and Throat.

Methods       Ba   Bk   Be   Br   Cr   Fh   Le   Ll   Lw   Na   Re   Rl   Rw   Ta   Th   Mean

Testing on the source domain
S(10)         65.4 81.8 59.1 64.4 87.4 81.4 81.7 43.0 45.6 82.6 82.3 43.9 45.5 54.3 85.5 66.9
S(10)+Adap    61.5 83.1 54.7 67.6 85.3 86.1 77.5 53.7 37.3 81.3 86.2 46.3 47.4 60.4 87.1 67.7
S(20)         70.9 85.2 69.9 75.9 85.7 82.4 91.8 50.7 50.0 84.9 81.7 57.6 61.0 58.3 90.0 73.1
S(20)+Adap    71.1 84.4 75.3 78.0 85.7 81.2 82.0 55.3 52.3 83.9 85.9 56.1 59.9 61.4 88.9 73.4
S(50)         73.3 85.1 76.7 79.7 85.3 83.6 89.4 59.9 63.1 86.0 87.8 60.9 65.4 66.9 89.7 76.9
S(50)+Adap    75.4 84.6 79.8 81.9 84.0 86.7 91.2 59.4 66.7 84.6 89.1 53.2 62.1 70.6 92.8 77.5
S(100)        80.8 86.2 80.4 85.2 88.7 86.5 91.0 71.6 68.0 89.3 88.5 65.2 69.9 73.9 93.1 81.2
S(100)+Adap   81.5 86.5 82.2 86.1 88.2 88.5 87.9 69.2 70.6 89.9 88.5 63.9 70.0 72.1 93.4 81.2

Testing on the target domain
S(10)         40.0 72.0 50.9 53.5 75.5 68.7 70.3 31.7 32.2 58.7 72.7 30.2 28.2 22.9 71.1 51.9
S(10)+Adap    54.3 81.3 54.2 61.0 84.2 80.6 71.1 43.6 40.6 75.0 82.8 33.3 37.3 37.2 83.7 61.4
S(20)         62.4 81.5 66.9 73.2 81.5 79.4 79.7 44.4 48.9 75.5 79.6 51.2 51.3 40.2 85.6 66.8
S(20)+Adap    73.1 84.6 73.3 78.5 84.7 81.5 83.7 53.0 57.4 82.5 83.9 58.2 59.6 54.1 89.6 73.2
S(50)         67.3 84.2 75.5 79.7 84.0 83.5 87.1 68.7 60.7 83.5 87.8 57.0 62.5 55.4 89.9 74.5
S(50)+Adap    74.6 85.1 78.3 81.5 84.7 86.2 90.8 56.8 65.5 83.8 87.9 52.5 63.0 60.8 92.7 76.3
S(100)        75.0 85.9 76.7 84.7 86.6 86.2 91.0 66.8 71.8 87.4 88.4 60.6 73.8 66.0 92.3 79.5
S(100)+Adap   77.7 85.4 77.4 85.2 87.8 87.4 88.0 62.6 69.5 89.9 89.0 59.0 69.9 64.3 93.5 79.0
3.4 Experiments
3.4.1 Datasets and Setting
Datasets. We evaluate our method on two datasets for fine-grained part localization.
(a) CUB200-2011 [105] is a widely used dataset for bird part localization. It
contains 200 bird categories and 11,788 images, with roughly 30 training images
per category. Each image has a bounding box and 15 keypoint annotations. (b)
BirdSnap is a larger bird dataset containing 500 bird species and 49,829 images in
total. This dataset also has an object bounding box and 11 body-part annotations
for each image. To evaluate the localization performance, we use the percentage of
correctly localized keypoints (PCK) as the evaluation metric. In PCK, given a
ground-truth bounding box of size (w, h), a prediction is counted as a true
positive if it lies within α · max(w, h) of the nearest ground-truth keypoint,
where α ∈ (0, 1) controls the error tolerance. In this work, we set α = 0.1 in
all the experiments.
Settings. To evaluate the part transferability, we first perform a quantitative
evaluation on the CUB200-2011 dataset. The training set is split into source and
target domains in four ways: 10, 20, 50, or 100 species are randomly selected for
the source domain, with the remaining 190, 180, 150, or 100 species used for the
target domain. The testing set is split in the same way for performance
evaluation. Then we evaluate part transferability across datasets.

Table 3.2: Part transferring from CUB200-2011 (source) to BirdSnap (target). Per-part PCKs (%) and mean PCK (%) are given.

Methods       Bk   Cr   Na   Le   Re   Be   Br   Ba   Ta   Lw   Rw   Mean

Testing on the source domain
S(CUB)        87.7 88.1 92.0 92.2 92.1 85.8 88.0 83.7 77.8 78.7 76.2 85.6
S(CUB)+Adap   88.5 88.3 92.5 92.2 92.4 86.6 89.0 84.2 79.7 79.2 77.0 86.3

Testing on the target domain
S(CUB)        78.8 81.2 83.5 85.1 84.9 53.0 76.3 69.8 45.1 60.2 61.7 71.0
S(CUB)+Adap   78.0 83.0 84.9 85.5 86.1 56.8 77.2 73.4 47.4 62.0 62.3 72.4
Here, the CUB200-2011 dataset with 200 species is used as source domain, and
the Birdsnap with 500 species is used for the target domain.
3.4.2 Results and Analysis
We use the detector trained on the source training set only as the baseline. We
then compare the performance of our method in different dataset settings against
the baseline detector. Several observations can be made from Table 3.1. First,
performing domain adaptation with target data does not yield a substantial
performance improvement on the source testing set. However, there is a significant
performance gain when performing domain adaptation in settings with a small number
of species in the source domain. For example, our method achieves 61.4% PCK, an
18.3% improvement over the baseline detector, in the setting with ten species used
for the source domain. It is also worth noting that training on 10 species with
part labels and 190 species without part annotations obtains a modest accuracy for
part localization on the 190 unseen species. This demonstrates that semantic parts
can be learned from a sufficiently diverse set of classes even with insufficient
part annotations. However, the improvement diminishes when the number of species
in the source domain is sufficiently large, because the feature distributions of
the source and target domains are already relatively close in this case.
3.5 Conclusions
In this chapter, we focus on the problem of transferring semantic parts across
fine-grained species. We have proposed a deep domain adaptation method for
part detection. Our method combines part detection and domain adaptation in
the same learning pipeline. We have then examined the question of how many
species of images are sufficient to learn a part detector. To answer this question,
we perform a quantitative evaluation on CUB200-2011 and Birdsnap datasets.
Experimental results suggest that a small number of species can be used to learn
a modest detector when training with domain adaptation techniques.
Chapter 4
Fine-grained Categorization with
Part Localization
In this chapter, we start to explore how to incorporate the technique of key-
point localization into the fine-grained categorization system. A well-designed
system for fine-grained categorization usually has three contradictory require-
ments: accuracy (the ability to identify objects among subordinate categories);
interpretability (the ability to provide the human-understandable explanation of
recognition system behavior); and efficiency (the speed of the system). To handle
the trade-off between accuracy and interpretability, we propose the ”Part-Stacked
CNN” and ”Deeper Part-Stacked CNN” architectures, both armed with interpretability.
To obtain information at the part level, we need to know the location of each
part. Hence, we utilize the technique of keypoint localization to obtain part
locations. Next, we crop the part features and fuse them with the object feature
for fine-grained categorization. Our method can therefore simultaneously encode
object-level and part-level cues, thereby outperforming state-of-the-art
approaches on Caltech-UCSD Birds-200-2011.
4.1 Introduction
Fine-grained visual categorization (FGVC) refers to the task of identifying ob-
jects from subordinate categories and is now an important subfield in object
Figure 4.1: Overview of the proposed approach. We propose to classify fine-grained categories by modeling the subtle differences in specific object parts (for example, a California Gull has a beak that is noticeably different from that of a Ring-billed Gull). Beyond classification results, the proposed DPS-CNN architecture also offers human-understandable instructions on how to classify highly similar object categories explicitly.
recognition. FGVC applications include, for example, recognizing species of
birds [8, 105, 110], pets [44, 71], flowers [5, 68], and cars [61, 89]. Lay individu-
als tend to find it easy to quickly distinguish basic-level categories (e.g., cars or
dogs), but identifying subordinate classes like ”Ring-billed gull” or ”California
gull” can be difficult, even for bird experts. Tools that aid in this regard would
be of high practical value.
This task is made challenging by the small inter-class variance caused by subtle
differences between subordinate categories and the large intra-class variance
caused by nuisance factors such as differing pose, multiple views, and occlusions.
However, impressive progress [8,46,103,104,112] has been made over the last few
years, and fine-grained recognition techniques are now close to practical use in
various applications such as for wildlife observation and in surveillance systems.
While numerous attempts have been made to boost the classification accuracy
of FGVC [11,16,21,52,107], an important aspect of the problem has yet to be ad-
dressed, namely the ability to generate a human-understandable ”manual” on how
to distinguish fine-grained categories in detail. For example, ecological protection
volunteers would benefit from an algorithm that could not only accurately classify
bird species but also provide brief instructions on how to distinguish very similar
subspecies (a ”Ring-billed” and a ”California gull”, for instance, differ only in
their beak pattern, see Figure 4.1), aided by some intuitive illustrative exam-
ples. Existing fine-grained recognition methods that aim to provide a visual field
guide mostly follow a ”part-based one-vs.-one features” (POOFs) [6–8] routine
or employ human-in-the-loop methods [12,48,102]. However, since the amount of
available data requiring interpretation is increasing drastically, a method that si-
multaneously implements and interprets FGVC using deep learning methods [47]
is now both possible and advocated.
It is widely acknowledged that the subtle differences between fine-grained cate-
gories mostly reside in the unique properties of object parts [6,16,62,78,120,126].
Therefore, a practical solution to interpreting classification results as human-
understandable manuals is to discover classification criteria from object parts.
Some existing fine-grained datasets provide detailed part annotations including
part landmarks and attributes [61, 105]. However, they are usually associated
with a large number of object parts, which incur a heavy computational bur-
den for both part detection and classification. From this perspective, a method
that follows an object part-aware strategy to provide interpretable prediction cri-
teria at minimal computational effort but deals with large numbers of parts is
desirable. In this scenario, independently training a large convolutional neural
network (CNN) for each part and then combining them in a unified framework is
impractical [120].
Here we address the fine-grained categorization problem not only regarding
accuracy and efficiency when performing subordinate-level object recognition but
also about the interpretable characteristics of the resulting model. We do this by
learning a new part-based CNN for FGVC that models multiple object parts in a
unified framework with high efficiency. Similar to previous fine-grained recogni-
tion approaches, the proposed method consists of a localization module to detect
object parts (where pathway) and a classification module to classify fine-grained
categories at the subordinate level (what pathway). In particular, our keypoint
localization network structure is composed of a sub-network used in contempo-
rary classification networks (AlexNet [47] and BN-GoogleNet [38]) and a 1x1
convolutional layer followed by a softmax layer to predict evidence of part loca-
tions. The inferred part locations are then fed into the classification network, in
which a two-stream architecture is proposed to analyze images at both the ob-
ject level (global information) and part level (local information). Multiple parts
are then computed via a shared feature extraction route, separated directly on
feature maps using a part cropping layer, concatenated, and then fed into a shal-
lower network for object classification. Except for categorical predictions, our
method also generates interpretable classification instructions based on object
parts. Since the proposed deeper network architecture-based framework employs
a sharing strategy that stacks the computation of multiple parts, we call the
proposed architecture based on Alexnet Part-Stacked CNN (PS-CNN), and the
other one used deeper structure Deeper Part-Stacked CNN (DPS-CNN).
This chapter makes the following contributions:
1. DPS-CNN is the first efficient framework that not only achieves state-of-
the-art performance on Caltech-UCSD Birds-200-2011 but also allows in-
terpretation;
2. We explore a new paradigm for keypoint localization that exceeds state-
of-the-art performance on the Birds-200-2011 dataset;
3. The classification network in DPS-CNN follows a two-stream structure that
captures both object level (global) and part level (local) information, in
which a new share-and-divide strategy is presented to compute multiple
object parts. As a result, the proposed architecture is very efficient, running
at 32 frames/sec¹ without sacrificing fine-grained categorization accuracy. We
also propose a new feature fusion strategy called scale mean-max (SMM).
This work is not a direct extension of state-of-the-art fine-grained classification
models [52, 119, 124, 125] but a significant development regarding the following
¹For reference, a single CaffeNet runs at 82 frames/sec under the same experimental setting.
aspects: Different from [124], which adapts an FCN for part localization, we
propose a new paradigm for keypoint localization that first samples a small
number of representative pixels and then determines their labels via a
convolutional layer followed by a softmax layer. We also propose a new network
architecture and enrich the methodology used in [37]. Further, we introduce a
simple but effective part feature encoding method (named Scale Average Max),
in contrast to the Bilinear pooling in [52], the Spatially Weighted Fisher Vector
in [125], and the Part-based Fully Connected layers in [125].
The remainder of this chapter is organized as follows. Related works are sum-
marized in Section 4.2, and the proposed architectures including Part-Stacked
CNN (PS-CNN) and Deeper Part-Stacked CNN (DPS-CNN) are described in Sec-
tion 4.3 and Section 4.4. Detailed performance studies and analysis are presented
in Section 4.5, and in Section 4.6 we conclude and propose various applications
of the proposed DPS-CNN architecture.
4.2 Related Work
4.2.1 Keypoint Localization
Subordinate categories share a fixed number of semantic components defined as
'parts' or 'key points' but with subtle differences in these components. Intuitively,
when distinguishing between two subordinate categories, the widely accepted
approach is to align the components containing these fine differences. Therefore,
localizing parts or key points plays a crucial role in fine-grained recognition, as
demonstrated in recent works [6, 32, 62, 120, 123, 129].
Seminal works in this area have relied on prior knowledge about the global
shape [18,64,65,81]. For example, the active shape model (ASM) uses a mixture
of Gaussian distributions to model the shape. Although these techniques provide
an effective way to locate facial landmarks, they cannot usually handle a wide
range of differences such as those seen in bird species recognition. The other group
of methods [11,50,54,56,84,118–120] trains a set of keypoint detectors to model
local appearance and then uses a spatial model to capture their dependencies and
has become more popular in recent years. Among them, the part localization
method proposed in [50, 84, 119] is most similar to ours. In [84], a convolutional
sub-network is used to predict the bounding box coordinates without using a
region candidate. Although its performance is acceptable because the network is
learned by jointly optimizing the part regression, classification, and alignment,
all parts of the model need to be trained separately. To tackle this problem, [50]
and [119] adopt a pipeline similar to that of Fast R-CNN [31], in which part region
candidates are generated to learn the part detector. In this work, we discard the
common proposal-generating process and regard all receptive field centers¹ of a
certain intermediate layer as potential candidate key points. This strategy results
in a highly efficient localization network since we take advantage of the natural
properties of CNNs to avoid the process of proposal generation.
Our work is also inspired by fully convolutional networks (FCNs) [57], which
produce dense predictions with convolutional networks. However, our network
structure is best regarded as a fast and effective approach to predicting sparse
pixels, since we only need to determine the class labels of the centers of the
receptive fields of interest. Thus, FCN is more suited to segmentation, while
our framework is designed for sparse keypoint detection. FCN predicts
intermediate feature maps and then upsamples them to match the input image
size for pixel-wise prediction. Recent works [109, 124] borrow this idea directly
for keypoint localization. During training, both of these works resize the ground
truths to the size of the output feature maps and then use them to supervise the
network learning, while, during testing, the predicted feature maps are resized to
match the input size to generate the final key point prediction. However, these
methods cannot guarantee accurate position prediction due to the upsampling
process.
4.2.2 Fine-Grained Visual Categorization
Many methods have been developed to classify object categories at the subordi-
nate level. The best-performing methods have gained performance improvements
by exploiting the following three aspects: more discriminative features (including
¹Here the receptive field means the area of the input image to which a location in a higher-layer feature map corresponds.
Figure 4.2: Illustration of the localization network. (a) Suppose a certain layer outputs feature maps of size 3 × 3; the corresponding receptive fields are shown by dashed boxes. We represent the center of each receptive field with the feature vector at the corresponding position. (b) The first column is the input image. In the second image, each black dot is a candidate point indicating the center of a receptive field. The final stage is to determine whether a candidate point corresponds to a particular part.
deep CNNs) for better visual representation [9, 47, 80, 87, 92]; explicit alignment
approaches to eliminate pose displacements [11, 29]; and part-based methods to
examine the impact of object parts [6, 32, 62, 120, 123, 129]. Another approach
has been used to explore human-in-the-loop methods [13, 21, 106] to identify the
most discriminative regions for classifying fine-grained categories. Although such
methods provide direct and important information about how humans perform
fine-grained recognition, they are not scalable due to the need for human inter-
actions during testing. Of these, part-based methods are thought to be most
relevant to fine-grained recognition, since the subtle differences between fine-
grained categories mostly relate to unique object part properties.
Some part-based methods [6,120] employ strong annotations including bound-
ing boxes, part landmarks, or attributes from existing fine-grained recognition
datasets [61, 71, 103, 105]. While strong supervision significantly boosts per-
formance, the expensive human labeling process motivates the use of weakly-
supervised fine-grained recognition without manually labeled part annotations,
i.e., discovering object parts in an unsupervised fashion [46, 52, 86]. Current state-
of-the-art methods for fine-grained recognition include [124] and [52], both of
which employ deep feature encoding methods, whereas our methods are largely
inherited from [120], which first detects the locations of two object parts and
then trains an individual CNN based on the unique properties of each part.
Compared to part-based R-CNN, the proposed methods are far more efficient for
both detection and classification. As a result, we can use many more object parts
than [120] while still maintaining speed during testing.
Lin et al. [52] argued that manually defined parts were sub-optimal for object
recognition and thus proposed a bilinear model consisting of two streams whose
roles were interchangeable as detectors or features. Although this design exploited
a data-driven approach that possibly improves classification performance, it also
made the resulting model difficult to interpret. In contrast, our methods attempt
to balance the need for classification accuracy and model interpretability in fine-
grained recognition systems.
Figure 4.3: The network architecture of the proposed Part-Stacked CNN model. The model consists of 1) a fully convolutional network for part landmark localization; 2) a part stream where multiple parts share the same feature extraction procedure, while being separated by a novel part crop layer given detected part locations; 3) an object stream with lower spatial resolution input images to capture bounding-box-level supervision; and 4) three fully connected layers to achieve the final classification results based on a concatenated feature map containing information from all parts and the bounding box.
4.3 Part-Stacked CNN
We present the model architecture of the proposed Part-Stacked CNN in this
section. In accordance with the common framework for fine-grained recognition,
the proposed architecture is decomposed into a Localization Network (Section
4.3.1) and a Classification Network (Section 4.3.2). We adopt CaffeNet [40], a
slightly modified version of the standard seven-layer AlexNet [47] architecture,
as the basic structure of the network; deeper networks could potentially lead to
better recognition accuracy, but may also result in lower efficiency.
A unique design in our architecture is that the message-transferring operation
from the localization network to the classification network, i.e., using
detected part locations to perform part-based classification, is conducted directly
on the conv5 output feature maps within the process of data forwarding. This is a
significant difference compared to the standard two-stage pipeline of part-based
R-CNN [120] that consecutively localizes object parts and then trains part-specific
CNNs on the detected regions. Based on this design, a set of sharing schemes are
performed to make the proposed PS-CNN fairly efficient for both learning and
inference. Figure 4.3 illustrates the overall network architecture.
4.3.1 Localization Network
The first stage of the proposed architecture is a localization network that aims to
detect the location of object parts. We employ the simplest form of part landmark
annotations, i.e. a 2D key point is annotated at the center of each object part.
Assume that M, the number of object parts labeled in the dataset, is sufficiently
large to offer a complete set of object parts on which fine-grained categories
usually differ from each other. Motivated by recent progress in human pose
estimation [57] and semantic segmentation [95], we adopt a fully convolutional
network (FCN) [63] to generate dense output feature maps for locating object
parts.
We model the part localization process as a multi-class classification problem
on dense output spatial positions. In particular, suppose the output of the last
convolutional layer in the FCN is of size h × w × d, where h and w are
spatial dimensions and d is the number of channels. We set d = M + 1, where M
is the number of object parts and 1 denotes an additional channel to model
the background. To generate corresponding ground-truth labels in the form of
feature maps, units indexed by h×w spatial positions are labeled by their nearest
object part; units that are not close to any of the labeled parts (with an overlap
< 0.5 on receptive field) are labeled as background.
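As a rough sketch of this labeling scheme (the grid size, image size, and the distance threshold standing in for the receptive-field overlap test are illustrative assumptions, not values from the text):

```python
import numpy as np

def make_label_map(keypoints, h=27, w=27, img_size=454, radius=40.0):
    """Build an (h, w) ground-truth label map from 2D part keypoints.

    keypoints: dict mapping part id (1..M) -> (y, x) in input-image
    coordinates; cells with no nearby part get the background label 0.
    """
    labels = np.zeros((h, w), dtype=np.int64)
    sy, sx = img_size / h, img_size / w            # grid cell size in pixels
    for i in range(h):
        for j in range(w):
            cy, cx = (i + 0.5) * sy, (j + 0.5) * sx  # cell centre
            best, best_d = 0, radius
            for part, (py, px) in keypoints.items():
                d = float(np.hypot(cy - py, cx - px))
                if d < best_d:                       # nearest sufficiently close part wins
                    best, best_d = part, d
            labels[i, j] = best
    return labels

labels = make_label_map({1: (227.0, 227.0)})  # one part at the image centre
print(labels[13, 13], labels[0, 0])           # centre cell labelled 1, corner 0
```

Each of the h × w units thus receives the label of its nearest annotated part when one is sufficiently close, and the background label otherwise, mirroring the scheme described above.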
A practical problem here is to determine the model depth and the size of
input images for training the FCN. Generally speaking, layers at later stages
carry more discriminative power and thus are more likely to generate promising
localization results; however, their receptive fields are also much larger than those
of previous layers. For example, the receptive field of conv5 layer in CaffeNet has
a size of 163 × 163 compared to the 227 × 227 input image, which is too large
to model an object part. We propose a simple trick to deal with this problem,
Figure 4.4: Demonstration of the localization network. The training process is denoted inside the dashed box. For inference, a Gaussian kernel is introduced to remove noise. The results are M 2D part locations in the 27 × 27 conv5 feature map.
i.e., upsampling the input images so that the fixed-size receptive fields denoting
object parts become relatively smaller compared to the whole object, while still
being able to use layers at later stages to guarantee enough discriminative power.
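The receptive-field size quoted above can be checked with a short calculation. The layer list below encodes (kernel, stride) pairs for CaffeNet up to conv5 and is an assumption of this sketch, based on the standard AlexNet configuration:

```python
def receptive_field(layers):
    """Compose the receptive-field size of the last layer in `layers`,
    where each entry is a (kernel, stride) pair ordered from the input."""
    rf, jump = 1, 1              # field size and effective stride so far
    for k, s in layers:
        rf += (k - 1) * jump     # each layer widens the field by (k-1) input strides
        jump *= s
    return rf

# conv1, pool1, conv2, pool2, conv3, conv4, conv5
caffenet_to_conv5 = [(11, 4), (3, 2), (5, 1), (3, 2), (3, 1), (3, 1), (3, 1)]
print(receptive_field(caffenet_to_conv5))  # -> 163
```

This reproduces the 163 × 163 conv5 receptive field mentioned above, and makes the upsampling trick concrete: doubling the input resolution halves the receptive field's size relative to the object.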
The localization network in the proposed PS-CNN is illustrated in Figure 4.4.
The input of the FCN is a bounding-box-cropped RGB image, warped and resized
into a fixed size of 454 × 454. The structure of the first five layers is identical
to those in CaffeNet, which leads to a 27 × 27 × 256 output after conv5 layer.
Afterwards, we further introduce a 1 × 1 convolutional layer with 512 output
channels as conv6, and another 1 × 1 convolutional layer with M + 1 outputs
termed conv7 to perform classification. By adopting a spatially preserving softmax
that normalizes predictions at each spatial location of the feature map, the final
loss function is a sum of softmax losses over all 27 × 27 positions.
4.3.2 Classification network
The second stage of the proposed PS-CNN is a classification network that takes
the inferred part locations as input. It follows a two-stream architecture with
a Part Stream and an Object Stream to capture semantics from multiple levels.
A sub-network consisting of three fully connected layers then serves as the
object classifier, as shown in Figure 4.3.
58
Part stream. The part stream acts as the core of the proposed PS-CNN ar-
chitecture. To capture object-part-dependent differences between fine-grained
categories, one could train a set of part CNNs, each of which conducts classi-
fication on one part separately, as proposed by Zhang et al. [120]. Although such
a method worked well in [120], which only employed two object parts, we argue
that it is not applicable when the number of object parts is much larger, as in
our case, because of the high time and space complexity.
In PS-CNN, we introduce two strategies to improve the efficiency of the part
stream. The first one is model parameter sharing. Specifically, model parameters
of the first five convolutional layers are shared among all object parts, which can
be regarded as a generic part-level feature extractor. This strategy leads to fewer
parameters in the proposed architecture and thus reduces the risk of overfitting.
Other than model parameter sharing, we also conduct a computational sharing
strategy. The goal is to make sure that the feature extraction procedure of all
parts only requires one pass through the convolutional layers. Analogous to the
localization network, the input images of the part stream are at the doubled
resolution of 454 × 454 so that the respective receptive fields are not too large to
model object parts; forwarding the network to the conv5 layer generates output
feature maps of size 27 × 27. Up to this point, the computation of all object
parts is completely shared.
After performing the shared feature extraction procedure, the computation
of each object part is then partitioned through a part crop layer to model part-
specific classification cues. For each part, the part crop layer extracts a local
neighborhood region centered at the detected part location. Features outside the
cropped region are simply dropped. In practice, we crop 6 × 6 neighborhood
regions out of the 27 × 27 conv5 feature maps to match the output size of the
object stream. The resultant receptive fields of the cropped feature maps have
a width of 243, given the receptive field size of the conv5 layer and the respective
stride.
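A minimal sketch of the part crop layer's forward pass; clamping the window at the map borders is our assumption for parts detected near the edge:

```python
import numpy as np

def part_crop(fmap, center, size=6):
    """Extract a size x size neighbourhood centred at `center` from a
    (C, H, W) feature map, clamping the window to the map borders.
    Returns the crop and its top-left offset (useful for the backward pass)."""
    _, H, W = fmap.shape
    h, w = center
    top = min(max(h - size // 2, 0), H - size)
    left = min(max(w - size // 2, 0), W - size)
    return fmap[:, top:top + size, left:left + size], (top, left)

fmap = np.zeros((256, 27, 27))              # conv5-like feature maps
crop, offset = part_crop(fmap, (13, 13))    # crop around a detected part
print(crop.shape, offset)                   # (256, 6, 6) (10, 10)
```

Features outside the window are simply dropped, exactly as described above; only the 6 × 6 slice per part flows on to the classifier.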
Object stream. The object stream utilizes bounding-box-level supervision to
capture object-level semantics for fine-grained recognition. It follows the general
architecture of CaffeNet, in which the input of the network is a 227 × 227 RGB
image and the output of the pool5 layer is a set of 6 × 6 feature maps.
We find the design of the two-stream architecture in PS-CNN analogous to
the famous Deformable Part-based Models [24], in which object-level features are
captured by a root filter at a coarser scale, while detailed part-level infor-
mation is modeled by several part filters at a finer scale. It is critical to
measure visual cues from multiple semantic levels in an object recognition algo-
rithm.
Dimensionality reduction and fully connected layers. The aforemen-
tioned two-stream architecture generates an individual feature map for each ob-
ject part and bounding box. When conducting classification, they serve as an
over-complete set of CNN features from multiple scales. Following the standard
CaffeNet architecture, we employ a DNN including three fully connected layers
as object classifiers. The first fully connected layer fc6 now becomes a part con-
catenation layer whose input is generated by stacking the output feature maps of
the part stream and the object stream together. However, such a concatenating
process requires M + 1 times more model parameters than the original fc6 layer
in CaffeNet, which leads to a huge memory cost.
To reduce model parameters, we introduce a 1 × 1 convolutional layer termed
conv5_1 in the part stream that projects the 256-dimensional conv5 output to
32-d. This is equivalent to a low-rank projection of the model output and thus can
be initialized through standard PCA. Nevertheless, in our experiments, we find that
directly initializing the weights of the additional convolution by PCA in practice
worsens the performance. To enable domain-specific fine-tuning from pre-trained
CNN model weights, we train an auxiliary CNN to initialize the weights for the
additional convolutional layer.
Let X^c ∈ R^{N×M×6×6} be the c-th 6 × 6 region cropped around the center point
(h*_c, w*_c) from the conv5_1 feature maps X ∈ R^{N×M×27×27}, where (h*_c, w*_c) is the
predicted location for part c and N is the number of output feature maps. The
output of the part concatenation layer fc6 can be formulated as:

    f_out(X) = σ(∑_{c=1}^{M} (W^c)^T X^c),    (4.1)

where W^c are the model parameters for part c in the fc6 layer, and σ is an
activation function.
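Eq. (4.1) says that applying fc6 to the concatenated part features is equivalent to summing per-part projections. A small numpy sketch follows; the dimensions (32 × 6 × 6 = 1152 per part, 4096 outputs) and the choice of ReLU for σ are illustrative assumptions:

```python
import numpy as np

def fc6_output(part_feats, weights):
    """part_feats: list of M flattened crop features x^c (d-vectors);
    weights: list of M (d, out) matrices W^c.
    Computes sigma(sum_c (W^c)^T x^c) with sigma = ReLU, as in Eq. (4.1)."""
    z = sum(W.T @ x for x, W in zip(part_feats, weights))
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
parts = [rng.normal(size=1152) for _ in range(3)]       # 3 parts, 32*6*6 each
Ws = [rng.normal(size=(1152, 4096)) for _ in range(3)]  # per-part fc6 weights
out = fc6_output(parts, Ws)
print(out.shape)  # (4096,)
```

Summing the projections rather than materializing one huge concatenated vector is what makes the per-part weight blocks W^c explicit.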
We use standard gradient descent to train the classification network. The
most complicated part of the gradient computation lies in the dimension-
reduction layer, due to the effect of part cropping. Specifically, the gradient of
each cropped part feature map (at 6 × 6 spatial resolution) is projected back to
the original size of conv5 (27 × 27 feature maps) according to the respective part
location and then summed up. Note that the proposed PS-CNN is implemented
as a two-stage framework, i.e., after training the FCN, the weights of the
localization network are fixed while training the classification network.
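The backward pass of the part crop layer described above can be sketched as a scatter-add; treating part locations as the top-left crop offsets recorded during the forward pass is an assumption of this sketch:

```python
import numpy as np

def crop_backward(grad_crops, offsets, H=27, W=27, size=6):
    """Project each part's (C, size, size) gradient back onto the
    (C, H, W) conv5 grid at its crop offset and sum where crops overlap."""
    C = grad_crops[0].shape[0]
    grad = np.zeros((C, H, W))
    for g, (top, left) in zip(grad_crops, offsets):
        grad[:, top:top + size, left:left + size] += g
    return grad

g1 = np.ones((8, 6, 6))
g2 = np.ones((8, 6, 6))
grad = crop_backward([g1, g2], [(0, 0), (3, 3)])
print(grad[0, 4, 4])  # 2.0 -- the two crops overlap at this position
```

Positions outside every crop receive zero gradient, matching the fact that features outside the cropped regions were dropped in the forward pass.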
4.4 Deeper Part-Stacked CNN
A key motivation of our proposed method is to produce a fine-grained recogni-
tion system that not only considers recognition accuracy but also addresses effi-
ciency and interpretability. To ensure that the resulting model is interpretable,
we employ strong part-level annotations with the potential to provide human-
understandable classification criteria. We also adopt several strategies such as
sparse prediction instead of dense prediction to eliminate part proposal genera-
tion and to share computation for all part features. For the sake of classification
accuracy, we learn a comprehensive representation by incorporating both global
(object-level) and local (part-level) features. Based on these, in this section, we
present the model architecture of the proposed Deeper Part-Stacked CNN (DPS-
CNN).
According to the common framework for fine-grained recognition, the pro-
posed architecture is decomposed into a localization network (Section 4.4.1) and
a classification network (Section 4.4.2). In our previous work [37], we adopted
CaffeNet [40], a slightly modified version of the standard seven-layer AlexNet
architecture [47], as the basic network structure. In this paper, we use a deeper
and more powerful network (BN-GoogleNet) [38] as a substitute. A unique
feature of our architecture is that the message-transferring operation from the
localization network to the classification network, which uses the detected part
locations to perform part-based classification, is conducted directly on the inception-4a
Figure 4.5: Demonstration of the localization network. The training process is denoted inside the dashed box. For inference, a Gaussian kernel is introduced to remove noise. The results are M 2D part locations in the 28 × 28 conv feature map.
output feature maps within the data forwarding process. This is a significant
departure from the standard two-stage pipeline of part-based R-CNN, which
consecutively localizes object parts and then trains part-specific CNNs on the
detected regions. Based on this design, sharing schemes are performed to make
the proposed DPS-CNN fairly efficient for both learning and inference. Figure
4.6 illustrates the overall network architecture.
4.4.1 Localization Network
The first stage in our proposed architecture is a localization network that aims
to detect the location of object parts. We employ the simplest form of part
landmark annotation, where a 2D key point is annotated at the center of each
object part. Assume that M, the number of object parts labeled in the dataset,
is sufficiently large to offer a complete set of object parts on which fine-grained
categories usually differ. A naive approach to predicting these key points
is to apply FCN architecture [?] for dense pixel-wise prediction. However, this
method biases the learned predictor because, in this task and unlike semantic
segmentation, the number of keypoint annotations is extremely small compared
to the number of irrelevant pixels.
Motivated by the recent progress in object detection [77] and semantic seg-
mentation [57], we propose to use the centers of receptive fields as key point can-
didates and use a fully convolutional network to perform sparse pixel prediction
to locate the key points of object parts (see Figure 4.2(b)). In the field of object
detection, box candidates expected to be likely objects are first extracted using
proposal-generating methods such as selective search [100] and region proposal
networks [77]. Then, CNN features are learned to represent these box candidates
and finally used to determine their class label. We adapt this pipeline to key
point localization but omit the candidate generation process and simply treat the
centers of receptive fields corresponding to a certain layer as candidate points.
As shown in Figure 4.2(a), the advantage of using this method is that each candi-
date point can be represented by a 1D cross-channel feature vector in the output
feature maps. Also, in our candidate point evaluation experiments in Table 4.9,
we find that, given an input image of size 448 × 448, using the receptive fields
of the inception-4a layer in BN-GoogleNet generates 28 × 28 candidate points and
achieves 100% recall at [email protected].
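Under a uniform-grid approximation (ignoring padding offsets at the borders, which is an assumption of this sketch), the candidate points for a 448 × 448 input and a 28 × 28 feature map can be enumerated directly:

```python
def candidate_points(img_size=448, grid=28):
    """Approximate the receptive-field centres of a grid x grid feature map
    as a uniform lattice over the input image (effective stride 16 here)."""
    step = img_size / grid
    coords = [(i + 0.5) * step for i in range(grid)]
    return [(y, x) for y in coords for x in coords]

pts = candidate_points()
print(len(pts), pts[0])  # 784 candidates; first centre at (8.0, 8.0)
```

Each of these 784 grid positions is backed by one cross-channel feature vector in the inception-4a output, which is precisely what makes proposal generation unnecessary.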
Fully convolutional network. An FCN is obtained by replacing the parameter-
rich fully connected layers in standard CNN architectures with convolutional
layers. Given an input RGB image, the output of an FCN is a feature map of
reduced dimension compared to the input. The computation of each unit in the
feature map corresponds only to pixels inside a fixed-size region of the input
image, which is called its receptive field. We prefer FCNs for the following
reasons: (1) feature maps generated by FCNs can be directly utilized as part-
locating results in the classification network, as detailed in Section 4.4.2; (2) the
results for multiple object parts can be obtained simultaneously; (3) FCNs are
very efficient for both learning and inference.
Learning. We model the part localization process as a multi-class classification
problem on sparse output spatial positions. Specifically, suppose the output of
the last FCN convolutional layer is of size h× w × d, where h and w are spatial
dimensions and d is the number of channels. We set d = M + 1. Here, M
is the number of object parts and 1 denotes an additional channel to model the
background. To generate corresponding ground-truth labels in the form of feature
maps, units indexed by h×w spatial positions are labeled with their nearest object
part; units that are not close to any of the labeled parts (with an overlap of less
than 0.5 with respect to the receptive field) are labeled as background. In this
way, ground-truth part annotations are transformed into the form of corresponding feature
maps, while in recent works that directly apply FCNs [109, 124], the supervision
information is generated by directly resizing the part ground-truth image.
Another practical problem here is determining the model depth and the input
image size for training the FCN. Generally, layers at later stages carry more
discriminative power and, therefore, are more likely to generate good localization
results; however, their receptive fields are also much larger than those of previous
layers. For example, the receptive field of the inception-4a layer in BN-GoogleNet
has a size of 107×107 compared to the 224×224 input image, which is too large to
model an object part. We propose a simple trick to deal with this problem, namely
upsampling the input images so that the fixed size receptive fields denoting object
parts become relatively smaller compared to the whole object, while still using
later stage layers to guarantee discriminative power. In the proposed architecture,
the input image is upsampled to double the resolution and the inception-4a layer
is adopted to guarantee discrimination.
The localization network is illustrated in Figure 4.5. The input images are
warped and resized into a fixed size of 448 × 448. All layers from the beginning
to the inception-4a layer are cut from the BN-GoogleNet architecture, so the
output size of the inception-4a layer is 28 × 28 × 576. Then, we further introduce
a 1 × 1 convolutional layer with M + 1 outputs, termed conv, for classification.
By adopting a location-preserving softmax that normalizes predictions at each
spatial location of the feature map, the final loss function is a sum of softmax
loss at all 28× 28 positions:
    L = −∑_{h=1}^{28} ∑_{w=1}^{28} log σ(h, w, c),    (4.2)

where

    σ(h, w, c) = exp(f_conv(h, w, c)) / ∑_{c'=0}^{M} exp(f_conv(h, w, c')).

Here, c ∈ {0, 1, ..., M} is the part label of the patch at location (h, w), where
label 0 denotes background, and f_conv(h, w, c) stands for the output of the conv
layer at spatial position (h, w) and channel c.
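A numerically stable numpy sketch of this location-preserving softmax loss, where `scores` plays the role of the conv outputs f_conv (the 28 × 28 × 16 shape in the demo is illustrative):

```python
import numpy as np

def spatial_softmax_loss(scores, labels):
    """scores: (H, W, M+1) conv-layer outputs; labels: (H, W) integer
    part labels (0 = background).  Sums the per-position softmax losses,
    as in Eq. (4.2)."""
    s = scores - scores.max(axis=-1, keepdims=True)      # stability shift
    log_p = s - np.log(np.exp(s).sum(axis=-1, keepdims=True))
    H, W = labels.shape
    rows = np.arange(H)[:, None]
    cols = np.arange(W)[None, :]
    return float(-log_p[rows, cols, labels].sum())       # pick each cell's true-class log-prob

# With uniform (all-zero) scores, every position contributes log(M+1):
loss = spatial_softmax_loss(np.zeros((28, 28, 16)), np.zeros((28, 28), dtype=int))
print(np.isclose(loss, 28 * 28 * np.log(16)))  # True
```

The advanced indexing selects, at every spatial position, the log-probability of that position's ground-truth label, so the sum over the 28 × 28 grid is exactly the loss above.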
Figure 4.6: Network architecture of the proposed Deeper Part-Stacked CNN. The model consists of: (1) a fully convolutional network for part landmark localization; (2) a part stream where multiple parts share the same feature extraction procedure, while being separated by a novel part crop layer given detected part locations; (3) an object stream to capture global information; and (4) a feature fusion layer that takes feature vectors from the part stream and the object stream to produce the final feature representation.
Inference. Inference starts from the output of the learned FCN, i.e., (M + 1)
part-specific heat maps of size 28 × 28, to which we apply a Gaussian kernel G
to remove isolated noise. The final output of the localization network is M
locations in the 28 × 28 conv feature map, each computed as the location with
the maximum response for one object part.
Meanwhile, considering that object parts may be missing in some images due
to varied poses and occlusion, we set a threshold μ such that if the maximum
response of a part is below μ, we simply discard that part's channel in the
classification network for this image. Let g(h, w, c) = σ(h, w, c) ∗ G; the inferred
part locations are given as:

    (h*_c, w*_c) = argmax_{h,w} g(h, w, c)   if g(h*_c, w*_c, c) > μ,
    (h*_c, w*_c) = (−1, −1)                  otherwise.    (4.3)
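A sketch of the inference rule in Eq. (4.3); the 5 × 5 kernel size, σ = 1, and the threshold μ = 0.1 are illustrative choices, not values from the text:

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def smooth(heat, kernel):
    """'Same'-size 2D convolution with zero padding."""
    p = kernel.shape[0] // 2
    padded = np.pad(heat, p)
    out = np.zeros_like(heat)
    for i in range(heat.shape[0]):
        for j in range(heat.shape[1]):
            out[i, j] = (padded[i:i + kernel.shape[0],
                                j:j + kernel.shape[1]] * kernel).sum()
    return out

def infer_parts(softmax_maps, mu=0.1):
    """softmax_maps: (M+1, 28, 28) with channel 0 = background.
    Returns M (h, w) locations, or (-1, -1) for suppressed parts."""
    kernel = gaussian_kernel()
    locs = []
    for c in range(1, softmax_maps.shape[0]):
        g = smooth(softmax_maps[c], kernel)          # g = sigma * G
        h, w = np.unravel_index(np.argmax(g), g.shape)
        locs.append((int(h), int(w)) if g[h, w] > mu else (-1, -1))
    return locs

maps = np.zeros((3, 28, 28))
maps[1, 10, 20] = 1.0          # confident detection for part 1; part 2 missing
print(infer_parts(maps))       # [(10, 20), (-1, -1)]
```

Smoothing before the argmax is what suppresses isolated spurious responses, while the threshold μ implements the missing-part case of Eq. (4.3).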
Figure 4.7: Different strategies for feature fusion: (a) Fully Connected (FC), (b) Scale Sum (SS), (c) Scale Max (SM), and (d) Scale Average Max (SAM).
4.4.2 Classification network
The second stage of the proposed DPS-CNN is a classification network that takes
the inferred part locations as input. As shown in Figure 4.6, it follows a
two-stream architecture with a Part Stream and an Object Stream to capture
semantics from different angles. The outputs of the two streams are fed into a
feature fusion layer followed by a fully connected layer and a softmax layer.
Part stream. The part stream is the core of the proposed DPS-CNN archi-
tecture. To capture object-part-dependent differences between fine-grained cate-
gories, one could train a set of part CNNs, each of which conducts classification
on one part separately, as proposed by Zhang et al. [120]. Although such a method
works well for situations employing two object parts [120], we argue that this
approach is not applicable when the number of object parts is much larger, as in
our case, because of the high time and space complexity.
We introduce two strategies to improve part stream efficiency, the first being
model parameter sharing. Specifically, model parameters of layers before the part
crop layer and inception-4e are shared among all object parts and can be regarded
as a generic part-level feature extractor. This strategy reduces the number of
parameters in the proposed architecture and thus reduces the risk of overfitting.
We also introduce a part crop layer as a computational sharing strategy. The
layer ensures that the feature extraction procedure of all parts only requires one
pass through the convolutional layers.
After performing the shared feature extraction procedure, the computation
of each object part is then partitioned through a part crop layer to model part-
specific classification cues. As shown in Figure 4.6, the input for the part crop
layer is a set of feature maps (the output of inception-4a layer in our architec-
ture) and the predicted part locations from the previous localization network,
which also reside in inception-4a feature maps. For each part, the part crop layer
extracts a local neighborhood centered on the detected part location. Features
outside the cropped region are simply discarded. In practice, we crop l×h neigh-
borhood regions from the 28×28 inception-4a feature maps. The cropped size of
feature regions may have an impact on recognition performance, because larger
crops will result in redundancy when extracting multiple part features, while
smaller crops cannot guarantee rich enough information. For simplicity, we use
l = h = 7 in this chapter to ensure that the resulting receptive field is large
enough to cover the entire part.
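A minimal sketch of the part crop layer's forward pass follows. Array layout, zero-filling for absent parts, and the border-clamping convention are assumptions for illustration.

```python
import numpy as np

def crop_part_features(feature_maps, part_locations, l=7, h=7):
    """Crop an l x h neighborhood around each detected part location.

    feature_maps: (C, H, W) output of the shared convolutional trunk
    (e.g. the 28 x 28 inception-4a maps); part_locations: list of
    (row, col) in feature-map coordinates, with (-1, -1) marking
    absent parts. Returns one (C, l, h) crop per part (zeros if absent).
    """
    C, H, W = feature_maps.shape
    crops = []
    for (r, c) in part_locations:
        crop = np.zeros((C, l, h), dtype=feature_maps.dtype)
        if (r, c) != (-1, -1):
            # Clamp the window so the crop stays inside the feature maps
            r0 = min(max(r - l // 2, 0), H - l)
            c0 = min(max(c - h // 2, 0), W - h)
            crop = feature_maps[:, r0:r0 + l, c0:c0 + h].copy()
        crops.append(crop)
    return crops
```

Because all crops read from the same shared feature maps, feature extraction for every part costs only a single pass through the convolutional trunk.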
Object stream. The object stream captures object-level semantics for fine-
grained recognition. It follows the general architecture of BN-GoogleNet, in
which the input of the network is a 448 × 448 RGB image and the output of the
inception-5b layer is a set of 14 × 14 feature maps. We therefore use 14 × 14
average pooling instead of the 7 × 7 pooling in the original setting.
The design of the two-stream architecture in DPS-CNN is analogous to the
famous Deformable Part-based Models [24], in which object-level features are cap-
tured through a root filter in a coarser scale, while detailed part-level information
is modeled by several part filters at a finer scale. We find it critical to measure
visual cues from multiple semantic levels in an object recognition algorithm.
We conduct standard gradient descent to train the classification network.
It should be noted, however, that the gradient of each element $X_{i,j}$ in the
inception-4a feature maps is calculated by the following equation:

$$\frac{\partial E}{\partial X_{i,j}} = \sum_{c=1}^{M} \phi\!\left(\frac{\partial E}{\partial X^c_{i,j}}\right), \tag{4.4}$$

where E is the loss function, $X^c_{i,j}$ is the feature map cropped by part c, and

$$\phi\!\left(\frac{\partial E}{\partial X^c_{i,j}}\right) = \begin{cases} \dfrac{\partial E}{\partial X^c_{i,j}} & \text{if } X_{i,j} \text{ corresponds to } X^c_{i,j},\\[4pt] 0 & \text{otherwise.} \end{cases} \tag{4.5}$$
Specifically, the gradient of each cropped part feature map (at 7 × 7 spatial
resolution) is projected back to the original size of the inception-4a feature
maps (28 × 28) according to the respective part location, and the contributions
are then summed. The computation of all other layers simply follows the
standard gradient rules. Note that the proposed DPS-CNN is implemented as a
two-stage framework, i.e., after training the FCN, the weights of the
localization network are fixed while training the classification network.
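The backward pass of the part crop layer (Eqs. 4.4 and 4.5) can be sketched as a scatter-add. This is a numpy illustration under assumed shapes and the same clamping convention as the forward crop, not the Caffe implementation.

```python
import numpy as np

def part_crop_backward(grad_crops, part_locations, H=28, W=28, l=7, h=7):
    """Backward pass of the part crop layer (Eqs. 4.4 and 4.5 sketch).

    Each cropped gradient (C, l, h) is projected back to its location in
    the full-size (C, H, W) feature maps; positions outside a crop get
    zero gradient, and contributions from all parts are summed.
    """
    C = grad_crops[0].shape[0]
    grad_full = np.zeros((C, H, W), dtype=grad_crops[0].dtype)
    for g, (r, c) in zip(grad_crops, part_locations):
        if (r, c) == (-1, -1):
            continue  # absent parts contribute no gradient
        r0 = min(max(r - l // 2, 0), H - l)
        c0 = min(max(c - h // 2, 0), W - h)
        grad_full[:, r0:r0 + l, c0:c0 + h] += g
    return grad_full
```

Overlapping crops simply accumulate, which is exactly the sum over c in Eq. 4.4.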
Feature Fusion
The most common method [50, 120] for combining all part-level and object-level
features is to simply concatenate all the feature vectors, as illustrated in
Figure 4.7(a). However, this approach may cause feature redundancy and suffers
from high dimensionality when the number of parts becomes large. To effectively
utilize all part- and object-level features, we present three options for
learning fusion features: scale sum (SS), scale max (SM), and scale mean-max
(SMM), as illustrated in Figure 4.7(b), Figure 4.7(c), and Figure 4.7(d),
respectively. All
three methods include the shared process of placing a scale layer on top of each
branch. Nevertheless, as indicated by their names, the scale sum feature is the
element-wise sum of all output branches, the scale max feature is generated by
an element-wise maximum operation, while the scale average-max feature is the
concatenation of element-wise mean and max features. In our previous work [37],
based on the standard CaffeNet architecture, each branch from the part stream
and the object stream was connected to an independent fc6 layer to encourage
feature diversity, and the final fusion feature was the sum of all the outputs
of these fc6 layers. As this fusion process requires M + 1 times more model
parameters than the original fc6 layer in CaffeNet and consequently incurs a
huge memory cost, a 1 × 1 convolutional layer is used for dimensionality
reduction. Here
we redesign this component for simplicity and to improve performance. First, a
shared inception module is placed on top of the cropped part region to generate
higher level features. Also, a scale layer follows each branch feature to encour-
age diversity between parts. Furthermore, the scale layer has fewer parameters
than the fully connected layer and, therefore, reduces the risk of overfitting and
decreases the model storage requirements.
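The three fusion options can be sketched in a few lines. Feature dimensions and the learned per-branch scale vectors are assumptions; in the actual network the scales are trainable parameters of the scale layers.

```python
import numpy as np

def fuse_features(branches, scales, mode="smm"):
    """Fuse per-branch features after per-branch scale layers.

    branches: list of (D,) feature vectors (part and object branches);
    scales: one scale vector per branch. 'ss' = element-wise sum,
    'sm' = element-wise max, 'smm' = concat of mean and max (2D-d).
    """
    scaled = np.stack([s * b for s, b in zip(scales, branches)])  # (N, D)
    if mode == "ss":
        return scaled.sum(axis=0)
    if mode == "sm":
        return scaled.max(axis=0)
    if mode == "smm":
        return np.concatenate([scaled.mean(axis=0), scaled.max(axis=0)])
    raise ValueError(mode)
```

Note that SS and SM keep the fused feature at D dimensions regardless of the number of parts, while SMM doubles it to 2D; all three avoid the (N+1)·D growth of plain concatenation.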
4.5 Experiments
In this section we present experimental results and a thorough analysis of the
proposed methods. Specifically, we evaluate the performance from four different
aspects: localization accuracy, classification accuracy, inference efficiency, and
model interpretation.
4.5.1 Dataset and implementation details
Experiments are conducted on the widely used fine-grained classification bench-
mark, the Caltech-UCSD Birds dataset (CUB-200-2011) [105]. The dataset contains
200 bird categories with roughly 30 training images per category. In the
training phase we adopt the strong supervision available in the dataset, i.e.,
we employ 2D keypoint annotations of altogether M = 15 object parts together
with image-level labels and object bounding boxes.
The labeled parts¹ indicate the places people usually focus on when asked to
classify fine-grained categories; they thus provide valuable information for
building human-understandable systems.
Both the Part-Stacked CNN and Deeper Part-Stacked CNN architectures are
implemented using the open-source package Caffe [40]. Specifically, input
images are warped to a fixed size of 512 × 512, randomly cropped to 448 × 448,
and then fed as input into the localization network and the part stream of the
classification network.
4.5.2 Localization results for PSCNN
As the localization results in our method are delivered directly to the
classification network at the feature-map level, we do not aim for accurate
pixel-level keypoint localization but instead focus on a coarser correctness
measure. Localization accuracy is quantitatively assessed using APK (Average
Precision of Keypoints) [117]. Following [58], we consider a keypoint to be
correctly predicted if the prediction lies within a Euclidean distance of α
times the maximum of the bounding box width and height from the ground truth.
We set α = 0.1 in all the analysis below.
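The correctness criterion is a one-liner; the function name and argument layout are illustrative only.

```python
import math

def is_correct_pck(pred, gt, box_w, box_h, alpha=0.1):
    """A keypoint prediction is correct if it lies within
    alpha * max(bounding-box width, height) of the ground truth."""
    tol = alpha * max(box_w, box_h)
    return math.dist(pred, gt) <= tol
```

APK is then the average precision computed over all predictions ranked by confidence, with this test deciding which detections count as true positives.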
The adopted FCN architecture in PS-CNN achieves 86.6% APK on the test set of
CUB-200-2011 for the 15 object parts. Specifically, the additional 1 × 1
convolutional layer and the employed Gaussian smoothing
¹The 15 object parts are: back, beak, belly, breast, crown, forehead, left eye, left leg, left wing, nape, right eye, right leg, right wing, tail, and throat.
Figure 4.8: Typical localization results on the CUB-200-2011 test set. We show 6 of the 15 detected parts: beak (red), belly (green), crown (blue), right eye (yellow), right leg (magenta), and tail (cyan). Best viewed in color.
part  throat  beak   crown  forehead  right eye  nape   left eye  back
APK   0.908   0.894  0.894  0.885     0.861      0.857  0.850     0.807

part  breast  belly  right leg  tail   left leg  right wing  left wing  overall
APK   0.799   0.794  0.775      0.760  0.750     0.678       0.670      0.866

Table 4.1: APK for each object part in the CUB-200-2011 test set, in descending order.
kernel deliver 1.5% and 2% improvements, respectively, over the results
obtained using the standard five convolutional layers of AlexNet. To further
understand the performance gains from our network designs, we also show
experimental comparisons of different model architectures in Table 4.2 using
the following evaluation metrics.
a) Mean Precision of Key points over images (MPK).
b) Mean Recall of Key points over images (MRK).
c) Average Precision of Key points (APK).
Model architecture               MPK   MRK   APK

conv5+cls                        70.0  80.6  83.5
conv5+conv6(256)+cls             71.3  81.8  84.7
conv5+conv6(512)+cls             71.5  81.9  84.8
conv5+conv6(512)+cls+gaussian    80.0  83.8  86.6

Table 4.2: Comparison of different model architectures on localization results. "conv5" stands for the first 5 convolutional layers in CaffeNet; "conv6(256)" stands for the additional 1 × 1 convolutional layer with 256 output channels; "cls" denotes the classification layer with M + 1 output channels; "gaussian" represents a Gaussian kernel for smoothing.
Furthermore, we present per-part APKs in Table 4.1. An interesting phenomenon
is that parts residing near the head of the bird tend to be located more
accurately, since the bird's head has a relatively stable structure with fewer
deformations and a lower probability of being occluded. In contrast, highly
deformable parts such as wings and legs receive lower APK values. Figure 4.8
shows typical localization results of the proposed method.
4.5.3 Classification results for PSCNN
We begin the analysis of classification results with a study of the
discriminative power of each object part. Each time, we select one object part
as the input and discard the computation of all other parts. Different parts
reveal significantly different classification results: the most discriminative
part, crown, by itself achieves a quite impressive accuracy of 57%, while the
lowest accuracy is only 10% for the part beak. Therefore, to obtain better
classification results, it may be beneficial to find a rational combination or
ordering of object parts instead of directly running the experiments on all
parts together.
We, therefore, introduce a strategy that incrementally adds object parts to
BBox only +2 part +4 part +8 part +15 part
69.08 73.72 74.84 76.63 76.41
Table 4.4: The effect of increasing the number of object parts on the classificationaccuracy.
the whole framework and iteratively trains the model. Specifically, starting
from a model trained with bounding-box supervision only, which is also the
baseline of the proposed method, we iteratively insert object parts into the
framework and re-finetune the PS-CNN model. The number of parts used grows
exponentially, i.e., in the i-th iteration, 2^i parts are selected and
inserted. When starting from an initialized model with relatively high
performance, introducing a new object part into the framework does not require
running a brand new classification procedure based on this particular part
alone; ideally, only the classification of highly confusing categories that
can be distinguished by the new part will be impacted and amended. As a result,
this procedure overcomes the drawback caused by object parts with low
discriminative power. In our implementation, the order of part inclusion is
determined by discriminative power, measured by the classification accuracy
obtained using each part alone (see Supplementary for details). Table 4.4
reveals that as the number of object parts increases from 0 to 8, the
classification accuracy improves gradually and then saturates. Further
increasing the number of parts does not lead to better accuracy; however, it
does provide more resources for performing explicit model interpretation.
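The exponential insertion schedule can be made concrete. The function name and the cap at the total number of annotated parts are assumptions matching the cumulative counts in Table 4.4 (2, 4, 8, 15).

```python
def part_counts(total_parts=15, max_iters=4):
    """Cumulative number of parts used at iteration i: 2**i, capped at
    the total number of annotated parts (2, 4, 8, 15 for M = 15)."""
    return [min(2 ** i, total_parts) for i in range(1, max_iters + 1)]
```

At each iteration, the parts added are the next-most discriminative ones according to their single-part classification accuracy.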
Table 4.5 shows the performance comparison between PS-CNN and existing
fine-grained recognition methods. Since the CNN architecture has a large
impact on recognition performance, for a fair comparison we only compare
results reported on the standard seven-layer architecture. Deeper models would
undoubtedly lead to better accuracy but also to lower efficiency. The complete
PS-CNN model with a bounding box and 15 object parts achieves 76% accuracy,
which is comparable with part-based R-CNN [120], while being slightly lower
than several recent state-of-the-art methods [50, 52, 84] due to the
effectiveness-efficiency tradeoff. In particular, our model is over two orders of
magnitude faster than [120], requiring only 0.05 seconds to perform end-to-end
classification of a test image. This number is quite encouraging, especially
considering the number of parts used in the proposed method. This efficiency
makes it possible to run the proposed method in real time, leading to potential
applications in the video domain.
Method                  Train Anno.  Test Anno.  Acc.

Constellation [86]      n/a          n/a         68.5
Attention [111]         n/a          n/a         69.7
Bilinear-CNN [52]       n/a          n/a         74.2
Weak FGVC [127]         n/a          n/a         75.0
CNNaug [76]             BBox         BBox        61.8
Alignment [28]          BBox         BBox        67.0
No parts [46]           BBox         BBox        74.9
Bilinear-CNN [52]       BBox         BBox        80.4
Part R-CNN [120]        BBox+Parts   n/a         73.9
PoseNorm CNN [11]       BBox+Parts   n/a         75.7
POOF [6]                BBox+Parts   BBox        56.8
DPD+DeCAF [22]          BBox+Parts   BBox        65.0
Deep LAC [50]           BBox+Parts   BBox        80.2
Multi-proposal [84]     BBox+Parts   BBox        80.3
Part R-CNN [120]        BBox+Parts   BBox        76.4
PS-CNN                  BBox+Parts   BBox        76.6

Table 4.5: Comparison with state-of-the-art methods on the CUB-200-2011 dataset. To conduct fair comparisons, for all methods using deep features we report their results on the standard seven-layer architecture (mostly AlexNet, except VGG-M for [52]) where possible. Note that our method achieves results comparable with the state of the art while running in real time.
4.5.4 Localization Results for DPSCNN
Following [58], we consider a key point to be correctly predicted if the predic-
tion lies within a Euclidean distance of α times the maximum of the input width
and height compared to the ground truth. Localization results are reported on
multiple values of α ∈ {0.1, 0.05, 0.02} in the analysis below. The value α in the
PCK metric is introduced to measure the error tolerance in keypoint localization.
To investigate the effect of the selected layer for keypoint localization, we
perform experiments using the inception-4a, inception-4b, inception-4c, and
inception-4d layers as part detector layers. As shown in Table 4.7, with
α = 0.1, a higher layer with a larger receptive field tends to achieve better
localization performance than a lower layer. This is mainly because larger
receptive fields are crucial for capturing spatial relationships between parts,
which improves performance (see Table 4.6). In contrast, for α = 0.05 or 0.02,
performance decreases at deeper layers. One possible explanation is that
although higher layers capture better semantic information about the object,
they lose more detailed spatial information. To evaluate the effectiveness of
our keypoint localization approach, we also compare it with recently published
works [37, 118, 124] providing PCK evaluation results on CUB-200-2011, along
with experimental results using a more consistent evaluation metric, the
average precision of keypoints (APK), which correctly penalizes both missed and
false-positive detections [117]. As can be seen from Table 4.7, our method
outperforms existing techniques at various α settings in terms of PCK. Most
strikingly, our approach outperforms the compared methods by large margins when
using small α values.
For the keypoint localization task, we follow the pipeline of proposal-based
object detection methods: centers of receptive fields corresponding to a
certain layer are first regarded as candidate points and then forwarded to a
fully convolutional network for further classification. As in object detection
with proposals, whether the selected candidate points provide good coverage of
the pixels of interest in the test image plays a crucial role in keypoint
localization, since missed keypoints cannot be recovered in subsequent
classification. Thus, we first evaluate the candidate point sampling method.
The evaluation is based on the PCK metric [117], in which the error tolerance
is normalized by the input image size. For consistency with the evaluation of
keypoint localization, a ground-truth point is considered recalled if there
exists a candidate point matching it under the PCK metric. Table 4.9 shows the
localization recall of candidate points selected by inception-4a with different
α values (0.05, 0.02, and 0.01). As expected, candidate points sampled from the
inception-4a layer have good coverage of the ground truth under the PCK metric
with α > 0.02. However, the recall drops dramatically when using α = 0.01.
This is mainly because of the large stride (16) of the inception-4a layer,
which results in the
Table 4.6: Receptive field size of different layers.

Layer          Rec. Field
Inception-4a   107 × 107
Inception-4b   139 × 139
Inception-4c   171 × 171
Inception-4d   204 × 204
Table 4.7: Comparison of per-part PCK (%) and overall APK (%) on CUB-200-2011. The abbreviated part names from left to right are: Back, Beak, Belly, Breast, Crown, Forehead, Left Eye, Left Leg, Left Wing, Nape, Right Eye, Right Leg, Right Wing, Tail, and Throat.

α     Method    Ba   Bk   Be   Br   Cr   Fh   Le   Ll   Lw   Na   Re   Rl   Rw   Ta   Th   Avg  APK

0.1   [37]      80.7 89.4 79.4 79.9 89.4 88.5 85.0 75.0 67.0 85.7 86.1 77.5 67.8 76.0 90.8 81.2 86.6
      [124]     85.6 94.9 81.9 84.5 94.8 96.0 95.7 64.6 67.8 90.7 93.8 64.9 69.3 74.7 94.5 83.6 -
      [118]     94.0 82.5 92.2 93.0 92.2 91.5 93.3 69.7 68.1 86.0 93.8 74.2 68.9 77.4 93.4 84.7 -
      Ours(4a)  82.7 94.1 85.3 87.8 95.2 93.3 88.6 75.5 75.9 92.0 89.5 76.6 75.9 67.4 94.7 84.9 89.1
      Ours(4b)  87.4 93.6 87.4 88.9 95.2 93.7 88.3 73.3 77.6 93.4 88.9 76.3 79.0 70.5 94.5 85.9 88.9
      Ours(4c)  89.0 95.1 91.5 92.6 95.7 94.7 90.3 78.5 82.3 94.4 91.0 73.2 81.9 78.4 95.7 88.3 90.9
      Ours(4d)  89.0 95.0 92.2 93.2 95.2 94.2 90.5 73.2 81.5 94.4 91.6 75.5 82.3 83.2 95.8 88.5 91.2

0.05  [37]      48.8 63.7 44.5 50.3 50.2 43.7 80.0 44.8 42.7 60.1 59.4 46.5 39.8 46.8 71.9 52.9 62.7
      [124]     46.8 62.5 40.7 45.1 59.8 63.7 66.3 33.7 31.7 54.3 63.8 36.2 33.3 39.6 56.9 49.0 -
      [118]     66.4 49.2 56.4 60.4 61.0 60.0 66.9 32.3 35.8 53.1 66.3 35.0 37.1 40.9 65.9 52.4 -
      Ours(4a)  70.6 89.5 69.5 75.0 89.0 87.8 87.1 58.5 57.6 84.6 87.8 59.6 60.2 56.3 90.0 74.9 80.4
      Ours(4b)  69.2 79.4 69.0 74.5 73.2 72.3 85.7 53.3 58.3 83.7 86.0 55.5 60.1 59.0 86.5 74.5 71.1
      Ours(4c)  62.3 57.1 67.6 72.2 49.1 47.0 84.6 49.7 57.6 79.3 84.9 44.1 56.9 63.7 82.6 63.0 67.9
      Ours(4d)  42.3 27.5 59.7 60.6 21.3 23.3 82.2 33.1 49.6 65.6 82.4 37.4 47.5 66.7 69.4 51.3 54.5

0.02  [37]      11.1 16.9 9.1  11.2 5.2  4.1  40.4 9.4  10.8 14.6 9.9  11.9 9.6  11.2 22.3 13.2 13.3
      [124]     9.4  12.7 8.2  12.2 13.2 11.3 7.8  6.7  11.5 12.5 7.3  6.2  8.2  11.8 56.9 13.1 -
      [118]     18.8 12.8 14.2 15.9 15.9 16.2 20.3 7.1  8.3  13.8 19.7 7.8  9.6  9.6  18.3 13.8 -
      Ours(4a)  24.9 31.0 23.0 28.3 25.1 26.6 44.8 19.6 17.4 38.4 46.9 20.9 20.7 22.0 37.5 28.5 17.2
      Ours(4b)  19.7 15.8 21.6 24.0 9.1  8.1  40.7 16.0 16.8 32.6 43.1 16.7 17.7 23.6 29.8 22.4 13.5
      Ours(4c)  12.5 5.9  17.9 17.9 2.6  3.0  41.4 12.0 15.0 22.2 41.4 8.9  14.9 24.0 23.1 17.5 11.8
      Ours(4d)  6.4  1.9  14.1 11.8 1.0  2.1  36.7 4.9  10.9 15.5 38.5 5.9  10.4 24.0 17.0 13.4 9.3
distance between two closest candidate points being 16 pixels, while an input
size of 448 with α = 0.01 requires a candidate point to lie within 4.48 pixels
of the ground truth.
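The stride arithmetic above can be checked directly. The grid offset of half a stride is an assumption about where receptive-field centers fall; the conclusion only depends on the 16-pixel spacing.

```python
def candidate_centers(input_size=448, stride=16, offset=8):
    """1-D grid of receptive-field centers for a stride-16 layer,
    used as candidate keypoint coordinates along one axis."""
    return [offset + i * stride for i in range(input_size // stride)]

# Worst case: a ground-truth point midway between two grid lines is
# stride / 2 = 8 px from the nearest candidate along each axis, which
# exceeds the alpha = 0.01 tolerance of 0.01 * 448 = 4.48 px.
```

This is why recall collapses at α = 0.01 but stays high at α ≥ 0.02 (tolerance 8.96 px).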
The part localization architecture adopted in DPS-CNN achieves a best average
[email protected] of 88.5% on the CUB-200-2011 test set for the 15 object parts.
Specifically, the employed Gaussian smoothing kernel delivers a 2% improvement
over methods that use standard convolutional layers in BN-GoogleNet. Figure 4.9
shows typical localization results of the proposed method.
Table 4.8: Classification accuracy (%) on CUB-200-2011 using each single object part as input. The abbreviated part names from left to right are: Back, Beak, Belly, Breast, Crown, Forehead, Left Eye, Left Leg, Left Wing, Nape, Right Eye, Right Leg, Right Wing, Tail, and Throat.

Part          Ba   Bk   Be   Br   Cr   Fh   Le   Ll   Lw   Na   Re   Rl   Rw   Ta   Th
Accuracy(%)   47.9 63.7 43.9 56.8 66.8 66.1 36.6 30.8 30.4 64.8 36.1 29.2 29.7 20.0 68.7
Figure 4.9: Typical localization results on the CUB-200-2011 test set, showing ground-truth and predicted locations for all 15 parts (back, beak, belly, breast, crown, forehead, left eye, left leg, left wing, nape, right eye, right leg, right wing, tail, and throat). Best viewed in color.
Figure 4.10: Feature map visualization of the Inception-4a layer. Each example image is followed by three rows of the top six scoring feature maps, which are from the part stream, object stream, and baseline BN-Inception network, respectively. The red dashed box indicates a failure case of visualization using the model learned by our approach.
4.5.5 Classification results for DPSCNN
We begin our classification analysis by studying the discriminative power of
each object part. Each time, we select one object part as the input and discard
the computation of all other parts. As shown in Table 4.8, different parts
produce significantly different classification results. The most discriminative
part, "Throat", achieves a quite impressive accuracy of 68.7%, while the lowest
accuracy is 20.0% for the part "Tail". Therefore, to improve classification, it
may be beneficial to find a rational combination or ordering of object parts
instead of directly running the experiment on all parts together. More
interestingly, comparing the results in Table 4.7 and Table 4.8, it can be seen
that parts located more accurately, such as Throat, Nape, Forehead, and Beak,
tend to achieve better performance in the recognition task, while parts with
poor localization accuracy, such as Tail and Left Leg, perform worse. This
observation supports the hypothesis that a more discriminative part is easier
to locate in the context of fine-grained categorization, and vice versa.
To evaluate our framework's overall performance, we first train a baseline
model with 81.56% accuracy using a BN-Inception architecture [38] pre-trained
on ImageNet [79]. By stacking certain part features and applying our proposed
fusion method, our framework improves this to 85.12%. To evaluate our proposed
feature fusion method, we then train four DPS-CNN models with the same
experimental settings (maximum iterations and learning rate) but different
feature fusion methods. The results shown in Table 4.10 (Rows 2-5) demonstrate
that SMM fusion achieves the best performance, outperforming the FC method by
1.69%.
To investigate which parts should be selected in our learning framework, we
conduct the following experiments guided by two principles: feature
discrimination and feature diversity. Here we consider parts with higher
accuracy in Table 4.8 to be more discriminative, and combinations of parts with
distant locations to be more diverse. We first select the top 6 parts with the
highest accuracy from Table 4.8 by applying only the discriminative principle,
then choose 3, 5, 9, and 15 parts respectively by taking both principles into
account. Experimental results are shown in Table 4.10 (Rows 6-10), where we
observe that
Table 4.9: Localization recall of candidate points selected by the inception-4a layer with different α values. The abbreviated part names from left to right are: Back, Beak, Belly, Breast, Crown, Forehead, Left Eye, Left Leg, Left Wing, Nape, Right Eye, Right Leg, Right Wing, Tail, and Throat.

α     Ba   Bk   Be   Br   Cr   Fh   Le   Ll   Lw   Na   Re   Rl   Rw   Ta   Th   Avg
0.05  100  100  100  100  100  100  100  100  100  100  100  100  100  100  100  100
0.02  90.8 89.8 90.8 90.4 90.9 91.4 90.4 90.4 90.0 90.7 90.3 89.9 90.3 90.5 90.3 90.5
0.01  26.8 26.3 9.1  11.2 5.2  4.1  40.4 9.4  10.8 14.6 9.9  11.9 9.6  11.2 22.3 13.2
Table 4.10: Comparison of different settings of our approach on CUB-200-2011.

Row  Setting                  Acc(%)
1    Object Only (Baseline)   81.56
2    5-parts + FC             81.86
3    5-parts + SS             83.06
4    5-parts + SM             83.41
5    5-parts + SMM            83.55
6    6-parts + SMM            84.12
7    3-parts + SMM            84.29
8    5-parts + SMM            84.91
9    9-parts + SMM            85.12
10   15-parts + SMM           84.45
increasing the number of parts brings slight improvements. However, all of
these settings perform better than using the six most discriminative parts.
This is mainly because most of those parts are adjacent to each other and
therefore fail to produce diverse features in our framework. It should also be
noted that using all part features does not guarantee the best performance; on
the contrary, it results in lower accuracy. This finding shows that the feature
redundancy caused by appending an excessive number of parts may degrade
accuracy, and suggests that an appropriate strategy for integrating multiple
parts is critical.
We also present a performance comparison between DPS-CNN and existing
fine-grained recognition methods. As can be seen in Table 4.11, our approach,
using only keypoint annotations during training, achieves 85.12% accuracy,
which is comparable with the state-of-the-art method [52] that achieves 85.10%
using bounding boxes in both training and testing. Moreover, our method is
interpretable and faster
Method                             Train Anno.  Test Anno.  Pre-trained Model  FPS  Acc(%)

Part-Stacked CNN [37]              BBox+Parts   BBox        AlexNet            20   76.62
Deep LAC [50]                      BBox+Parts   BBox        AlexNet            -    80.26
Part R-CNN [120]                   BBox+Parts   BBox        AlexNet            -    76.37
SPDA-CNN [119]                     BBox+Parts   BBox        VGG16              -    84.55
SPDA-CNN [119] + ensemble          BBox+Parts   BBox        VGG16              -    85.14
Part R-CNN [120] without BBox      BBox+Parts   n/a         AlexNet            -    73.89
PoseNorm CNN [11]                  BBox+Parts   n/a         AlexNet            -    75.70
Bilinear-CNN (M+D+BBox) [52]       BBox         BBox        VGG16+VGGM         8    85.10
Bilinear-CNN (M+D) [52]            n/a          n/a         VGG16+VGGM         8    84.10
Constellation-CNN [86]             n/a          n/a         VGG19              -    84.10
Spatial Transformer CNN [39]       n/a          n/a         Inception+BN       -    84.10
Two-Level [111]                    n/a          n/a         VGG16              -    77.90
Co-Segmentation [46]               BBox         BBox        VGG19              -    82.80
DPS-CNN with 9 parts               Parts        n/a         Inception+BN       32   85.12
DPS-CNN ensemble with 4 models     Parts        n/a         Inception+BN       8    86.56

Table 4.11: Comparison with state-of-the-art methods on the CUB-200-2011 dataset.
- the entire forward pass of DPS-CNN runs at 32 frames/sec (NVIDIA TitanX),
while B-CNN [D,M] [52] runs at 8 frames/sec (NVIDIA K40)¹. In particular, our
method is much faster than proposal-based methods such as [120] and [119],
which require multiple network forward propagations for proposal evaluation,
whereas part detection and feature extraction are accomplished efficiently by a
single forward pass in our approach. In addition, we combine four models
stemming from integrating different parts (listed in Table 4.10, Rows 7-10) to
form an ensemble, which achieves 86.56% accuracy on CUB-200-2011.
To understand what features are learned in DPS-CNN, we use the aforementioned
five-part model and compare its feature map visualization with that of a
BN-Inception model fine-tuned on CUB-200-2011. Specifically, we pick the top
six scoring feature maps of the Inception-4a layer for visualization, where the
score is the sum over each feature map. As shown in Figure 4.10, each example
image from the test set is followed by three rows of feature maps, which from
top to bottom are selected from the part stream, object stream, and
BN-Inception baseline network, respectively. Interestingly, our part stream has
learned feature maps that appear more intuitive than those learned by the other
two methods. Specifically, it yields more focused and cleaner patterns
¹Note that the computational power of a TitanX is around 1.5 times that of a K40.
which tend to be highly activated by the network. Moreover, we observe that the
object stream and baseline network are more likely to activate filters with
extremely high-frequency details at the expense of extra noise, while the part
stream tends to capture a mixture of low- and mid-frequency information. The
red dashed box in Figure 4.10 indicates a failure example, in which both our
part stream and object stream fail to learn useful features. This may be caused
by our part localization network failing to locate the Crown and Left Leg
parts, because the branch in this image looks similar to bird legs and another,
occluded bird also affects locating the Crown part.
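The feature-map ranking used for the visualization above (score = sum over each map, keep the top six) can be sketched as follows; the function name and array layout are illustrative.

```python
import numpy as np

def top_scoring_maps(feature_maps, k=6):
    """Pick the k feature maps with the largest total activation
    (score = sum over each map), as used for visualization.

    feature_maps: (C, H, W) activations of one layer for one image.
    Returns channel indices, highest-scoring first.
    """
    scores = feature_maps.sum(axis=(1, 2))   # one score per channel
    return np.argsort(scores)[::-1][:k]
```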
4.5.6 Model interpretation
One of the most prominent features of the DPS-CNN method is that it can produce
human-understandable interpretation manuals for fine-grained recognition. Here
we directly borrow the idea from [37] for interpretation using the proposed
method.

Different from [6], which directly conducted one-on-one classification on
object parts, the interpretation process of the proposed method is conducted
more indirectly. Since using each object part alone does not produce convincing
classification results, we perform the interpretation analysis on a combination
of bounding-box supervision and each single object part. The analysis is
performed in two ways: a "one-versus-rest" comparison to find the most
discriminative part for separating a subcategory from all other classes, and a
"one-versus-one" comparison to obtain the classification criteria
distinguishing a subcategory from its most similar classes.
• The "one-versus-rest" manual for an object category k. For every part p,
we compute the sum of prediction scores over the category's positive samples.
The most discriminative part is then the one with the largest accumulated
score:

$$p_k^* = \arg\max_p \sum_{i:\, y_i = k} S^{(p)}_{ip}. \tag{4.6}$$
• The "one-versus-one" manual is obtained by finding the part that yields the
largest difference in prediction scores between two categories k and l. We
first take the two corresponding rows of the score matrix S and re-normalize
them using the binary classification criterion to obtain S′. The most
discriminative part is then given as:

$$p_{k \to l}^* = \arg\max_p \left( \sum_{i:\, y_i = k} S'^{(p)}_{ip} + \sum_{j:\, y_j = l} S'^{(p)}_{jp} \right). \tag{4.7}$$
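The one-versus-rest rule of Eq. 4.6 reduces to a masked column sum; a minimal sketch, assuming a (samples × parts) layout for the score matrix S:

```python
import numpy as np

def most_discriminative_part(S, labels, k):
    """One-versus-rest (Eq. 4.6 sketch): for category k, return the part
    whose prediction scores, summed over k's positive samples, are largest.

    S: (num_samples, num_parts) score matrix; labels: (num_samples,).
    """
    mask = labels == k
    return int(np.argmax(S[mask].sum(axis=0)))
```

The one-versus-one rule of Eq. 4.7 is analogous, applied to the re-normalized two-row matrix S′ with the two category masks summed together.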
The model interpretation routine is demonstrated in Figure 4.11. When a test
image is presented, the proposed method first conducts object classification
using the DPS-CNN architecture. The predicted category is presented together
with a set of images in the dataset that are closest to the test image
according to the feature vector of each part. Besides the classification
result, the proposed method also presents the classification criteria that
distinguish the predicted category from its most similar neighboring classes
based on object parts. Again, we use part features after part cropping to
retrieve the nearest-neighbor part patches of the input test image. This
procedure provides an intuitive visual guide for distinguishing fine-grained
categories.
4.6 Conclusion
In this chapter, we propose two CNN structures for fine-grained recognition:
Part-Stacked CNN (PS-CNN) and Deeper Part-Stacked CNN (DPS-CNN). PS-CNN uses a
simple structure for efficient inference, while DPS-CNN uses deeper layers for
higher accuracy. Both methods exploit detailed part-level supervision, in which
object parts are first located by a localization network and the resulting part
features are then processed by a two-stream classification system that
explicitly captures object- and part-level information. We also present a new
feature fusion strategy that effectively combines part and object stream
features. Experiments on CUB-200-2011 demonstrate the effectiveness and
efficiency of our systems. We also present human-understandable interpretations
of the proposed methods, which can be used as a visual field guide for studying
fine-grained categorization.
It is also worth noting that our methods apply to fine-grained visual
categorization with strong supervision and can be easily generalized to various
applications, including:
a) Discarding the requirement for strong supervision. Instead of introducing
manually labeled part annotations to generate human-understandable vi-
sual guides, one can also exploit unsupervised part discovery methods [46]
to define object parts automatically, which requires far less human labeling
effort.
b) Attribute learning. The application of our approaches is not restricted to
   FGVC. For instance, online shopping [60] could benefit from clothing
   attribute analysis of local parts provided by our methods.
c) Context-based CNN. The role of local parts in our method is interchange-
able with global contexts, in particular for objects that are small and have
no apparent object parts such as volleyballs or tennis balls.
Figure 4.11: Example of the prediction manual generated by the proposed approach. Given a test image, the system reports its predicted class label with some typical exemplar images. Part-based comparison criteria between the predicted class and its most similar classes are shown in the right part of the image. The number in brackets shows the confidence of classifying the two categories by introducing a specific part. We present the top three object parts for each pair of comparison. For each of the parts, three part-center-cropped patches are shown for the predicted class (upper rows) and the compared class (lower rows), respectively.
Chapter 5
Conclusions
Keypoint localization is considered a fundamental step in image understanding.
Many important tasks, such as object detection, object recognition, and pose
estimation, can greatly benefit from this technique. The major challenges in
keypoint localization include highly variable appearance, occlusion, high
computational complexity, and insufficient annotated data. To improve
localization accuracy and reduce computational cost, Chapter 2 proposes
hierarchically supervised nets (HSNs), a method that imposes hierarchical
supervision within deep convolutional neural networks (CNNs). We also explore
the problem of insufficient data annotation for keypoint localization in
Chapter 3. Finally, Chapter 4 explores the effectiveness of part localization
techniques in addressing the problem of fine-grained visual categorization.
Existing works mainly perform object detection and keypoint localization in
two stages. However, these two tasks can complement each other, so learning
bounding box regression and keypoint localization jointly is a valuable
direction for future work. Another future direction is training a semantic part
detector in a semi-supervised or unsupervised setting, which has not yet been
well explored, although there has been increasing interest in discovering
discriminative parts in recent years.
References
[1] A. Agarwal and B. Triggs, “Recovering 3d human pose from monocular
images,” IEEE transactions on pattern analysis and machine intelligence,
vol. 28, no. 1, pp. 44–58, 2006. 4
[2] Y. Amit and A. Trouve, “Pop: Patchwork of parts models for object recog-
nition,” International Journal of Computer Vision, vol. 75, no. 2, pp. 267–
282, 2007. 2, 8
[3] M. Andriluka, S. Roth, and B. Schiele, “Pictorial structures revisited: Peo-
ple detection and articulated pose estimation,” in CVPR, 2009. 3, 19
[4] ——, “Monocular 3d pose estimation and tracking by detection,” in Com-
puter Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on.
IEEE, 2010, pp. 623–630. 3
[5] A. Angelova, S. Zhu, and Y. Lin, “Image segmentation for large-scale sub-
category flower recognition,” in WACV. IEEE, 2013, pp. 39–45. 10, 49
[6] T. Berg and P. Belhumeur, “Poof: Part-based one-vs.-one features for
fine-grained categorization, face verification, and attribute estimation,” in
CVPR, 2013. 3, 10, 50, 52, 55, 74, 82
[7] T. Berg and P. N. Belhumeur, “How do you tell a blackbird from a crow?”
in ICCV, 2013. 10, 50
[8] T. Berg, J. Liu, S. W. Lee, M. L. Alexander, D. W. Jacobs, and P. N. Bel-
humeur, “Birdsnap: Large-scale fine-grained visual categorization of birds,”
in CVPR, 2014. 10, 49, 50
[9] L. Bo, X. Ren, and D. Fox, “Kernel descriptors for visual recognition,” in
NIPS, 2010. 55
[10] L. Bo, C. Sminchisescu, A. Kanaujia, and D. Metaxas, “Fast algorithms
for large scale conditional 3d prediction,” in Computer Vision and Pattern
Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008, pp.
1–8. 4
[11] S. Branson, G. Van Horn, S. Belongie, and P. Perona, “Bird species cate-
gorization using pose normalized deep convolutional nets,” arXiv preprint
arXiv:1406.2952, 2014. 10, 50, 52, 55, 74, 81
[12] S. Branson, G. Van Horn, C. Wah, P. Perona, and S. Belongie, “The ig-
norant led by the blind: A hybrid human–machine vision system for fine-
grained categorization,” IJCV, vol. 108, no. 1-2, pp. 3–29, 2014. 10, 50
[13] S. Branson, C. Wah, F. Schroff, B. Babenko, P. Welinder, P. Perona, and
S. Belongie, “Visual recognition with humans in the loop,” in ECCV, 2010.
55
[14] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2d
pose estimation using part affinity fields,” in CVPR, 2017. 16, 19, 30
[15] J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik, “Human pose esti-
mation with iterative error feedback,” in CVPR, 2016. 19
[16] Y. Chai, V. Lempitsky, and A. Zisserman, “Symbiotic segmentation and
part localization for fine-grained categorization,” in ICCV, 2013. 3, 10, 18,
38, 50
[17] X. Chu, W. Ouyang, H. Li, and X. Wang, “Structured feature learning
for pose estimation,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2016. 19
[18] T. F. Cootes, G. J. Edwards, C. J. Taylor et al., “Active appearance mod-
els,” TPAMI, vol. 23, no. 6, pp. 681–685, 2001. 52
[19] N. Dalal and B. Triggs, “Histograms of oriented gradients for human de-
tection,” in CVPR, 2005. 4, 18
[20] M. Dantone, J. Gall, C. Leistner, and L. Van Gool, “Human pose estimation
using body parts dependent joint regressors,” in CVPR, 2013. 19
[21] J. Deng, J. Krause, and L. Fei-Fei, “Fine-grained crowdsourcing for fine-
grained recognition,” in CVPR, 2013. 10, 50, 55
[22] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and
T. Darrell, “Decaf: A deep convolutional activation feature for generic vi-
sual recognition,” arXiv preprint arXiv:1310.1531, 2013. 74
[23] R. Farrell, O. Oza, N. Zhang, V. I. Morariu, T. Darrell, and L. S. Davis,
“Birdlets: Subordinate categorization using volumetric primitives and pose-
normalized appearance,” in ICCV. IEEE, 2011, pp. 161–168. 8
[24] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Ob-
ject detection with discriminatively trained part-based models,” TPAMI,
vol. 32, no. 9, pp. 1627–1645, 2010. 2, 8, 18, 38, 60, 68
[25] P. F. Felzenszwalb and D. P. Huttenlocher, “Pictorial structures for object
recognition,” IJCV, vol. 61, no. 1, pp. 55–79, 2005. 18
[26] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars, “Unsupervised
visual domain adaptation using subspace alignment,” in Proceedings of the
IEEE International Conference on Computer Vision, 2013, pp. 2960–2967.
39
[27] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by back-
propagation,” in Proceedings of the 32nd International Conference on Ma-
chine Learning (ICML-15), 2015, pp. 1180–1189. 36, 39, 42, 43
[28] E. Gavves, B. Fernando, C. G. Snoek, A. W. Smeulders, and T. Tuytelaars,
“Fine-grained categorization by alignments,” in ICCV, 2013. 74
[29] ——, “Local alignments for fine-grained categorization,” IJCV, vol. 111,
no. 2, pp. 191–212, 2015. 55
[30] M. Ghifary, W. Bastiaan Kleijn, M. Zhang, and D. Balduzzi, “Domain gen-
eralization for object recognition with multi-task autoencoders,” in Pro-
ceedings of the IEEE International Conference on Computer Vision, 2015,
pp. 2551–2559. 39
[31] R. Girshick, “Fast r-cnn,” in ICCV, 2015. 53
[32] G. Gkioxari, R. Girshick, and J. Malik, “Actions and attributes from wholes
and parts,” in CVPR, 2015. 52, 55
[33] B. Gong, Y. Shi, F. Sha, and K. Grauman, “Geodesic flow kernel for unsu-
pervised domain adaptation,” in Computer Vision and Pattern Recognition
(CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2066–2073. 39
[34] C. Gu, J. J. Lim, P. Arbelaez, and J. Malik, “Recognition using regions,”
in CVPR, 2009. 20
[35] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in CVPR, 2016. 1, 9, 16, 19
[36] A. Holub, P. Perona, and M. C. Burl, “Entropy-based active learning for
object recognition,” in Computer Vision and Pattern Recognition Work-
shops, 2008. CVPRW’08. IEEE Computer Society Conference on. IEEE,
2008, pp. 1–8. 39
[37] S. Huang, Z. Xu, D. Tao, and Y. Zhang, “Part-stacked cnn for fine-grained
visual categorization,” in CVPR, 2016. 8, 16, 52, 61, 69, 75, 76, 81, 82
[38] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network
training by reducing internal covariate shift,” in ICML, 2015. 9, 51, 61, 79
[39] M. Jaderberg, K. Simonyan, A. Zisserman et al., “Spatial transformer net-
works,” in NIPS, 2015. 81
[40] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick,
S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast
feature embedding,” in ACM MM, 2014. 34, 56, 61, 70
[41] A. J. Joshi, F. Porikli, and N. Papanikolopoulos, “Multi-class active learn-
ing for image classification,” in Computer Vision and Pattern Recognition,
2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 2372–2379. 39
[42] A. Kanaujia, C. Sminchisescu, and D. Metaxas, “Semi-supervised hierar-
chical models for 3d human pose reconstruction,” in Computer Vision and
Pattern Recognition, 2007. CVPR’07. IEEE Conference on. IEEE, 2007,
pp. 1–8. 4
[43] L. Karlinsky and S. Ullman, “Using linking features in learning non-
parametric part models,” in ECCV, 2012. 19
[44] A. Khosla, N. Jayadevaprakash, B. Yao, and F.-F. Li, “Novel dataset for
fine-grained image categorization: Stanford dogs,” in Proc. CVPR Work-
shop on Fine-Grained Visual Categorization (FGVC), 2011. 10, 49
[45] M. Kostinger, P. Wohlhart, P. M. Roth, and H. Bischof, “Annotated facial
landmarks in the wild: A large-scale, real-world database for facial land-
mark localization,” in Computer Vision Workshops (ICCV Workshops),
2011 IEEE International Conference on. IEEE, 2011, pp. 2144–2151. 1
[46] J. Krause, H. Jin, J. Yang, and L. Fei-Fei, “Fine-grained recognition without
part annotations,” in CVPR, 2015. 49, 55, 74, 81, 84
[47] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with
deep convolutional neural networks,” in NIPS, 2012, pp. 1097–1105. 9, 10,
50, 51, 55, 56, 61
[48] N. Kumar, P. N. Belhumeur, A. Biswas, D. W. Jacobs, W. J. Kress, I. C.
Lopez, and J. V. Soares, “Leafsnap: A computer vision system for auto-
matic plant species identification,” in ECCV, 2012. 10, 50
[49] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning
applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11,
pp. 2278–2324, 1998. 9
[50] D. Lin, X. Shen, C. Lu, and J. Jia, “Deep lac: Deep localization, alignment
and classification for fine-grained recognition,” in CVPR, 2015. 52, 53, 69,
73, 74, 81
[51] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan,
P. Dollar, and C. L. Zitnick, “Microsoft coco: Common objects in con-
text,” in ECCV, 2014. 29, 35
[52] T.-Y. Lin, A. RoyChowdhury, and S. Maji, “Bilinear cnn models for fine-
grained visual recognition,” in ICCV, 2015. x, 10, 50, 51, 52, 55, 73, 74,
80, 81
[53] Z. Lin, G. Hua, and L. S. Davis, “Multiple instance feature for robust
part-based object detection,” in Computer Vision and Pattern Recognition,
2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 405–412. 2, 8
[54] J. Liu and P. N. Belhumeur, “Bird part localization using exemplar-based
models with enforced pose and subcategory consistency,” in ICCV, 2013.
18, 25, 31, 52
[55] J. Liu, A. Kanazawa, D. Jacobs, and P. Belhumeur, “Dog breed classifica-
tion using part localization,” in European Conference on Computer Vision.
Springer, 2012, pp. 172–185. 8
[56] J. Liu, Y. Li, and P. N. Belhumeur, “Part-pair representation for part
localization,” in ECCV, 2014. 1, 16, 18, 25, 31, 52
[57] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for
semantic segmentation,” in ECCV, 2015. 19, 36, 53, 57, 62
[58] J. L. Long, N. Zhang, and T. Darrell, “Do convnets learn correspondence?”
in NIPS, 2014. 70, 74
[59] D. G. Lowe, “Object recognition from local scale-invariant features,” in
ICCV, 1999. 4
[60] M. H. Kiapour, X. Han, S. Lazebnik, A. C. Berg, and T. L. Berg, “Where
to buy it: Matching street clothing photos in online shops,” in ICCV, 2015.
84
[61] S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi, “Fine-grained
visual classification of aircraft,” arXiv preprint arXiv:1306.5151, 2013. 10,
11, 49, 50, 55
[62] S. Maji and G. Shakhnarovich, “Part and attribute discovery from relative
annotations,” IJCV, vol. 108, no. 1-2, pp. 82–96, 2014. 3, 10, 50, 52, 55
[63] O. Matan, C. J. Burges, Y. Le Cun, and J. S. Denker, “Multi-digit recog-
nition using a space displacement neural network,” 1995. 57
[64] I. Matthews and S. Baker, “Active appearance models revisited,” IJCV,
vol. 60, no. 2, pp. 135–164, 2004. 52
[65] S. Milborrow and F. Nicolls, “Locating facial features with an extended
active shape model,” in ECCV, 2008. 52
[66] R. Navaratnam, A. W. Fitzgibbon, and R. Cipolla, “The joint manifold
model for semi-supervised multi-valued regression,” in Computer Vision,
2007. ICCV 2007. IEEE 11th International Conference on. IEEE, 2007,
pp. 1–8. 4
[67] A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for human
pose estimation,” in CVPR, 2016. 4, 16, 19, 38
[68] M.-E. Nilsback and A. Zisserman, “Automated flower classification over
a large number of classes,” in Computer Vision, Graphics & Image Pro-
cessing, 2008. ICVGIP’08. Sixth Indian Conference on. IEEE, 2008, pp.
722–729. 10, 49
[69] D. Novotny, D. Larlus, and A. Vedaldi, “I have seen enough: Transfer-
ring parts across categories,” in Proceedings of the British Machine Vision
Conference (BMVC), 2016. 36
[70] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler,
and K. Murphy, “Towards accurate multi-person pose estimation in the
wild,” in CVPR, 2017. 16, 19, 30, 34
[71] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. Jawahar, “Cats and dogs,”
in CVPR. IEEE, 2012, pp. 3498–3505. 8, 10, 49, 55
[72] L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele, “Poselet conditioned
pictorial structures,” in CVPR, 2013. 3, 19
[73] ——, “Strong appearance and expressive spatial models for human pose
estimation,” in ICCV, 2013. 1, 3, 16, 19
[74] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler,
and B. Schiele, “Deepcut: Joint subset partition and labeling for multi
person pose estimation,” in CVPR, 2016. 16, 19
[75] V. Ramakrishna, D. Munoz, M. Hebert, J. A. Bagnell, and Y. Sheikh, “Pose
machines: Articulated pose estimation via inference machines,” in ECCV,
2014. 1, 16, 19
[76] A. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “Cnn features off-
the-shelf: an astounding baseline for recognition,” in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition Workshops,
2014, pp. 806–813. 74
[77] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time
object detection with region proposal networks,” in NIPS, 2015. 20, 33, 62,
63
[78] E. Rosch, C. B. Mervis, W. D. Gray, D. M. Johnson, and P. Boyes-Braem,
“Basic objects in natural categories,” Cognitive psychology, vol. 8, no. 3,
pp. 382–439, 1976. 3, 10, 50
[79] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang,
A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual
recognition challenge,” IJCV, vol. 115, no. 3, pp. 211–252, 2015. 35, 79
[80] J. Sanchez, F. Perronnin, and Z. Akata, “Fisher vectors for fine-grained
visual categorization,” in CVPR, 2011. 55
[81] J. M. Saragih, S. Lucey, and J. F. Cohn, “Face alignment through subspace
constrained mean-shifts,” in ICCV, 2009, pp. 1034–1041. 52
[82] P. Schnitzspan, S. Roth, and B. Schiele, “Automatic discovery of meaningful
object parts with latent crfs,” in Computer Vision and Pattern Recognition
(CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 121–128. 2, 8
[83] G. Shakhnarovich, P. Viola, and T. Darrell, “Fast pose estimation with
parameter-sensitive hashing,” in ICCV, 2003. 4
[84] K. J. Shih, A. Mallya, S. Singh, and D. Hoiem, “Part localization using
multi-proposal consensus for fine-grained categorization,” in BMVC, 2015.
16, 19, 25, 31, 38, 52, 53, 73, 74
[85] L. Sigal, R. Memisevic, and D. J. Fleet, “Shared kernel information embed-
ding for discriminative inference,” in Computer Vision and Pattern Recogni-
tion, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 2852–2859.
4
[86] M. Simon and E. Rodner, “Neural activation constellations: Unsupervised
part model discovery with convolutional networks,” in ICCV, 2015. 39, 55,
74, 81
[87] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” in ICLR, 2015. 1, 9, 16, 55
[88] S. Singh, D. Hoiem, and D. Forsyth, “Learning a sequential search for
landmarks,” in CVPR, 2015. 16
[89] M. Stark, J. Krause, B. Pepik, D. Meger, J. J. Little, B. Schiele, and
D. Koller, “Fine-grained categorization for 3d scene understanding,” Inter-
national Journal of Robotics Research, vol. 30, no. 13, pp. 1543–1552, 2011.
10, 49
[90] M. Sun and S. Savarese, “Articulated part-based model for joint object
detection and pose estimation,” in ICCV, 2011. 19
[91] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, “Inception-v4, inception-
resnet and the impact of residual connections on learning,” arXiv preprint
arXiv:1602.07261, 2016. 9
[92] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in
CVPR, 2015. 1, 2, 9, 16, 17, 20, 23, 55
[93] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking
the inception architecture for computer vision,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–
2826. 9
[94] Y. Tian, C. L. Zitnick, and S. G. Narasimhan, “Exploring the spatial hi-
erarchy of mixture models for human pose estimation,” in ECCV, 2012.
19
[95] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler, “Joint training of a
convolutional network and a graphical model for human pose estimation,”
in NIPS, 2014. 4, 16, 19, 57
[96] S. Tong and D. Koller, “Support vector machine active learning with appli-
cations to text classification,” Journal of machine learning research, vol. 2,
no. Nov, pp. 45–66, 2001. 39
[97] A. Toshev and C. Szegedy, “Deeppose: Human pose estimation via deep
neural networks,” in CVPR, 2014. 4, 19
[98] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko, “Simultaneous deep trans-
fer across domains and tasks,” in Proceedings of the IEEE International
Conference on Computer Vision, 2015, pp. 4068–4076. 39
[99] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell, “Deep
domain confusion: Maximizing for domain invariance,” arXiv preprint
arXiv:1412.3474, 2014. 39
[100] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders, “Se-
lective search for object recognition,” IJCV, vol. 104, no. 2, pp. 154–171,
2013. 63
[101] R. Urtasun and T. Darrell, “Sparse probabilistic regression for activity-
independent human pose inference,” in Computer Vision and Pattern
Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008, pp.
1–8. 4
[102] G. Van Horn, S. Branson, R. Farrell, S. Haber, J. Barry, P. Ipeirotis, P. Per-
ona, and S. Belongie, “Building a bird recognition app and large scale
dataset with citizen scientists: The fine print in fine-grained dataset collec-
tion,” in CVPR, 2015. 10, 50
[103] A. Vedaldi, S. Mahendran, S. Tsogkas, S. Maji, R. Girshick, J. Kannala,
E. Rahtu, I. Kokkinos, M. B. Blaschko, D. Weiss et al., “Understanding
objects in detail with fine-grained attributes,” in CVPR, 2014. 49, 55
[104] C. Wah, S. Branson, P. Perona, and S. Belongie, “Multiclass recognition
and part localization with humans in the loop,” in ICCV, 2011. 8, 49
[105] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The caltech-
ucsd birds-200-2011 dataset,” 2011. 10, 11, 29, 31, 44, 49, 50, 55, 70
[106] C. Wah, G. Van Horn, S. Branson, S. Maji, P. Perona, and S. Be-
longie, “Similarity comparisons for interactive fine-grained categorization,”
in CVPR, 2014. 55
[107] D. Wang, Z. Shen, J. Shao, W. Zhang, X. Xue, and Z. Zhang, “Multiple
granularity descriptors for fine-grained categorization,” in ICCV, 2015. 10,
50
[108] J. Wang, K. Markert, and M. Everingham, “Learning models for object
recognition from natural language descriptions.” in BMVC, vol. 1, 2009,
p. 2. 8
[109] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, “Convolutional
pose machines,” in CVPR, 2016. 16, 19, 38, 53, 64
[110] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and
P. Perona, “Caltech-ucsd birds 200,” 2010. 10, 49
[111] T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang, “The application
of two-level attention models in deep convolutional neural network for fine-
grained image classification,” in CVPR, 2015. 38, 74, 81
[112] Z. Xu, S. Huang, Y. Zhang, and D. Tao, “Augmenting strong supervision
using web data for fine-grained categorization,” in ICCV, 2015. 49
[113] ——, “Webly-supervised fine-grained visual categorization via deep domain
adaptation,” IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 2016. 38
[114] J. Yang, R. Yan, and A. G. Hauptmann, “Cross-domain video concept de-
tection using adaptive svms,” in Proceedings of the 15th ACM international
conference on Multimedia. ACM, 2007, pp. 188–197. 39
[115] W. Yang, W. Ouyang, H. Li, and X. Wang, “End-to-end learning of de-
formable mixture of parts and deep convolutional neural networks for hu-
man pose estimation,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2016. 19
[116] Y. Yang and D. Ramanan, “Articulated pose estimation with flexible
mixtures-of-parts,” in CVPR, 2011. 1, 16, 19
[117] ——, “Articulated human detection with flexible mixtures of parts,”
TPAMI, vol. 35, no. 12, pp. 2878–2890, 2013. 70, 75
[118] X. Yu, F. Zhou, and M. Chandraker, “Deep deformation network for object
landmark localization,” arXiv preprint arXiv:1605.01014, 2016. 25, 31, 52,
75, 76
[119] H. Zhang, T. Xu, M. Elhoseiny, X. Huang, S. Zhang, A. Elgammal, and
D. Metaxas, “Spda-cnn: Unifying semantic part detection and abstraction
for fine-grained recognition,” in CVPR, 2016. 8, 16, 19, 38, 51, 52, 53, 81
[120] N. Zhang, J. Donahue, R. Girshick, and T. Darrell, “Part-based r-cnns for
fine-grained category detection,” in ECCV, 2014. 3, 10, 11, 18, 50, 52, 55,
57, 59, 67, 69, 73, 74, 81
[121] N. Zhang, R. Farrell, and T. Darrell, “Pose pooling kernels for sub-category
recognition,” in Computer Vision and Pattern Recognition (CVPR), 2012
IEEE Conference on. IEEE, 2012, pp. 3665–3672. 8
[122] N. Zhang, R. Farrell, F. Iandola, and T. Darrell, “Deformable part descrip-
tors for fine-grained recognition and attribute prediction,” in CVPR, 2013.
18, 38
[123] N. Zhang, M. Paluri, M. Ranzato, T. Darrell, and L. Bourdev, “Panda:
Pose aligned networks for deep attribute modeling,” in CVPR, 2014. 52,
55
[124] N. Zhang, E. Shelhamer, Y. Gao, and T. Darrell, “Fine-grained pose pre-
diction, normalization, and recognition,” arXiv preprint arXiv:1511.07063,
2015. 8, 16, 25, 31, 51, 52, 53, 55, 64, 75, 76
[125] X. Zhang, H. Xiong, W. Zhou, W. Lin, and Q. Tian, “Picking deep filter
responses for fine-grained image recognition,” in CVPR, 2016, pp. 1134–
1142. 51, 52
[126] X. Zhang, H. Xiong, W. Zhou, and Q. Tian, “Fused one-vs-all mid-level
features for fine-grained visual categorization,” in Proceedings of the ACM
International Conference on Multimedia. ACM, 2014, pp. 287–296. 3, 10,
50
[127] Y. Zhang, X.-s. Wei, J. Wu, J. Cai, J. Lu, V.-A. Nguyen, and M. N.
Do, “Weakly supervised fine-grained image categorization,” arXiv preprint
arXiv:1504.04943, 2015. 74
[128] F. Zhou, J. Brandt, and Z. Lin, “Exemplar-based graph matching for ro-
bust facial landmark localization,” in Proceedings of the IEEE International
Conference on Computer Vision, 2013, pp. 1025–1032. 1
[129] J. Zhu, X. Chen, and A. L. Yuille, “Deepm: A deep part-based model
for object detection and semantic part localization,” arXiv preprint
arXiv:1511.07131, 2015. 52, 55
[130] L. Zhu, Y. Chen, A. Yuille, and W. Freeman, “Latent hierarchical structural
learning for object detection,” in Computer Vision and Pattern Recognition
(CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 1062–1069. 2, 8
[131] X. Zhu and D. Ramanan, “Face detection, pose estimation, and land-
mark localization in the wild,” in Computer Vision and Pattern Recognition
(CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2879–2886. 1
[132] C. L. Zitnick and P. Dollar, “Edge boxes: Locating object proposals from
edges,” in ECCV, 2014. 19, 38