
International Journal of Computer Vision
https://doi.org/10.1007/s11263-018-1112-4

Leveraging Prior-Knowledge for Weakly Supervised Object Detection Under a Collaborative Self-Paced Curriculum Learning Framework

Dingwen Zhang1 · Junwei Han1 · Long Zhao1 · Deyu Meng2

Received: 24 June 2017 / Accepted: 12 August 2018
© Springer Science+Business Media, LLC, part of Springer Nature 2018

Abstract Weakly supervised object detection is an interesting yet challenging research topic in the computer vision community, which aims at learning object models to localize and detect the corresponding objects of interest only under the supervision of image-level annotation. For addressing this problem, this paper establishes a novel weakly supervised learning framework to leverage both the instance-level prior-knowledge and the image-level prior-knowledge based on a novel collaborative self-paced curriculum learning (C-SPCL) regime. Under the weak supervision, C-SPCL can leverage helpful prior-knowledge throughout the whole learning process and collaborate the instance-level confidence inference with the image-level confidence inference in a robust way. Comprehensive experiments on benchmark datasets demonstrate the superior capacity of the proposed C-SPCL regime and the proposed whole framework as compared with state-of-the-art methods along this research line.

Keywords Weakly supervised learning · Object detection · Self-paced learning

1 Introduction

Object detection, the task of finding objects of interest in images via assigning labels to the bounding-box regions, is one of the most fundamental yet challenging tasks in the computer vision community (Han et al. 2018b). One of the most recent breakthroughs on this task was achieved by Girshick et al. (2014), who trained a Convolutional Neural Network (CNN) on a large number of human-labelled objects in bounding-boxes to learn powerful feature representations and object classifiers. Despite their success, the problem of object detection is still under-addressed in practice due to

Communicated by Jakob Verbeek.

A preliminary version of this work appeared at IJCAI (Zhang et al. 2016).

✉ Junwei Han
[email protected]

Deyu Meng
[email protected]

1 School of Automation, Northwestern Polytechnical University, Xi’an 710072, China

2 Institute for Information and System Sciences, Science and Ministry of Education Key Lab for Intelligent Networks and Network Security, Xi’an Jiaotong University, Xi’an, China

the heavy burden of manually labeling the training samples. Essentially, in the age of big data, humans desire more to be served by intelligent machines that are capable of automatically discovering the intrinsic patterns from cheaply and massively collected weakly labeled images, rather than to spend a lot of time and labor to manually collect a huge amount of finely demarcated objects for training object detectors. Consequently, weakly supervised object detection (WSOD) systems have been gaining more and more research attention recently.

In the WSOD task, only coarse image-level labels indicating the presence of certain categories of objects need to be provided, while the learner needs to infer accurate object locations from the weakly labelled training images and learn the corresponding object detectors jointly. Essentially, the inference of object locations and the learning of object detectors constitute a chicken-and-egg problem. The key is how to propagate the image-level supervision to the instance-level (bounding-box-level) training data. As each training image can be decomposed into thousands of bounding-boxes, propagating such weak supervision will inevitably involve large amounts of ambiguity, as each truthful training instance may be submerged in thousands of noisy training instances.

For addressing the aforementioned problems in WSOD, the existing methods usually infer the coarse object locations in the learning initialization stage first and then


Fig. 1 Illustration of the motivation of this work. Most existing approaches (as shown in the dotted line) only use the prior domain knowledge in the learning initialization stage and only consider instance-level sample selection during the learning iteration, which limits their effectiveness for addressing the WSOD problem. In this paper, we propose a novel collaborative self-paced curriculum learning regime, which can leverage both the instance-level prior-knowledge and the image-level prior-knowledge in a unified, collaborative, and robust learning framework

gradually update the object locations as well as the trained object detectors in the learning iteration stage (as shown in the gray block in Fig. 1). In both stages, helpful prior-knowledge has been introduced by the existing works to aid the weakly supervised learning process, and encouraging learning performance has been obtained. For example, some existing methods utilized helpful prior-knowledge [e.g., saliency (Siva et al. 2013), objectness (Deselaers et al. 2010), and frequent configurations (Song et al. 2014b)] in the learning initialization stage to provide useful initial training instances for learning the corresponding object detectors. Others designed constraints or regularizers based on certain prior-knowledge [e.g., the symmetry prior (Bilen et al. 2014) and the mutual exclusion prior (Bilen et al. 2014)] to prevent the learner from over-fitting to the noisy instances during the learning iteration stage.

Although prior-knowledge has been playing an important role in WSOD, effective and rational learning frameworks that leverage it for better solving the WSOD problem are still lacking. Essentially, the challenging nature of the weakly supervised learning in WSOD lies in the contradiction between the coarse-level supervision and the fine-level learning objective. In this scenario, the provided supervision is highly inadequate, and learning by only relying on such inadequate supervision would inevitably introduce huge ambiguity into the learning procedure. To overcome this core challenge, we propose a novel C-SPCL model to leverage both the instance-level prior-knowledge and the image-level

prior-knowledge for guiding an explicit confidence sample inference scheme in a collaborative, unified, and robust framework.

First, in order to minimize the learning ambiguity caused by the noisy instances and the complex images, both the instance-level confidence inference scheme and the image-level confidence inference scheme are formulated in the proposed learning regime with joint optimization, where the instance-level confidence weights are used to help infer the image-level confidence weights and vice versa. This leads to the collaborative learning framework. Specifically, at the instance level, C-SPCL is able to select a relatively small number of confident training instances from a large portion of noisy ones (see Fig. 1c), enabling the instance screening capacity. On the other hand, at the image level, C-SPCL aims to assign confidence values to training images according to their complexity (and therefore their ambiguity), such that images with fewer objects and simpler backgrounds are weighted more heavily than those with a larger number of objects and relatively complex backgrounds (see Fig. 1d). This enables the image weighting capacity. With the aforementioned capacities, the proposed WSOD framework can alleviate the learning ambiguity issue that naturally exists in the weakly supervised learning problem and thus better capture the faithful knowledge of the desired object category.

Second, the proposed WSOD framework makes use of the prior-knowledge throughout the entire learning procedure, which includes both the learning initialization stage and the learning iteration stage, leading to a unified learning framework. Specifically, the proposed C-SPCL model incorporates the component of curriculum learning (CL) proposed by Bengio et al. (2009), which introduces a pre-defined curriculum to guide the learning procedure of a certain learning task. Its core idea is to learn the model by starting with easier aspects of the task, and then gradually taking more complex training samples into consideration. In our approach, the CL component leverages the obtained prior-knowledge to build helpful learning curriculums. Then, the easy instances in easy images, which are indicated by the built learning curriculums, are selected to initialize the learning process, while the curriculums are also used to build the prior-knowledge regularization term in the proposed learning objective to guide the subsequent learning process.

Last but not least, although prior-knowledge can bring helpful information to the learning process, it is usually too general to fit the concrete learning cases perfectly. Consequently, the proposed learning regime is designed with the self-paced learning (SPL) component to gradually enrich its own knowledge rather than completely relying on the provided prior-knowledge. In this way, the learner can conduct reliable inference even when the prior-knowledge is misleading, which leads to a robust learning framework. Specifically, SPL was originally proposed by Kumar et al.


Fig. 2 The proposed framework for weakly supervised object detection. Given weakly labelled training images, we first decompose each image into a bag of instances (object proposals). Then, we leverage the helpful prior-knowledge to generate the instance-level learning curriculum (in purple blocks) and the image-level learning curriculum (in blue blocks), respectively, and adopt the proposed collaborative self-paced curriculum learning (C-SPCL) mechanism to infer the confidence weights of the training images and instances gradually. After the learning procedure, the labels of the object instances can be used to annotate the objects, i.e., generating pseudo ground-truth localization, in the training images. Finally, the objects in the test images can be detected by training the CNN-based object detectors based on the pseudo ground-truth annotation

(2010), with the goal to discover truthful knowledge by itself based on the knowledge it has already learnt during the learning iterations. In our model, this capability is implemented by introducing the corresponding self-paced regularizer to help infer the confidence weights of each training sample (both the training instances and the training images) in the learning objective.

To adopt C-SPCL for WSOD, we establish a powerful framework as shown in Fig. 2. Specifically, it starts from the confident/easy images and instances determined by the prior-knowledge of the image tag complexity and the object prior, respectively, which also constitute the image-level and instance-level learning curriculums for the subsequent learning process. During the learning procedure of C-SPCL, both the instance-level and image-level sample confidence are inferred to screen training instances and weight training images, respectively, under the consideration of the helpful prior-knowledge. Then, stronger object detectors can be obtained along with the learning iterations. After the learning procedure, the learner localizes the objects of interest in all training images to generate the pseudo ground-truth annotation. Then, like some recently proposed WSOD approaches, e.g., Kumar Singh et al. (2016), we train convolutional neural networks (CNN) based on the pseudo ground-truth annotations to build stronger object detectors and finally apply the obtained detectors in the test procedure.

To sum up, the contributions of this paper are mainly four-fold:

– We make the earliest effort to collaborate the instance-level confidence inference and the image-level confidence inference with a joint optimization process for alleviating the learning ambiguity that exists in the weakly supervised learning procedure of WSOD.

– A new way of making better use of the acquirable prior-knowledge (the instance-level prior-knowledge and the image-level prior-knowledge) for WSOD is proposed, which builds helpful learning curriculums based on the prior-knowledge to guide the confidence inference regime throughout the entire learning procedure.

– For pursuing a robust weakly supervised learning scheme, self-paced learning mechanisms (at both the instance level and the image level) are also embedded in the proposed learning framework to assist the confidence inference when the learning curriculums built on prior-knowledge are inaccurate or misleading.

– We propose a novel C-SPCL model to realize such a collaborative, unified, and robust weakly supervised learning framework via a concise but effective formulation. Comprehensive experiments on widely used benchmarks have demonstrated the rationality of each of the considered components and the superiority of the entire approach as compared with state-of-the-art methods along this research line.

The work in this paper is a substantial extension of our preliminary study in Zhang et al. (2016). Compared with Zhang et al. (2016), the major differences in this paper include: (1) We propose a different and more powerful weakly supervised learning mechanism, i.e., the collaborative self-paced curriculum learning. It can additionally infer the sample confidence at the image level, collaborate the image-level confidence and instance-level confidence, and explicitly formulate three different kinds of helpful prior-knowledge in its objective. (2) We establish an upgraded framework for learning weakly supervised object detectors by additionally exploring effective ways to generate the image-level learning curriculum and the instance-level learning curriculum


as well as using a more advanced deep network. (3) Comprehensive experiments on more benchmark datasets, with more comparison methods and more detailed ablation studies, have been conducted to demonstrate the effectiveness and the corresponding rationality of the proposed approach. Notably, the proposed approach can obtain more than 15% performance gain as compared with Zhang et al. (2016).

The rest of this paper is organized as follows. Section 2 introduces related works. Section 3 presents the proposed C-SPCL model. Section 4 describes the details of the proposed WSOD framework. Section 5 shows experimental results to substantiate the effectiveness of the proposed method and further analyze the factors considered in our approach. Finally, conclusions are drawn in Sect. 6.

2 Related Works

2.1 Weakly Supervised Object Detection

Weakly supervised object detection is an interesting yet challenging task in the computer vision community, which has been studied for more than 10 years. The early research on WSOD was mainly implemented on some easy datasets containing objects occupying a large portion of the image, which has been summarized in Deselaers et al. (2012). In this section, we mainly focus on introducing the progress on WSOD in the recent decade, which has facilitated effective systems even on the challenging PASCAL VOC datasets.

The first wave of the recent research on WSOD can be traced back to about 8 years ago, when the mainstream mainly focused on ways to provide satisfactory initialization of the learning framework. Specifically, Galleguillos et al. (2012) proposed to extract stable segmentations to increase the chances of extracting meaningful objects for multiple-instance learning (MIL). Siva and Xiang (2011) proposed to introduce objectness, intra-image similarity, and inter-image variance into the instance initialization stage, and further emphasized the importance of the inter-class variance in Siva et al. (2012). Afterwards, Siva et al. (2013) also developed an unsupervised saliency detection method and extracted the initial training samples from the generated saliency maps. Wang et al. (2014b) and Song et al. (2014b) proposed to discover the latent semantics and the frequent configurations of discriminative visual patterns in their initialization stages, respectively. These works have demonstrated the importance of exploring prior-knowledge for training sample initialization. However, how to explore such prior-knowledge to explicitly guide the updating of the object detectors and locations during the subsequent iterative learning stage remains under-studied.

In order to improve the optimization procedure for updating the object detectors and the corresponding object locations,

researchers started the second wave of study on WSOD in the recent 5 years. Specifically, Deselaers et al. (2012, 2010) inferred one object hypothesis per image by optimizing the energy of the proposed conditional random field (CRF) defined globally over all training images. They used the objectness scores (Alexe et al. 2010) to provide helpful prior-knowledge throughout the entire learning procedure. However, due to the lack of effective instance screening and image weighting capacities, this method cannot work well in relatively challenging scenarios. Bilen et al. (2014) proposed to use prior-knowledge to drive the latent variables (i.e., the instance locations) in the proposed Latent Structural SVM by means of posterior regularization. The specific prior-knowledge used by them is different from that used in our approach and, more importantly, the prior-knowledge used in our approach is to build learning curriculums for guiding a novel confidence-weighting based weakly supervised learning scheme. Furthermore, Gokberk Cinbis et al. (2014) proposed a multi-fold MIL procedure, which can avoid rapid convergence to poor local optima and handle high-dimensional representations during iterations. More recently, Bilen et al. (2015) proposed to couple a smooth discriminative learning procedure with a convex clustering algorithm, which enforced the local similarity of the selected instances during optimization. Shi et al. (2015) exploited the knowledge between co-occurring object categories via a Bayesian latent topic model to learn the appearance models of multiple object categories jointly. Ren et al. (2016) proposed a bag-splitting algorithm that iteratively generates new negative bags from positive ones to reduce the learning ambiguity. Bilen and Vedaldi (2016) proposed a weakly supervised deep detection architecture that modified the conventional image-level network to operate at the level of image regions. Jie et al. (2017) designed an online supportive sample harvesting scheme to dynamically select the most confident tight positive samples and train the detector in a mutual boosting way. This wave of study has proposed many valuable considerations to improve the learning performance effectively. However, limited attempts have been made to simultaneously infer image-level confidence during the learning procedure.

2.2 Curriculum Learning and Self-Paced Learning

In curriculum learning (CL), the curriculum is usually assumed to be given by an oracle beforehand, and remains fixed throughout the subsequent learning procedure. Here, the curriculum is often defined to determine a sequence of training samples which essentially corresponds to a priority list ranked in ascending order of learning difficulty. For example, for classifying geometrical shapes, Bengio et al. (2009) utilized a ranking function based on the variability in shape. Under such a curriculum, the shapes exhibiting less variability would be learned earlier. Similarly, Khan et al. (2011) proposed


to teach a robot the concept of graspability by asking participants to assign a learning sequence of graspability for various objects based on their common sense. For approaching grammar induction, Spitkovsky et al. (2009) used the length of sentences to define the curriculum, following the heuristic that short sentences are easier to analyze and thus should be learned earlier. Chen and Gupta (2015) presented a two-step approach for utilizing weakly labelled web data to train object detectors, which is essentially a curriculum learning framework. Similarly, Shi and Ferrari (2016) proposed to build a learning curriculum for guiding the training procedure of their object detectors by automatically estimating the sizes of the objects contained in each training image. These existing frameworks might not work well (especially under weak supervision) as the learner is designed to dogmatically trust the pre-defined learning curriculum, which might not always be optimal in various cases.

Different from CL, SPL aims to generate the curriculum by the learner itself. In the past few years, the effectiveness of such a learning regime has been validated in a number of tasks (Supancic and Ramanan 2013; Tang et al. 2012; Jiang et al. 2014a; Han et al. 2017; Zhang et al. 2017a, b; Meng et al. 2017). For example, Supancic and Ramanan (2013) used the formalism of SPL to automatically learn robust appearance models in object tracking. Kumar et al. (2010) made the earliest effort to adopt the SPL regime to solve the weakly supervised learning problem. However, it lacks the capability of incorporating helpful prior-knowledge as well as some important factors, like the sample divergence and the instance-image inference interaction, for accumulating confident evidence during the learning procedure, which prevents it from obtaining good performance on challenging benchmarks like PASCAL VOC. For accommodating the hidden information of the samples into the learning procedure, Tang et al. (2012) proposed to adaptively select easy samples in each iteration to learn a powerful dictionary. In multimedia event detection, Jiang et al. (2014a) proposed self-paced re-ranking models by introducing a non-convex regularizer to select reliable training samples. For addressing the problems in co-saliency detection, Zhang et al. (2017a) combined SPL and multiple instance learning (MIL) into a unified framework to mine the common patterns of the co-occurring saliency regions.

More recently, Jiang et al. (2015) proposed a general learning paradigm, which combines the merits of both CL and SPL by constraining the parameter space of the self-paced regularizer to a small region determined by the pre-defined learning curriculum. However, this model was designed to work with fully annotated training data and thus cannot meet the requirements of the WSOD task investigated in this paper. In particular, by dogmatically constraining the parameter space based on the prior-knowledge, their method requires very precise prior-knowledge,

which, however, can hardly be obtained in our task. On the other hand, the learning regime proposed by Jiang et al. (2015) can only conduct confidence inference at one level, and thus cannot be used to collaborate the confidence inference at the image level and the instance level in our task. Consequently, the C-SPCL model proposed in this paper has distinct properties as compared to the existing work (Jiang et al. 2015), especially in terms of the collaborative and robust properties of the learning model.

3 The Collaborative Self-Paced Curriculum Learning

3.1 Problem Formulation

Our aim is to learn the object detectors of multiple object categories to localize them in the weakly-labelled training images. We formulate this problem as follows. Given $K$ images, consider the bounding-box object instances extracted in each training image as the instances to be localized, and the classifier parameters $\mathbf{W} = \{\mathbf{w}_c\}_{c=1}^{C}$, $\mathbf{b} = \{b_c\}_{c=1}^{C}$ as the object models of the corresponding $C$ categories to be learned. Accumulating all instances in the $k$-th image obtains $X_k = \{\mathbf{x}_i^{(k)}\}_{i=1}^{n_k}$, $k = 1, 2, \ldots, K$, where $\mathbf{x}_i^{(k)} \in \mathbb{R}^d$ corresponds to the feature representation of the $i$-th instance of the $k$-th image, $n_k$ is the instance number in $X_k$, and $d$ is the feature dimension. $y_{i,c}^{(k)} \in \{-1, 1\}$ and $v_{i,c}^{(k)} \in [0, 1]$ indicate the label and the confidence weight of $\mathbf{x}_i^{(k)}$, respectively, for the $c$-th object category. Here we also involve the variable $u_k \in [0, 1]$ to indicate the confidence weight of the $k$-th image in the learning regime. Thus, $\mathbf{W}$, $\mathbf{b}$, $\mathbf{y} = \{y_{i,c}^{(k)}\}$, $\mathbf{v} = \{v_{i,c}^{(k)}\}$, and $\mathbf{u} = \{u_k\}$ are all the parameters that need to be optimized during the proposed learning regime. Note that only the weak labels $\{Y_k\}$ of each training image are given, indicating whether they contain instances that belong to certain object categories.

To solve this problem, we propose a novel C-SPCL paradigm. The main idea is to establish a unified, collaborative, and robust framework that combines CL with SPL and involves the confidence inference at both the image level and the instance level to make the whole learning system work under the weak supervision. Essentially, C-SPCL tends to first distill faithful knowledge from confident instances in easy images, and then gradually adapt the learned knowledge to learn more ambiguous instances from more complex images. The learner in C-SPCL combines the helpful prior-knowledge from the learning curriculum and its own knowledge to infer sample confidence for its subsequent learning procedure. To realize this idea, we formulate the following optimization problem:


$$
\begin{aligned}
\min_{\mathbf{W},\mathbf{b},\mathbf{y},\mathbf{v},\mathbf{u}} \; & E(\mathbf{W},\mathbf{b},\mathbf{y},\mathbf{v},\mathbf{u}) = E_{ins}(\mathbf{W},\mathbf{b},\mathbf{y},\mathbf{v}) + E_{ima}(\mathbf{W},\mathbf{b},\mathbf{y},\mathbf{v},\mathbf{u}) + E_{prior}(\mathbf{v},\mathbf{u}), \\
\text{s.t.} \;\; & y_{i,c}^{(k)} \in \{-1,1\}, \quad \sum_{c=1}^{C} \big|y_{i,c}^{(k)} + 1\big| \le 2, \\
& \sum_{i=1}^{n_k} \big|y_{i,c^*}^{(k)} + 1\big| \ge 2 \;\; \text{if } c^* \in Y_k,
\end{aligned} \tag{1}
$$

where $E_{ins}$, $E_{ima}$, and $E_{prior}$ are the instance-level self-paced learning term, the image-level self-paced learning term, and the prior-knowledge regularization term, respectively. More explanations on these terms will be given in detail later. Three constraints are imposed on the labels $\mathbf{y}$. The first one constrains the label to be binary in each sub-classification problem. The second one, i.e., $\sum_{c=1}^{C} |y_{i,c}^{(k)} + 1| \le 2$, enforces each instance to belong to at most one object category, or to no class, i.e., the background category. This constraint inherently penalizes the indiscriminative object instances, i.e., the instances predicted to belong to multiple object categories, when calculating their confidence weights. The third one, i.e., $\sum_{i=1}^{n_k} |y_{i,c^*}^{(k)} + 1| \ge 2$, means that among all object instances located in the $k$-th image, at least one should belong to each object category $c^*$ weakly annotated in $Y_k$. This makes the learned result comply more finely with the standard weakly supervised learning paradigm.

3.2 The Instance-Level Self-Paced Learning Term

In (1), the instance-level self-paced learning term is defined as:

$$E_{ins}(\mathbf{W},\mathbf{b},\mathbf{y},\mathbf{v}) = L_{ins}(\mathbf{W},\mathbf{b},\mathbf{y},\mathbf{v}) + F(\mathbf{v}; \lambda), \tag{2}$$

where $L_{ins}(\mathbf{W},\mathbf{b},\mathbf{y},\mathbf{v})$ and $F(\mathbf{v}; \lambda)$ are the instance-level weighted loss term and the instance-level self-paced regularizer, respectively, and $\lambda = \{\lambda_c\}_{c=1}^{C}$ are the class-specific parameters imposed on the regularization terms in $F(\mathbf{v}; \lambda)$. Specifically, the instance-level weighted loss term is defined as:

$$
\begin{aligned}
L_{ins}(\mathbf{W},\mathbf{b},\mathbf{y},\mathbf{v}) &= \sum_{c=1}^{C} \sum_{k=1}^{K} \sum_{i=1}^{n_k} v_{i,c}^{(k)} \, \ell\big(y_{i,c}^{(k)}, f(\mathbf{x}_i^{(k)}; \mathbf{w}_c, b_c)\big), \\
\ell\big(y_{i,c}^{(k)}, f(\mathbf{x}_i^{(k)}; \mathbf{w}_c, b_c)\big) &= \Big(1 - y_{i,c}^{(k)} \big(\mathbf{w}_c^T \mathbf{x}_i^{(k)} + b_c\big)\Big)_{+},
\end{aligned} \tag{3}
$$

which is essentially the sum of the hinge losses of the instances in the $C$ sub-classification problems, weighted by the corresponding confidence weights. The instance screening capability of the proposed learning regime is realized by the involvement of $F(\mathbf{v}; \lambda)$ with the following form:

$$F(\mathbf{v}; \lambda) = -\sum_{c=1}^{C} \lambda_c \sum_{k=1}^{K} \sum_{i=1}^{n_k} v_{i,c}^{(k)}. \tag{4}$$

Such a negative $l_1$-norm term is inherited from the conventional SPL (Kumar et al. 2010), which favors selecting easy over complex instances. Thus, we call it the easiness term. With this term, the regularizer $F(\mathbf{v}; \lambda)$ would assign either 1 or 0 (i.e., being selected for training or not) to the weight $v_{i,c}^{(k)}$ imposed on instance $\mathbf{x}_i^{(k)}$, by judging whether its loss value is smaller than the pace parameter $\lambda_c$ or not.¹ That is, a sample with a smaller loss is taken as an easy sample and thus should be learned preferentially, and vice versa, which naturally realizes the instance screening.
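As a concrete illustration, the following is a minimal sketch (our own, not from the paper) of the binary weight update implied by the easiness term alone: when only Eq. (4) is active, minimizing the weighted loss plus $F(\mathbf{v}; \lambda)$ over $v \in [0, 1]$ yields a hard threshold at the pace parameter.

```python
import numpy as np

# Minimal sketch (assumption: only the easiness term of Eq. (4) is active).
# Minimizing v * loss - lambda_c * v over v in [0, 1] gives the classical
# SPL hard-threshold rule: select an instance iff its loss < lambda_c.
def spl_weights(losses, lam_c):
    return (np.asarray(losses) < lam_c).astype(float)

print(spl_weights([0.1, 0.5, 0.9], lam_c=0.6))  # -> [1. 1. 0.]
```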

3.3 The Image-Level Self-Paced Learning Term

The image-level self-paced learning term is defined as:

$$E_{ima}(\mathbf{W},\mathbf{b},\mathbf{y},\mathbf{v},\mathbf{u}) = L_{ima}(\mathbf{W},\mathbf{b},\mathbf{y},\mathbf{v},\mathbf{u}) + H(\mathbf{u}; \eta), \tag{5}$$

where $L_{ima}(\mathbf{W},\mathbf{b},\mathbf{y},\mathbf{v},\mathbf{u})$ and $H(\mathbf{u}; \eta)$ are the image-level weighted loss term and the image-level self-paced regularizer, respectively, and $\eta$ is the class-independent parameter imposed on the regularization term $H(\mathbf{u}; \eta)$. Specifically, the image-level weighted loss term is defined as:

$$L_{ima}(\mathbf{W},\mathbf{b},\mathbf{y},\mathbf{v},\mathbf{u}) = \sum_{k=1}^{K} u_k \cdot \varepsilon\big(X_k, \mathbf{y}^{(k)}, \mathbf{W}, \mathbf{b}, \mathbf{v}\big), \tag{6}$$

where $\varepsilon(X_k, \mathbf{y}^{(k)}, \mathbf{W}, \mathbf{b}, \mathbf{v})$ indicates the prediction error for each training image. In this paper, we define it as the average loss of the instances residing in a certain image:

$$\varepsilon\big(X_k, \mathbf{y}^{(k)}, \mathbf{W}, \mathbf{b}, \mathbf{v}\big) = \frac{1}{n_k} \sum_{c=1}^{C} \sum_{i=1}^{n_k} v_{i,c}^{(k)} \, \ell\big(y_{i,c}^{(k)}, f(\mathbf{x}_i^{(k)}; \mathbf{w}_c, b_c)\big). \tag{7}$$

To realize the image weighting capability of the proposed learning regime to obtain the image-level confidence weights, we propose to apply the image-level self-paced regularizer with the following form:

$$H(\mathbf{u}; \eta) = \eta \left( \frac{1}{2} \|\mathbf{u}\|_2^2 - \sum_{k=1}^{K} u_k \right), \tag{8}$$

¹ The instance-level confidence values finally inferred by the proposed approach are real numbers ranging from 0 to 1, as prior-knowledge terms are also involved in the optimization procedure.


which is essentially the linear soft self-paced regularizer adopted in Jiang et al. (2014a). It assigns linear soft confidence weights to the training images with respect to the image-level weighted loss, i.e., the training images with less image-level weighted loss are encouraged to be selected as the more confident ones and vice versa.
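To make the effect of Eq. (8) concrete, here is a minimal sketch (our own illustration, not from the paper): with only the image-level terms active, minimizing $u_k \cdot \varepsilon_k + \eta(u_k^2/2 - u_k)$ over $u_k \in [0, 1]$ has the closed-form soft weighting below.

```python
import numpy as np

# Minimal sketch (assumption: only Eq. (8) and the image-level loss are
# active). Setting the derivative eps_k + eta*u_k - eta to zero gives
# u_k = 1 - eps_k/eta, clipped to [0, 1]: smaller loss -> higher confidence.
def image_weights(image_losses, eta):
    return np.clip(1.0 - np.asarray(image_losses) / eta, 0.0, 1.0)

print(image_weights([0.2, 1.0, 2.5], eta=2.0))  # -> [0.9 0.5 0. ]
```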

Our image-level learning curriculum is inspired by Chen and Gupta (2015) but differs from it significantly. Specifically, Chen and Gupta (2015) introduced a fixed image-level learning curriculum into the learning procedure, which was defined based on the sources of the online visual data. On the contrary, our learning mechanism explicitly models the image-level sample confidence by using $E_{ima}$ as well as the image-level curriculum term $P_{imgC}$ (as introduced in Sect. 4.1). The real-valued confidence weights of the training images are inferred dynamically along with the learning iterations.

3.4 The Prior-Knowledge Regularization Term

The prior-knowledge regularization term involved in (1) is defined as:

$$E_{prior}(\mathbf{v},\mathbf{u}) = P_{insD}(\mathbf{v}; \lambda) + P_{insC}(\mathbf{v},\mathbf{u}; \lambda, S) + P_{imgC}(\mathbf{u}; \eta, \mathbf{g}), \tag{9}$$

where

$$
\begin{aligned}
P_{insD}(\mathbf{v}; \lambda) &= -\sum_{c=1}^{C} \lambda_c \sum_{k=1}^{K} \sqrt{\sum_{i=1}^{n_k} v_{i,c}^{(k)}}, \\
P_{insC}(\mathbf{v},\mathbf{u}; \lambda, S) &= -\sum_{c=1}^{C} \lambda_c \sum_{k=1}^{K} \sum_{i=1}^{n_k} v_{i,c}^{(k)} s_{i,c}^{(k)} u_k, \\
P_{imgC}(\mathbf{u}; \eta, \mathbf{g}) &= -\eta \sum_{k=1}^{K} g_k u_k,
\end{aligned} \tag{10}
$$

are the instance diversity regularizer, the instance-level curriculum regularizer, and the image-level curriculum regularizer, respectively. $S = \{s_{i,c}^{(k)}\}$ indicates the priority values of each instance in the instance-level curriculum, and $\mathbf{g} = \{g_k\}$ indicates the priority values of each image in the image-level curriculum.

The instance diversity regularizer introduces the prior-knowledge of sample diversity into the inference of the instance-level confidence weights. As shown in Fig. 3, in object detection, an identical object is often contained in multiple overlapping bounding-box object proposals. These proposals may have similar feature representations and obtain similar prediction scores. Thus, selecting training samples without considering sample diversity would involve such redundant samples in learning and thereby limit the effectiveness of the learnt object detectors.

Fig. 3 Examples showing the reasonability of considering sample diversity in the instance-level confidence inference. As shown in the second row, selecting samples without using the instance diversity term would introduce more redundancy into the selected training instances

To this end, we adopt the $l_{0.5,1}$ norm regularizer, which favors selecting diverse instances residing in more images. This property can be understood as follows: its negative leads to a group-wise sparse representation of $\mathbf{v}$, so $P_{insD}(\mathbf{v}; \lambda)$ has the counter-effect to the group-wise sparsity. That is, minimizing this diversity term tends to disperse the non-zero elements of $\mathbf{v}$ over more groups. As in our formulation each image is considered as an individual group, this term favors selecting diverse samples from different image scenes. Notice that, different from the $l_{2,1}$ norm utilized in Jiang et al. (2014b), the group-sparsity term used in this paper is concave, leading to the convexity of its negative. This on one side simplifies the design of the solving strategy, and on the other side fits well with the previous axiomatic definition of the SPL regularizer (Jiang et al. 2014a; Zhang et al. 2017a).
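The dispersion effect can be checked numerically; the following is a minimal sketch (our own illustration) comparing a concentrated selection with a dispersed one under the square-root group term of $P_{insD}$.

```python
import numpy as np

# Minimal sketch: with the same total weight (4 selected instances),
# spreading selections across two images yields a larger sum of square
# roots and hence a lower (more negative) diversity penalty.
def p_insD(v_per_image, lam=1.0):
    return -lam * sum(np.sqrt(v.sum()) for v in v_per_image)

concentrated = [np.ones(4), np.zeros(4)]                 # all picks in one image
dispersed = [np.array([1., 1., 0., 0.]), np.array([1., 1., 0., 0.])]
print(p_insD(concentrated), p_insD(dispersed))           # -2.0  -2.828...
```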

The instance-level curriculum regularizer collaborates the instance-level learning priority and the image-level learning priority to guide the inference of the instance-level sample confidence. As can be seen, it has the following two properties. (1) In each training image, the instances assigned higher instance-level learning priority values are encouraged to have larger confidence weights. By using this term, the learner can take account of the instance-level learning curriculum to infer the instance-level sample confidence without strictly constraining the parameter space of the self-paced regularizer, leading to more robust learning performance when the prior-knowledge does not fit the practical situations well. (2) The instances in more confident training images, i.e., images with larger $u_k$, are encouraged to have larger confidence weights as compared with the instances residing in less confident training images. The intuition is that confident images contain relatively easy image contents, and thus the instances residing in them will be encountered with less learning ambiguity.

The image-level curriculum regularizer introduces the image-level learning priority to guide the inference of the image-level sample confidence. From Fig. 4a, we can observe


Fig. 4 Examples illustrating the generated image-level learning curriculum and instance-level learning curriculum, respectively. From (a) we can observe that the image-level priority generated by the image tag complexity (as introduced in Sect. 4) can, to some extent, reflect the complexity of the image content. From (b), we can observe that the instance-level priority can reflect the probability of the corresponding image regions being the object of interest. Notice that in the images with small image-level priority [the left example in (b)], even the instances with high instance-level priority may not perfectly localize the object. Thus, we introduce the interaction term of the image-level priority and the instance-level priority as in (10)

that the images with smaller image-level learning priority values tend to contain more complex image contents, i.e., more categories of objects located in various image regions, and thus carry higher ambiguity for learning the object detectors. Similar to the instance-level curriculum regularizer, the learner using this term can take account of the learning curriculum established based on the acquirable image-level prior-knowledge to guide a relatively robust inference of the image-level sample confidence.

3.5 Optimization Strategy

The solution to (1) can be approximately attained via the alternating search strategy, which alternately optimizes the involved parameters $\mathbf{W}$, $\mathbf{b}$, $\mathbf{y}$, $\mathbf{v}$ and $\mathbf{u}$ as described in Algorithm 1. After initializing $\mathbf{y}$, $\mathbf{v}$ and $\mathbf{u}$ based on the task-related domain knowledge as described in Sect. 4, our optimization strategy mainly contains the following steps:

Optimize W, b under fixed y, v and u This step aims to update the classifiers for detecting the objects belonging to the corresponding object categories. In this case, (1) degenerates to the following form:

$$\min_{\mathbf{W},\mathbf{b}} \sum_{c=1}^{C} \left( \sum_{k=1}^{K} \sum_{i=1}^{n_k} \left(1 + \frac{u_k}{n_k}\right) v_{i,c}^{(k)} \, \ell\big(y_{i,c}^{(k)}, f(\mathbf{x}_i^{(k)}; \mathbf{w}_c, b_c)\big) \right), \tag{11}$$

which can be equivalently reformulated as solving the following sub-optimization problem for each $c \in [1, C]$:

$$\min_{\mathbf{w}_c, b_c} \sum_{k=1}^{K} \sum_{i=1}^{n_k} \left(1 + \frac{u_k}{n_k}\right) v_{i,c}^{(k)} \, \ell\big(y_{i,c}^{(k)}, f(\mathbf{x}_i^{(k)}; \mathbf{w}_c, b_c)\big). \tag{12}$$

Algorithm 1: The C-SPCL algorithm.
input: Training images with weak labels, extracted instances $X$, instance-level curriculum $S$, image-level curriculum $\mathbf{g}$, and model parameters $\lambda$, $\eta$;
output: Object detectors $\{\mathbf{W},\mathbf{b}\}$, instance labels $\mathbf{y}$;
1 Initialize pseudo instance labels $\mathbf{y}$, instance confidence weights $\mathbf{v}$, and image confidence weights $\mathbf{u}$;
2 while not converged do
3   Update $\{\mathbf{W},\mathbf{b}\}$ via one-vs-all weighted SVM;
4   Update $\mathbf{y}$ via Algorithm 2;
5   Update $\mathbf{u}$ via optimizing Equation (14);
6   Update $\mathbf{v}$ via optimizing Equation (16);
7 return $\{\mathbf{W},\mathbf{b}\}$ and $\mathbf{y}$.

This is a standard one-vs-all (weighted) SVM model (Yang et al. 2007).
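A minimal sketch of this step is shown below; it assumes the instances, pseudo-labels, and weights $(1 + u_k/n_k)\,v_{i,c}^{(k)}$ for one category $c$ have already been flattened into arrays (all variable names are our own, and the data here is synthetic, for illustration only).

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_flat = rng.normal(size=(100, 16))     # instance features (synthetic)
y_flat = rng.choice([-1, 1], size=100)  # pseudo instance labels (synthetic)
weights = rng.random(100)               # (1 + u_k/n_k) * v, per instance

# Weighted hinge-loss SVM for one category, as in Eq. (12).
clf = LinearSVC(loss="hinge", C=1.0)
clf.fit(X_flat, y_flat, sample_weight=weights)
w_c, b_c = clf.coef_.ravel(), float(clf.intercept_[0])
```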

Optimize y under fixed W, b, v and u The goal of this step is to learn the pseudo-labels of the training instances from the current object detectors. The model in this case can be reformulated as:

$$
\begin{aligned}
\min_{\mathbf{y}} \; & \sum_{c=1}^{C} \left( \sum_{k=1}^{K} \sum_{i=1}^{n_k} \left(1 + \frac{u_k}{n_k}\right) v_{i,c}^{(k)} \, \ell\big(y_{i,c}^{(k)}, f(\mathbf{x}_i^{(k)}; \mathbf{w}_c, b_c)\big) \right) \\
\text{s.t.} \;\; & y_{i,c}^{(k)} \in \{-1,1\}, \quad \sum_{c=1}^{C} \big|y_{i,c}^{(k)} + 1\big| \le 2, \\
& \sum_{i=1}^{n_k} \big|y_{i,c^*}^{(k)} + 1\big| \ge 2 \;\; \text{if } c^* \in Y_k.
\end{aligned} \tag{13}
$$

This problem can be equivalently decomposed into sub-problems with respect to each $k = 1, \ldots, K$. For each image, the global optimum of $\mathbf{y}^{(k)}$ can be attained by Algorithm 2, which can be derived from the theorem in Zhang et al. (2017a).


Optimize u under fixed y, v and W, b After updating the object detectors and the instance labels, we renew the weights on all training images to reflect their different confidence for identifying the objects within them. In this case, (1) degenerates to the following form:

$$
\begin{aligned}
\min_{\mathbf{u}} \; & -\sum_{k=1}^{K} \left( u_k \sum_{c=1}^{C} \sum_{i=1}^{n_k} \lambda_c v_{i,c}^{(k)} s_{i,c}^{(k)} \right) + \eta \left( \frac{1}{2} \|\mathbf{u}\|_2^2 - \sum_{k=1}^{K} (1 + g_k) u_k \right) \\
& + \sum_{k=1}^{K} \left( \frac{u_k}{n_k} \sum_{c=1}^{C} \sum_{i=1}^{n_k} v_{i,c}^{(k)} \, \ell\big(y_{i,c}^{(k)}, f(\mathbf{x}_i^{(k)}; \mathbf{w}_c, b_c)\big) \right),
\end{aligned} \tag{14}
$$

which is convex and thus can be efficiently solved by using some off-the-shelf optimization techniques, e.g., the CVX toolbox, to finely approach its global solution.
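As an illustration, here is a minimal sketch of the u-subproblem using the CVXPY library in place of the CVX toolbox; the per-image aggregates (`prior`, `loss`) and the box constraint $u_k \in [0, 1]$ are our own assumptions for the demo, with synthetic values.

```python
import numpy as np
import cvxpy as cp

K, eta = 5, 2.0
rng = np.random.default_rng(0)
prior = rng.random(K)  # sum_c sum_i lambda_c * v * s, per image (synthetic)
loss = rng.random(K)   # (1/n_k) sum_c sum_i v * hinge loss, per image (synthetic)
g = np.exp(-rng.integers(1, 4, K))  # image-level curriculum priorities

u = cp.Variable(K)
objective = cp.Minimize(
    -prior @ u + eta * (0.5 * cp.sum_squares(u) - (1 + g) @ u) + loss @ u
)
cp.Problem(objective, [u >= 0, u <= 1]).solve()
print(np.round(u.value, 3))
```

Since the objective is separable and quadratic in each $u_k$, the same solution is also available in closed form as $u_k = \min\big(\max\big((\text{prior}_k - \text{loss}_k)/\eta + 1 + g_k,\; 0\big),\; 1\big)$.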

Optimize v under fixed W, b, y and u After determining the image-level sample confidence, we further infer the finer instance-level sample weights to reflect their different confidence in learning the current decision surface. In this case, (1) degenerates to the following form:

$$
\begin{aligned}
\min_{\mathbf{v}} \sum_{c=1}^{C} \Bigg\{ & \sum_{k=1}^{K} \sum_{i=1}^{n_k} v_{i,c}^{(k)} \, \ell\big(y_{i,c}^{(k)}, f(\mathbf{x}_i^{(k)}; \mathbf{w}_c, b_c)\big) \\
& - \lambda_c \Bigg[ \sum_{k=1}^{K} \sum_{i=1}^{n_k} v_{i,c}^{(k)} + \sum_{k=1}^{K} \sqrt{\sum_{i=1}^{n_k} v_{i,c}^{(k)}} + \sum_{k=1}^{K} \sum_{i=1}^{n_k} v_{i,c}^{(k)} s_{i,c}^{(k)} u_k \Bigg] \Bigg\},
\end{aligned} \tag{15}
$$

which is equivalent to independently solving the following sub-optimization problem for each $k = 1, \ldots, K$ and $c = 1, \ldots, C$:

$$
\begin{aligned}
\min_{v_{i,c}^{(k)} \in [0,1],\, i=1,\ldots,n_k} \; & \sum_{i=1}^{n_k} v_{i,c}^{(k)} \, \ell\big(y_{i,c}^{(k)}, f(\mathbf{x}_i^{(k)}; \mathbf{w}_c, b_c)\big) \\
& - \lambda_c \left( \sum_{i=1}^{n_k} v_{i,c}^{(k)} + \sqrt{\sum_{i=1}^{n_k} v_{i,c}^{(k)}} + u_k \sum_{i=1}^{n_k} v_{i,c}^{(k)} s_{i,c}^{(k)} \right),
\end{aligned} \tag{16}
$$

which is also convex and thus can be efficiently solved by utilizing the CVX toolbox to finely approach its global solution.
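Analogously, a minimal CVXPY sketch of one $(k, c)$ v-subproblem might look as follows; the hinge losses, priorities, and parameter values are synthetic placeholders.

```python
import numpy as np
import cvxpy as cp

n_k, u_k, lam_c = 8, 0.7, 0.4
rng = np.random.default_rng(1)
loss = rng.random(n_k)  # hinge losses of the n_k instances (synthetic)
s = rng.random(n_k)     # instance-level curriculum priorities (synthetic)

v = cp.Variable(n_k)
# Negating the concave sqrt group term yields a convex objective.
objective = cp.Minimize(
    loss @ v - lam_c * (cp.sum(v) + cp.sqrt(cp.sum(v)) + u_k * (s @ v))
)
cp.Problem(objective, [v >= 0, v <= 1]).solve()
print(np.round(v.value, 3))
```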

Ultimately, the whole alternating search process can be summarized as in Algorithm 1. According to Kumar et al. (2010), such an alternating search algorithm converges, as the objective function $E(\mathbf{W},\mathbf{b},\mathbf{y},\mathbf{v},\mathbf{u})$ is monotonically decreasing and bounded from below.

4 Weakly Supervised Object Detection Based on C-SPCL

Given weakly labelled training images, we first decompose each image into a bag of instances by using EdgeBoxes (Zitnick and Dollár 2014). Then, we design the following steps to implement weakly supervised object detection.

4.1 Generating Image-Level Learning Curriculum

As mentioned before, a learning curriculum determines a sequence of learning samples, which essentially corresponds to a ranking of learning priority in descending order. Thus, in this paper, we generate the image-level learning curriculum to reflect the learning priority of each training image. Different from Chen's approach (Chen and Gupta 2015), which defined the learning curriculum based on prior knowledge of the cleanness of the visual data from different search engines, we propose to define the learning curriculum by counting the number of tags of each image, i.e., $g_k = \exp\{-|Y_k|\}$, where $|Y_k|$ indicates the number of tags of the $k$-th image. This is based on the fact that images containing multiple object categories have larger ambiguity during the learning procedure and thus should be considered as more difficult training samples. To this end, only the images containing one object category are considered as the easy ones to start the learning procedure of the proposed approach. Notice that the defined learning curriculum can only provide a coarse priority for the training images due to the limited or even inaccurate understanding of the data. However, it still benefits the learning procedure, as the learning curriculum is formulated as a regularization term to guide the subsequent learning procedure rather than as a constraint that strictly restricts it, which leads to robust learning performance when the prior-knowledge does not fit the practical situations well.
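A minimal sketch of this curriculum (our own illustration) is simply:

```python
import math

# g_k = exp(-|Y_k|): fewer image-level tags -> higher learning priority.
image_tags = [{"dog"}, {"person", "dog"}, {"person", "dog", "sofa"}]
g = [math.exp(-len(Y_k)) for Y_k in image_tags]
print([round(g_k, 3) for g_k in g])  # -> [0.368, 0.135, 0.05]
```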

4.2 Generating Instance-Level Learning Curriculum

To generate the instance-level learning curriculum, we explore the class-specific object prior $s_{i,c}^{(k)}$ by using the mask-out strategy (Li et al. 2016). Specifically, for the training instance $\mathbf{x}_i^{(k)}$, we calculate $s_{i,c}^{(k)}$ as:

$$s_{i,c}^{(k)} = p_{2c-1}\big(\phi(\mathbf{x}_i^{(k)} \mid I_k)\big) - p_{2c-1}\big(\psi(\mathbf{x}_i^{(k)} \mid I_k)\big), \tag{17}$$

where $\phi(\mathbf{x}_i^{(k)} \mid I_k)$ and $\psi(\mathbf{x}_i^{(k)} \mid I_k)$ indicate the sub-image formed by the bounding-box region of $\mathbf{x}_i^{(k)}$ within the image $I_k$ and the mask-out image formed by replacing the pixel values within the bounding-box region of $\mathbf{x}_i^{(k)}$ with the fixed mean pixel values pre-computed on ILSVRC 2012, respectively. $p_{2c-1}(\cdot)$ indicates the $(2c-1)$-th dimension of the output


Algorithm 2: Algorithm for optimizing $\mathbf{y}^{(k)}$.
input: Instances $\{\mathbf{x}_i^{(k)}\}_{i=1}^{n_k}$, object detector $\{\mathbf{W},\mathbf{b}\}$, instance confidence weights $\{v_{i,c}^{(k)}\}$, and the weak image label $Y_k$;
output: The pseudo instance labels $\mathbf{y}^{(k)}$;
1 for $i = 1$ to $n_k$ do
2   if $\mathbf{w}_c^T \mathbf{x}_i^{(k)} + b_c < 0$ for all $c = 1, \ldots, C$
3   then $y_{i,c}^{(k)} = -1$ for all $c = 1, \ldots, C$;
4   otherwise
5     $y_{i,c}^{(k)} = 1$ for $c = \hat{c}$,
6     $y_{i,c}^{(k)} = -1$ for $c \ne \hat{c}$,
7     where $\hat{c} = \arg\max_c \mathbf{w}_c^T \mathbf{x}_i^{(k)} + b_c$;
8 if $\sum_{i=1}^{n_k} |y_{i,c^*}^{(k)} + 1| < 2$, where $c^* \in Y_k$, then
9   for $i = 1$ to $n_k$ do
10    if $\sum_{c=1}^{C} |y_{i,c}^{(k)} + 1| = 0$
11    then case = 0,
12      $\Delta_i = \big(1 + \frac{u_k}{n_k}\big) v_{i,c^*}^{(k)} \, \ell\big(1, f(\mathbf{x}_i^{(k)}; \mathbf{w}_{c^*}, b_{c^*})\big)$;
13    otherwise case = 1,
14      $\Delta_i = \big(1 + \frac{u_k}{n_k}\big) v_{i,c^*}^{(k)} \, \ell\big(1, f(\mathbf{x}_i^{(k)}; \mathbf{w}_{c^*}, b_{c^*})\big) + \big(1 + \frac{u_k}{n_k}\big) v_{i,\hat{c}}^{(k)} \, \ell\big(1, f(\mathbf{x}_i^{(k)}; \mathbf{w}_{\hat{c}}, b_{\hat{c}})\big)$,
15      where $\hat{c} = \arg\max_c \mathbf{w}_c^T \mathbf{x}_i^{(k)} + b_c$;
16  $i^* = \arg\min_{i=1,\ldots,n_k} \Delta_i$;
17  if case = 0, then $y_{i^*,c^*}^{(k)} = 1$;
18  if case = 1, then $y_{i^*,c^*}^{(k)} = 1$, $y_{i^*,\hat{c}}^{(k)} = -1$;
19 return $\mathbf{y}^{(k)}$.

$\mathbf{p}$ of a classification network (Li et al. 2016). The network is essentially the AlexNet (Krizhevsky et al. 2012) with a modified loss layer. Specifically, instead of using the $C$-dimensional multi-class loss layer as in the original AlexNet, the adopted network uses a $2C$-dimensional multi-label loss layer, transforming the original training label $\mathbf{r} \in \{0,1\}^C$ into a new label $\mathbf{t} \in \{0,1\}^{2C}$, where

$$t_{2c-1} = \begin{cases} 1, & r_c = 1 \\ 0, & r_c = 0 \end{cases} \quad \text{and} \quad t_{2c} = \begin{cases} 0, & r_c = 1 \\ 1, & r_c = 0. \end{cases} \tag{18}$$

In this way, each odd entry of $\mathbf{t}$ represents whether the image contains the corresponding object, and each odd entry of $\mathbf{p}$ represents the predicted probability that the image contains at least one object instance of the corresponding category. Consequently, in (17), an instance which is more likely to be an object of the $c$-th object category obtains a higher $s_{i,c}^{(k)}$.
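The following minimal sketch (our own illustration) shows the label transform of Eq. (18) and the mask-out prior of Eq. (17); `net` stands for a hypothetical classifier returning the $2C$-dimensional output $\mathbf{p}$.

```python
import numpy as np

def to_multilabel(r):
    """Map r in {0,1}^C to t in {0,1}^(2C) as in Eq. (18)."""
    t = np.empty(2 * len(r), dtype=int)
    t[0::2] = r        # entries t_{2c-1} (odd, 1-indexed): object present
    t[1::2] = 1 - r    # entries t_{2c}: object absent
    return t

def maskout_prior(net, sub_image, masked_image, c):
    """s = p_{2c-1}(sub-image) - p_{2c-1}(mask-out image), with 1-indexed c."""
    return net(sub_image)[2 * c - 2] - net(masked_image)[2 * c - 2]

print(to_multilabel(np.array([1, 0, 1])))  # -> [1 0 0 1 1 0]
```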

4.3 Learning via C-SPCL

Based on the defined learning curriculum, we initialize the image-level confidence weights $u_k$ of the easy images as ones and the others as zeros. Then, we extract the initial training instances from the easy images, which are relatively more likely to be the real objects of interest. Specifically, after obtaining the object priors $S$ of the training instances by adopting the mask-out strategy (as introduced in Sect. 4.2), we use the same strategy as Li et al. (2016) to select the top 50 object proposals as the initial training instances. For initializing the confidence weights $\mathbf{v}$, we set the values of the selected training instances to the corresponding object priors and the others to zero. The instance labels $\mathbf{y}$ are initialized according to the predicted categories of the corresponding class-specific object prior predictor. After obtaining $S$ and initializing $\mathbf{y}$, $\mathbf{v}$, and $\mathbf{u}$, we adopt the proposed C-SPCL to learn the object detectors of multiple object categories via Algorithm 1.

4.4 Object Annotation and Detection

After finishing the learning iterations, we localize the objects in each training image based on the obtained instance labels $\mathbf{y}$. Then, we follow Kumar Singh et al. (2016) to treat the localized instances in the training images as the pseudo ground-truth to train the object detector, which is based on the Fast R-CNN system (Girshick 2015). Specifically, the network is initialized by the VGG16 net (Simonyan and Zisserman 2014) pre-trained on ImageNet, and then fine-tuned by using the pseudo ground-truth instances as the positive training samples, while any other instance that has an IoU less than 0.5 and greater than 0.1 with a pseudo ground-truth instance serves as a negative one. In this way, the Fast R-CNN model can fine-tune its feature representation based on the pseudo ground-truth and thus build strong object detectors without using any human-labeled object bounding-boxes. Finally, in testing, the learned CNN model is used to generate prediction scores for each proposal. Then, non-maximum suppression is applied to remove those detections that overlap more than 50% with the top-scoring detection.
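For completeness, here is a minimal sketch (our own, greedy variant) of the non-maximum suppression step just described, with boxes in (x1, y1, x2, y2) format.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def nms(boxes, scores, thresh=0.5):
    """Greedily keep top-scoring boxes, dropping overlaps above thresh."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        top = order.pop(0)
        keep.append(top)
        order = [j for j in order if iou(boxes[top], boxes[j]) <= thresh]
    return keep
```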

5 Experimental Evaluation

5.1 Experimental Settings

Following the previous works (Bilen et al. 2015; Siva et al. 2012; Gokberk Cinbis et al. 2014), the experimental results are evaluated mainly based on two criteria. The first one is the correct localization rate (CorLoc), which is widely used to evaluate the object annotation/localization performance. CorLoc is the percentage of images containing at least one instance of the target object category for which the most confident instance is localized correctly. The second one is the mean average precision (mAP), which is the standard evaluation protocol used to evaluate the object detection performance. For both criteria, a bounding-box is


considered to be correctly localized or detected if it has an intersection-over-union ratio of at least 50% with a ground-truth object instance.

We evaluate our method on the PASCAL VOC Trainval-2007 (Everingham et al. 2007), Test-2007 (Everingham et al. 2007), and Test-2010 (Everingham et al. 2010) benchmarks, and on subsets of the training and validation sets of COCO-2014 (Lin et al. 2014), which consist of the airplane, bus, cat, dog, and train categories.² In our experiments, there are two hyperparameters (i.e., $\lambda$ and $\eta$) that need to be set manually. Notice that, due to the lack of precise ground-truth, such hyperparameters can hardly be tuned under the setting of weakly supervised learning. Thus, we set these hyperparameters based on their physical meanings. Specifically, we followed previous works (Jiang et al. 2014a, b) to set $\lambda_c$ according to a rough expectation of the number of selected samples. For each object category, our rough expectation of the number of selected samples was 30% (i.e., 0.3) of the total number of instances predicted as belonging to the corresponding object category.³ This is due to the fact that only a small portion of the training instances would confidently contain the objects of interest. We also set $\eta = 2$ to roughly balance the importance between the instance-level prior-knowledge and the image-level prior-knowledge, as the former contains two regularization terms in (10) while the latter only contains one. Although these hyperparameters were only set roughly and without intentionally fitting to the test data, the proposed approach already obtains encouraging results in the following experiments.
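A minimal sketch of this pace-parameter heuristic (our own illustration with synthetic losses) is:

```python
import numpy as np

# lambda_c is set to the loss value of the instance at the 30% position
# when the losses of instances predicted as category c are ranked from
# low to high (see footnote 3).
rng = np.random.default_rng(0)
losses_c = np.sort(rng.random(200))  # hinge losses for category c (synthetic)
lam_c = float(losses_c[int(0.3 * len(losses_c))])
```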

5.2 Analysis of the Learning Regimes

In the first part of our experiments, we implemented comprehensive studies to demonstrate the effectiveness of the proposed C-SPCL regime. The experiments were conducted on the VOC Test-2007 Set, comparing the proposed C-SPCL regime with five baseline learning regimes described as follows:

– only CL The first baseline learned weakly supervised object detectors by only using the image-level learning curriculum. This was implemented by first using the

² We used the images labelled as containing the airplane, bus, cat, dog, and train categories from the training set of COCO (Lin et al. 2014) to form the sub training set of COCO (13034 images in total) used in our experiments. Similarly, we used the images labelled as containing these categories from the validation set of COCO (Lin et al. 2014) to form the sub validation set of COCO (6309 images in total).
³ We set $\lambda_c$ equal to the loss value $\ell(y_{i,c}^{(k)}, f(\mathbf{x}_i^{(k)}; \mathbf{w}_c, b_c))$ of the 30%-th instance (ranked from low to high). Notice that, as the loss values of the instances and the numbers of instances in different object categories are not the same, the concrete $\lambda$ values for different categories are different. Thus, we use $\lambda_c$ in this paper.

training instances with the top 50 object priors in each easy image as the initial training instances (consistent with the initialization stage of the proposed approach) to train SVM classifiers. Then, the instances predicted as the objects of interest in more complex images (indicated by the image-level learning curriculum) were gradually added into the learning procedure to update the SVM classifiers. As the image-level learning curriculum only contains several discrete priority values (see Fig. 4), we added the training images with the next smaller priority value in each learning iteration.

– only SPL This baseline learned weakly supervised object detectors by only using the self-paced learning component at the instance level. It was implemented by directly using the instance-level self-paced learning term $E_{ins}$.

– w/o INP This baseline learned weakly supervised object detectors based on the proposed C-SPCL model but without leveraging the instance-level prior-knowledge. It was implemented by learning with $E_{ins} + E_{ima} + P_{imgC}$.

– w/o IMP This baseline learned weakly supervised object detectors based on the proposed C-SPCL model but without leveraging the image-level prior-knowledge. It was implemented by learning with $E_{ins} + E_{ima} + P_{insD} + P_{insC}$.

– full C-SPCL The full system of the proposed C-SPCL learning regime, implemented by using $E_{ins} + E_{ima} + E_{prior}$.

The experimental results reported in Table 1 demonstrate the following insights: (1) according to the results of only CL and only SPL, learning weakly supervised object detectors with either the CL component or the SPL component alone cannot obtain satisfactory performance, because both overly depending on the prior-knowledge and totally ignoring it are inappropriate ways to solve the WSOD problem. (2) The results of w/o INP and its comparison with full C-SPCL demonstrate the importance of leveraging the instance-level prior-knowledge in the WSOD task and the effectiveness of the proposed learning regime in realizing this insight. (3) Similarly, the results of w/o IMP and its comparison with full C-SPCL demonstrate the importance of leveraging the image-level prior-knowledge in the WSOD task and the effectiveness of the proposed learning regime in realizing this insight. (4) The comparison between w/o INP and w/o IMP demonstrates that, in our framework, removing the instance-level prior-knowledge causes a more obvious performance drop than removing the image-level prior-knowledge. This is reasonable because, on one hand, the instance-level prior-knowledge is the finer-level prior-knowledge; on the other hand, discarding the instance-level prior-knowledge regularizer $P^{ins}_{C}$ would hurt the interaction between the image-level confidence inference and the instance-level confidence inference. (5) Finally, the results of full C-SPCL demonstrate the importance of collaborating the image-level prior-knowledge and the instance-level prior-knowledge in the WSOD task and the effectiveness of the proposed learning regime in realizing this insight.


Table 1 Comparison among the proposed C-SPCL (OURS) and other baseline learning regimes on the VOC Test-2007 Set in terms of mAP

Aero Bike Bird Boat Bot Bus Car Cat Chr Cow Tab Dog Hor Mbik Pson Plat Shep Sofa Trai TV Av

Only CL 48.9 28.3 47.7 27.7 8.2 56.7 54.3 62.5 10.1 34.6 2.9 57.4 62.3 52.9 30.5 15.0 36.7 25.3 64.7 24.3 37.6

Only SPL 59.8 40.0 51.8 38 6.8 55.6 53.8 65.5 6.7 50.5 9.8 60.6 57.8 52.6 26.0 15.3 39.6 41.0 70.6 32.9 41.7

w/o INP 62.4 40.8 49.5 36.2 10.4 57.6 52.5 68.5 9.3 43.8 21.4 60.7 58.3 51.8 23.7 13.4 40.2 41.6 69.6 29.1 42.0

w/o IMP 62.7 45.1 47.9 39.2 4.5 62.5 54.1 65.1 11.9 47.0 18.3 62.0 67.3 54.7 27.2 13.3 40.0 43.5 64.8 28.5 43.0

Full C-SPCL 63.4 55.0 52.8 36.6 10.7 66.3 57.0 69.5 7.2 52.5 14.4 64.6 69.4 57.7 28.4 15.8 43.7 42.3 69.3 40.5 45.9


5.3 Analysis of the Detection Framework

In this section, we implemented ablation studies to analyze the influence of the computational components in different detection strategies. The experiments were conducted on the VOC Test-2007 set. We first compared the proposed approach with seven baseline detection strategies, described as follows:

– SS-A The first baseline adopted the fc7 features extracted from the pre-trained AlexNet (Krizhevsky et al. 2012) to represent each object instance and directly applied the object detectors obtained from the C-SPCL regime to detect objects in the test images, i.e., without using the localized object instances as pseudo ground-truth to train the Fast R-CNN model. The object instances were extracted by the selective search method (Uijlings et al. 2013).

– SS-V This baseline was implemented by replacing the features extracted from the pre-trained AlexNet (Krizhevsky et al. 2012) with the fc7 features extracted from the pre-trained VGG16 net (Simonyan and Zisserman 2014). Other implementations were kept the same as SS-A.

– SS-A-F This baseline adopted the fc7 features extracted from the pre-trained AlexNet (Krizhevsky et al. 2012) to represent each object instance and then trained the AlexNet-based Fast R-CNN model by using the obtained pseudo ground-truth object instances. The object instances were extracted by the selective search method, and the test process followed Fast R-CNN.

– EB-A-F This baseline was implemented with almost the same strategy as SS-A-F. The only difference between them is that EB-A-F extracted object instances by the Edge Box method (Zitnick and Dollár 2014) instead of the selective search method.

– EB-V-F This baseline adopted the fc7 features extracted from the pre-trained VGG16 net (Simonyan and Zisserman 2014) to represent each object instance and then trained the VGG16-based Fast R-CNN model by using the obtained pseudo ground-truth object instances. The object instances were extracted by the Edge Box method, and the test process followed Fast R-CNN.

– EB-V-F-CS This baseline replaced the mask-out strategy (see Sect. 4.2) by directly using the classification scores predicted by the classification network to build the instance-level learning curriculum. Other implementations were kept the same as EB-V-F (a sketch of the mask-out scoring being replaced is given after this list).

– EB-V-F-BR Compared with EB-V-F, this baseline additionally learned the bounding-box regression function in the Fast R-CNN model based on the obtained pseudo ground-truth object locations, which is the full framework of the proposed approach.
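For the EB-V-F-CS comparison above, the mask-out strategy being replaced is defined in Sect. 4.2; the sketch below reflects only the common form of such scoring, under the assumption that a proposal's curriculum value is the classification-score drop caused by blanking its region. `classifier`, `image`, and `box` are illustrative names:

```python
def maskout_score(classifier, image, box, fill_value=0.0):
    """Illustrative mask-out scoring: the confidence drop caused by
    blanking a proposal region. A large drop suggests the region
    carries the class evidence, so proposals can be ranked by this
    value when building the instance-level learning curriculum."""
    full_score = classifier(image)          # score on the intact image
    masked = image.copy()
    x1, y1, x2, y2 = box
    masked[y1:y2, x1:x2] = fill_value       # blank out the proposal
    return full_score - classifier(masked)  # higher => more informative
```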

Table 2 reports the comparison results of this experiment. From the comparison between SS-A and SS-A-F, we can observe that training the Fast R-CNN model based on the pseudo ground-truth inferred by the proposed C-SPCL model improves the detection performance by around 4% in terms of mAP. The comparison between SS-A-F and EB-A-F demonstrates that the object instances extracted by the Edge Box method improve upon those extracted by the selective search method by 1.6% mAP. The comparison between SS-A and SS-V indicates that the deeper CNN model can further benefit the final detection results by nearly 5% mAP, while by using Fast R-CNN to further fine-tune the feature representation, this performance gain increases to around 7% mAP (see the comparison between EB-A-F and EB-V-F). The comparison between EB-V-F and EB-V-F-CS indicates that directly using the classification scores to build the instance-level learning curriculum yields a 3.5% mAP decrease compared with the mask-out strategy used in our framework. This is because the mask-out strategy can provide additional discriminative power for selecting object proposals. Finally, the comparison between EB-V-F and EB-V-F-BR indicates that additionally learning bounding-box regressors based on the inferred pseudo ground-truth object locations can also boost the detection results by around 2.5% mAP.


Table 2 Comparison among the proposed framework (EB-V-F-BR) and other baselines on the VOC Test-2007 Set in terms of mAP

Aero Bike Bird Boat Bot Bus Car Cat Chr Cow Tab Dog Hor Mbik Pson Plat Shep Sofa Trai TV Av

SS-A 52.6 32.2 34.7 27.1 9.2 42.8 45.6 44.0 9.8 33.9 17.6 40.2 41.1 44.3 19.3 7.9 30.0 26.9 50.9 25.9 31.8

SS-V 60.0 36.5 43.0 28.9 9.1 52.5 47.3 61.6 3.3 41.3 21.2 48.7 43.1 49.6 22.4 7.8 37.0 40.1 58.0 21.8 36.7

SS-A-F 57.4 28.3 39.1 27.7 6.4 47.0 47.8 58.7 5.0 39.9 21.1 49.1 39.3 42.5 23.1 12.7 31.1 28.7 57.5 26.6 34.5

EB-A-F 54.3 33.3 41.2 29.2 9.9 50.7 52.8 60.0 7.2 42.3 20.9 48.1 42.0 40.5 26.9 12.2 33.1 31.7 53.1 33.0 36.1

EB-V-F 59.6 51.4 53.1 37.0 10.4 62.7 56.3 63.3 7.6 45.1 14.8 59.8 56.8 54.7 27.5 16.0 40.0 41.7 64.5 42.2 43.2

EB-V-F-CS 67.5 40.1 48.7 33.1 4.9 56.7 57.4 61 5.6 43 27.3 51.3 63.1 50.2 16.8 11.6 31.1 37.5 65.4 21.3 39.7

EB-V-F-BR 63.4 55.0 52.8 36.6 10.7 66.3 57.0 69.5 7.2 52.5 14.4 64.6 69.4 57.7 28.4 15.8 43.7 42.3 69.3 40.5 45.9

Fig. 5 Per-class frequency of error modes as well as the averaged error across all classes for the proposed detection framework


To further analyze the localization errors of the weakly supervised detectors trained by the proposed framework, we followed Cinbis et al. (2017) and categorized each of the extracted object proposals in the positive training images into one of the following five cases: (i) correct localization (overlap ≥ 50%), (ii) proposal (completely) inside ground-truth, (iii) ground-truth (completely) inside proposal, (iv) none of the above, but non-zero overlap (i.e., low overlap), and (v) no overlap. In Fig. 5, we show the frequency of these five cases for each object category and the averaged errors over all classes for the proposed framework. It can be observed that our framework almost eliminates the "proposal inside ground-truth" error, and that, except for "low overlap", "ground-truth inside proposal" is the second-largest error mode. This may be caused by the usage of the mask-out strategy, where smaller proposals usually obtain lower mask-out values. Similar to Cinbis et al. (2017), the "no overlap" error of our framework is very low, indicating that nearly 95% of the selected object proposals overlap to some extent with their corresponding ground-truth bounding boxes.
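Assigning a proposal to one of these five cases depends only on its IoU with, and containment relation to, the ground-truth box. A minimal sketch with our own helper names:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    if inter == 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def contains(outer, inner):
    """True when `inner` lies completely inside `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1] and
            outer[2] >= inner[2] and outer[3] >= inner[3])

def error_mode(proposal, gt):
    """One of the five localization cases of Cinbis et al. (2017)."""
    ov = iou(proposal, gt)
    if ov >= 0.5:
        return "correct localization"
    if contains(gt, proposal):
        return "proposal inside ground-truth"
    if contains(proposal, gt):
        return "ground-truth inside proposal"
    return "low overlap" if ov > 0 else "no overlap"
```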

Fig. 6 Analysis of the influence of the parameter setting of λc (left figure) and η (right figure). The left figure is drawn by fixing η to 2, while the right figure is drawn by fixing λc to select 30% of the training instances

In addition, to test the influence of the parameter settings on the proposed framework, we set several different values for λc and η and report the obtained experimental results in Fig. 6. This experiment was implemented based on EB-V-F. From Fig. 6, we can observe that the values of λc and η affect the performance of our approach by about 2% in terms of the mAP score. To be specific, our approach obtains high performance when λc is set near 0.3 (i.e., selecting 30% of instances for training). This is consistent with our intuition: when λc is too large, the learner selects a large portion of instances during training, which inevitably involves noisy instances in the learning procedure and thus confuses the learner, whereas when λc is set too small, only a very small portion of instances is selected during training, which, even if clean, is not sufficient to train a powerful deep object detector. As for η, the results in Fig. 6 show that its optimal value is around 2, indicating that the image-level prior-knowledge and the instance-level prior-knowledge play equally important roles in this scenario. This experiment demonstrates that our physical-meaning-based strategy can help with parameter setting in practice.

5.4 Comparison with the State-of-the-Arts for Object Localization

In this section, we evaluate the proposed approach by comparing its object annotation/localization performance with a number of other state-of-the-art methods, including Pandey and Lazebnik (2011), Siva et al. (2012), Siva and Xiang (2011), Shi et al. (2015), Cinbis et al. (2017), Zhang et al. (2016), Bilen et al. (2015), Bilen and Vedaldi (2016), Ren et al. (2016), Wang et al. (2014b), Kumar Singh et al. (2016), Kantorov et al. (2016) and Diba et al. (2017).

The quantitative comparison results on the VOC Trainval-2007 set in terms of CorLoc are reported in Table 3. Specifically, the methods located in the top four rows adopted hand-crafted features to represent object instances, which have very limited representation capability compared with the features extracted by modern CNNs. Thus, they cannot obtain performance competitive with the proposed approach. The methods located in the next four rows are recent works using pre-trained CNN features extracted by the AlexNet. As can be seen, these works obtain obvious performance gains compared with the methods in the top four rows, which demonstrates the importance of powerful feature representation in the WSOD task. However, these methods are still not comparable with the proposed approach because, aside from the different learning capability of the weakly supervised learning model, they did not fine-tune their CNN models based on the predicted pseudo ground-truth object locations. The methods located in the bottom seven rows are the most recent works, which usually leverage helpful prior-knowledge from some finely annotated meta-training data (not related to the investigated task) and fine-tune the network during weakly supervised learning. Here, Bilen and Vedaldi(L) indicates the WSDDN method (Bilen and Vedaldi 2016) trained on the same VGG16 net as our approach, and Diba et al.(2S) indicates the two-stage WCNN method (Diba et al. 2017) trained on the VGG16 net, which has comparable experimental settings (e.g., the base network architecture and the training iterations) to our approach. Thus, comparing with these approaches provides a relatively fairer evaluation of our method. From the experimental results, we can observe that the proposed approach significantly outperforms these state-of-the-arts. Some localization results of the proposed approach are also shown in Fig. 7.
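For reference, CorLoc counts the positive images whose single top-scoring detection overlaps some ground-truth box of the target class with IoU ≥ 0.5. A minimal sketch with an inlined IoU helper (the names are ours):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / float(union) if inter else 0.0

def corloc(top_boxes, gt_boxes_per_image, iou_thresh=0.5):
    """Percentage of positive images whose top-scoring detection hits
    (IoU >= iou_thresh) at least one ground-truth box of the class."""
    hits = sum(any(iou(box, g) >= iou_thresh for g in gts)
               for box, gts in zip(top_boxes, gt_boxes_per_image))
    return 100.0 * hits / len(top_boxes)
```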

Besides, we also compare the proposed approach with the state-of-the-art approach of Bilen and Vedaldi (2016) on the sub training set of COCO-2014. The corresponding results are reported in Table 4, where Bilen and Vedaldi (2016) obtains around 41–54% in terms of CorLoc. Under this circumstance, our proposed approach achieves more than 66% in terms of CorLoc, a 13.5–60.5% relative performance gain compared with Bilen and Vedaldi (2016). This experiment implies that our method is not strongly tailored to Pascal VOC 2007 in particular. Instead, the proposed approach can obtain even superior performance in more challenging scenarios.

5.5 Comparison with the State-of-the-Arts for Object Detection

In this section, we evaluate the proposed approach by comparing it with other state-of-the-art methods, such as Pandey and Lazebnik (2011), Siva and Xiang (2011), Russakovsky et al. (2012), Song et al. (2014a), Song et al. (2014b), Cinbis et al. (2017), Bilen et al. (2014), Bilen et al. (2015), Bilen and Vedaldi (2016), Ren et al. (2016), Zhang et al. (2016), Wang et al. (2014b), Kumar Singh et al. (2016), Kantorov et al. (2016) and Diba et al. (2017). The quantitative comparison results on the VOC Test-2007 set in terms of AP scores are shown in Table 5, where the methods located in the top three rows adopted hand-crafted features. Consistent with the performance in the object localization experiment, this group of methods cannot obtain satisfactory performance. The methods located in the next seven rows are works using pre-trained CNN features, mainly extracted by the AlexNet. As mentioned by Bilen et al. (2015), Wang's algorithm (Wang et al. 2014b) requires careful tuning of its parameters to obtain good performance for each category; thus, Wang et al. (2014b) achieves a more obvious performance gain than the other methods in this group. The methods located in the bottom seven rows are the most recent works, which are usually equipped with more advanced network architectures, such as the VGG16 net, and detection frameworks, e.g., Fast R-CNN. As can be seen, the proposed approach significantly outperforms the other state-of-the-arts, including the most recent approaches. In particular, obvious performance gains can be found for object categories such as Aero, Bird, and Dog.
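For reference, the per-class AP on VOC 2007 is conventionally computed with 11-point interpolation of the precision-recall curve; mAP (the "Av" column in Table 5) is then the mean of the per-class APs. A minimal sketch, assuming `recall` and `precision` are NumPy arrays ordered by detection confidence:

```python
import numpy as np

def voc_ap_11pt(recall, precision):
    """11-point interpolated AP (Pascal VOC 2007 protocol): average,
    over recall levels r in {0, 0.1, ..., 1.0}, the maximum precision
    attained at any recall >= r."""
    ap = 0.0
    for r in np.arange(0.0, 1.1, 0.1):
        p = precision[recall >= r]
        ap += (p.max() if p.size else 0.0) / 11.0
    return ap
```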

We further compare the performance of the proposed approach with other state-of-the-art methods on the VOC Test-2010 set and the sub validation set of COCO-2014. The weakly supervised object detectors of the compared approaches are trained on the VOC Trainval-2010 set and the sub training set of COCO-2014, respectively. The experimental results on VOC Test-2010 are reported in Table 6. From these results we can observe that, compared with the VOC Test-2007 set, the VOC Test-2010 set tends to be more challenging, which leads to a performance drop for the recent state-of-the-art approaches. Even so, on this dataset the proposed method is still able to obtain encouraging performance that is superior to the other state-of-the-art approaches. Specifically, our approach achieves 1.3–11.7% higher average mAP than the other state-of-the-arts. Among the compared methods, Diba et al. (2017) achieved the most competitive performance. In Diba et al. (2017), the pseudo ground-truth used for training Fast R-CNN is based on deep features fine-tuned on the Pascal dataset, whereas the pseudo ground-truth obtained by our approach for training Fast R-CNN is based only on deep features pre-trained on the ImageNet dataset. Even so, our proposed approach still achieves superior final detection results.


Table 3 Comparison of the object localization performance on the VOC Trainval-2007 set in terms of CorLoc

Aero Bike Bird Boat Bot Bus Car Cat Chr Cow Tab Dog Hor Mbik Pson Plat Shep Sofa Trai TV Av

Pandey and Lazebnik (2011) 50.9 56.7 – 10.6 0.0 56.6 – – 2.5 – 14.3 – 50.0 53.5 11.2 5.0 – 34.9 33.0 40.6 –

Siva et al. (2012) 45.8 21.8 30.9 20.4 5.3 37.6 40.8 51.6 7.0 29.8 27.5 41.3 41.8 47.3 24.1 12.2 28.1 32.8 48.7 9.4 30.2

Siva and Xiang (2011) 42.4 46.5 18.2 8.8 2.9 40.9 73.2 44.8 5.4 30.5 19.0 34.0 48.8 65.3 8.2 9.4 16.7 32.3 54.8 5.5 30.4

Shi et al. (2015) 67.3 54.4 34.3 17.8 1.3 46.6 60.7 68.9 2.5 32.4 16.2 58.9 51.5 64.6 18.2 3.1 20.9 34.7 63.4 5.9 36.2

Zhang et al. (2016) 71.0 27.2 48.8 40.9 6.6 51.6 46.1 54.6 5.4 58.9 15.5 52.7 60.3 50.6 29.2 17.1 52.1 31.9 56.3 17.6 39.7

Bilen et al. (2015) 66.4 59.3 42.7 20.4 21.3 63.4 74.3 59.6 21.1 58.2 14.0 38.5 49.5 60.0 19.8 39.2 41.7 30.1 50.2 44.1 43.7

Ren et al. (2016) 79.2 56.9 46.0 12.2 15.7 58.4 71.4 48.6 7.2 69.9 16.7 47.4 44.2 75.5 41.2 39.6 47.4 32.3 49.8 18.6 43.9

Wang et al. (2014b) 80.1 63.9 51.5 14.9 21.0 55.7 74.2 43.5 26.2 53.4 16.3 56.7 58.3 69.5 14.1 38.3 58.8 47.2 49.1 60.9 48.5

Kumar Singh et al. (2016) 58.8 – 49.6 15.4 – – 64.9 59.0 – 43.2 – 51.2 57.5 63.1 – – – – 54.4 – –

Cinbis et al. (2017) 65.3 55.0 52.4 48.3 18.2 66.4 77.8 35.6 26.5 67.0 46.9 48.4 70.5 69.1 35.2 35.2 69.6 43.4 64.6 43.7 52.0

Li et al. (2016) 78.2 67.1 61.8 38.1 36.1 61.8 78.8 55.2 28.5 68.8 18.5 49.2 64.1 73.5 21.4 47.4 64.6 22.3 60.9 52.3 52.4

Bilen and Vedaldi(L) (2016) 65.1 58.8 58.5 33.1 39.8 68.3 60.2 59.6 34.8 64.5 30.5 43.0 56.8 82.4 25.5 41.6 61.5 55.9 65.9 63.7 53.5

Diba et al.(2S) (2017) 81.2 70.0 62.5 41.7 38.2 63.4 81.1 57.7 30.4 70.3 21.7 51.0 65.9 75.7 23.9 47.9 67.5 25.6 62.4 53.9 54.6

Kantorov et al. (2016) 83.3 68.6 54.7 23.4 18.3 73.6 74.1 54.1 8.6 65.1 47.1 59.5 67.0 83.5 35.3 39.9 67.0 49.7 63.5 65.2 55.1

OURS 83.2 65.0 72.0 64.6 16.8 75.3 79.1 81.3 23.6 80.1 19.0 77.2 84.3 82.9 53.0 28.6 68.8 56.8 87.0 49.6 62.4

Fig. 7 Some visualization examples of the object localization results. The green bounding-boxes indicate the object localization results generated by our approach, while the red ones are the ground-truth. From the figure we can observe that our approach can correctly localize the objects of interest in some complex image scenes. However, it may still fail when facing images with densely packed small object instances, inaccurate object proposals, and severely occluded objects (Color figure online)

Table 4 Comparison of the object localization performance on the sub training set of COCO-2014 in terms of CorLoc

Aero Bus Cat Dog Train Av

Bilen and Vedaldi(F) (2016) 57.2 54.1 45.8 28.8 45.0 46.2

Bilen and Vedaldi(M) (2016) 61.9 56.9 38.2 25.7 44.8 45.5

Bilen and Vedaldi(L) (2016) 38.6 30.7 43.8 24.5 38.3 35.2

OURS 77.1 60.1 52.7 46.0 69.8 61.1


Table 5 Comparison of the object detection performance on the VOC Test-2007 set in terms of mAP

Aero Bike Bird Boat Bot Bus Car Cat Chr Cow Tab Dog Hor Mbik Pson Plat Shep Sofa Trai TV Av

Pandey and Lazebnik (2011) 11.5 – – 3.0 – – – – – – – – 20.3 9.1 – – – – 13.2 – –

Siva and Xiang (2011) 13.4 44.0 3.1 3.1 0.0 31.2 43.9 7.1 0.1 9.3 9.9 1.5 29.4 38.3 4.6 0.1 0.4 3.8 34.2 0.0 13.9

Russakovsky et al. (2012) 30.8 25.0 – 3.6 – 26.0 – – – – – – 21.3 29.9 – – – – – – 15.0

Song et al. (2014a) 27.6 41.9 19.7 9.1 10.4 35.8 39.1 33.6 0.6 20.9 10.0 27.7 29.4 39.2 9.1 19.3 20.5 17.1 35.6 7.1 22.7

Song et al. (2014b) 36.3 47.6 23.3 12.3 11.1 36.0 46.6 25.4 0.7 23.5 12.5 23.5 27.9 40.9 14.8 19.2 24.2 17.1 37.7 11.6 24.6

Ren et al. (2016) 41.3 39.7 22.1 9.5 3.9 41.0 45.0 19.1 1.0 34.0 16.0 21.3 32.5 43.4 21.9 19.7 21.5 22.3 36.0 18.0 25.5

Bilen et al. (2014) 42.2 43.9 23.1 9.2 12.5 44.9 45.1 24.9 8.3 24.0 13.9 18.6 31.6 43.6 7.6 20.9 26.6 20.6 35.9 29.6 26.4

Bilen et al. (2015) 46.2 46.9 24.1 16.4 12.2 42.2 47.1 35.2 7.8 28.3 12.7 21.5 30.1 42.4 7.8 20.0 26.8 20.8 35.8 29.6 27.7

Zhang et al. (2016) 45.4 21.8 35.0 23.8 9.2 50.1 43.0 41.8 1.8 26.9 27.6 37.9 41.2 43.7 17.0 11.9 24.8 22.5 48.8 25.9 30.1

Wang et al. (2014b) 48.8 41.0 23.6 12.1 11.1 42.7 40.9 35.5 11.1 36.6 18.4 35.3 34.8 51.3 17.2 17.4 26.8 32.8 35.1 45.6 30.9

Kumar Singh et al. (2016) 53.9 – 37.7 13.7 – – 56.6 51.3 – 24.0 – 38.5 47.9 47.0 – – – – 48.4 – –

Cinbis et al. (2017) 39.3 43.0 28.8 20.4 8.0 45.5 47.9 22.1 8.4 33.5 23.6 29.2 38.5 47.9 20.3 20.0 35.8 30.8 41.0 20.1 30.2

Bilen and Vedaldi(L) (2016) 39.4 50.1 31.5 16.3 12.6 64.5 42.8 42.6 10.1 35.7 24.9 38.2 34.4 55.6 9.4 14.7 30.2 40.7 54.7 46.9 34.8

Kantorov et al. (2016) 57.1 52.0 31.5 7.6 11.5 55.0 53.1 34.1 1.7 33.1 49.2 42.0 47.3 56.6 15.3 12.8 24.8 48.9 44.4 47.8 36.3

Li et al. (2016) 54.5 47.4 41.3 20.8 17.7 51.9 63.5 46.1 21.8 57.1 22.1 34.4 50.5 61.8 16.2 29.9 40.7 15.9 55.3 40.2 39.5

Diba et al.(2S) (2017) 48.2 58.9 37.3 27.8 15.3 69.8 55.2 41.1 10.1 42.7 28.6 40.4 47.3 62.3 12.9 21.2 44.3 52.2 59.1 53.1 41.4

OURS 63.4 55.0 52.8 36.6 10.7 66.3 57.0 69.5 7.2 52.5 14.4 64.6 69.4 57.7 28.4 15.8 43.7 42.3 69.3 40.5 45.9

Table 6 Comparison of the object detection performance on the VOC Test-2010 set in terms of mAP

Aero Bike Bird Boat Bot Bus Car Cat Chr Cow Tab Dog Hor Mbik Pson Plat Shep Sofa Trai TV Av

Kumar Singh et al. (2016) 53.5 – 37.5 8.0 – – 44.2 49.4 – 33.7 – 43.8 42.5 47.6 – – – – 40.6 – –

Cinbis et al. (2017) 44.6 42.3 25.5 14.1 11.0 44.1 36.3 23.2 12.2 26.1 14.0 29.2 36.0 54.3 20.7 12.4 26.5 20.3 31.2 23.7 27.4

Bilen and Vedaldi(L) (2016) 66.3 33.2 47.2 25.3 6.3 47.7 30.1 61.7 5.2 43.2 2.5 61.7 50.7 51.5 21.1 4.8 30.5 13.9 52.8 16.0 33.6

Bilen and Vedaldi(F) (2016) 54.0 54.1 35.8 8.8 19.7 53.7 37.8 34.2 8.6 29.3 6.7 39.0 47.7 62.6 12.1 18.2 29.5 23.1 46.3 38.0 33.0

Bilen and Vedaldi(M) (2016) 54.3 47.7 34.5 18.4 20.7 55.2 40.7 31.6 10.2 34.5 17.6 27.3 44.3 61.1 9.4 19.8 32.0 28.1 48.1 40.7 33.8

Kantorov et al. (2016) 63.4 53.1 34.7 6.0 15.6 52.3 43.2 33.0 6.1 34.3 38.2 47.7 44.2 60.5 21.1 16.6 24.1 33.0 32.1 46.3 35.3

Diba et al.(2S) (2017) – – – – – – – – – – – – – – – – – – – – 37.8

OURS 68.2 41.9 50.6 27.3 10.9 55.1 38.0 67.6 11.5 46.8 2.6 66.3 55.4 58.2 28.6 9.8 33.7 18.1 66.2 25.1 39.1

Table 7 Comparison of the object detection performance on the sub validation set of COCO-2014 in terms of mAP

Aero Bus Cat Dog Train Av

Bilen and Vedaldi(F) (2016) 28.8 35.1 15.6 28.9 33.8 28.4

Bilen and Vedaldi(M) (2016) 34.1 39.0 14.3 30.4 24.3 28.4

Bilen and Vedaldi(L) (2016) 17.5 15.3 12.6 27.8 35.1 21.7

OURS 47.2 41.4 45.6 30.6 56.2 44.2


The experimental results on the sub validation set of COCO-2014 are reported in Table 7. From these results we can observe that the COCO dataset is far more challenging than VOC-2010: the state-of-the-art WSOD method of Bilen and Vedaldi (2016) only obtains around 23–32% in terms of mAP. Under this circumstance, the proposed approach still obtains a 32.8–78.4% relative performance gain compared with Bilen and Vedaldi (2016).⁴ Even so, the performance of the proposed approach on this dataset is still far from satisfactory. To the best of our knowledge, this is because the objects in the COCO dataset are usually small, and thus most of the extracted object proposal regions cannot cover the object of interest, leading to dramatically increased learning ambiguity that challenges all WSOD methods.

⁴ The results of Bilen and Vedaldi (2016) in Tables 4, 6 and 7 were obtained with our implementation.


6 Conclusion

In this paper, we proposed a powerful WSOD framework with a novel C-SPCL learning regime, which collaborates the instance-level confidence inference and the image-level confidence inference in a joint optimization process, builds helpful learning curricula based on prior-knowledge to guide the confidence inference throughout the entire learning procedure, and embeds the self-paced learning mechanism to increase learning robustness. Comprehensive experiments on widely used benchmark datasets have demonstrated the insights revealed in this paper as well as the effectiveness of the proposed approach. In the future, we plan to apply the proposed learning framework to more weakly supervised computer vision tasks, such as object co-segmentation (Han et al. 2018a; Wang et al. 2014a) and co-saliency detection (Zhang et al. 2018; Yao et al. 2017).

Acknowledgements This work was supported in part by the "National Key R&D Program of China" (2017YFB0502904), the National Science Foundation of China under Grants 61876140 and 61773301, the Fundamental Research Funds for the Central Universities under Grant JBZ170401, and the China Postdoctoral Support Scheme for Innovative Talents under Grant BX20180236.

References

Alexe, B., Deselaers, T., & Ferrari, V. (2010). What is an object? In CVPR.

Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009). Curriculum learning. In ICML.

Bilen, H., Pedersoli, M., & Tuytelaars, T. (2014). Weakly supervised object detection with posterior regularization. In BMVC.

Bilen, H., Pedersoli, M., & Tuytelaars, T. (2015). Weakly supervised object detection with convex clustering. In CVPR.

Bilen, H., & Vedaldi, A. (2016). Weakly supervised deep detection networks. In CVPR.

Chen, X., & Gupta, A. (2015). Webly supervised learning of convolutional networks. In ICCV.

Cinbis, R. G., Verbeek, J., & Schmid, C. (2017). Weakly supervised object localization with multi-fold multiple instance learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(1), 189–203.

Deselaers, T., Alexe, B., & Ferrari, V. (2010). Localizing objects while learning their appearance. In ECCV.

Deselaers, T., Alexe, B., & Ferrari, V. (2012). Weakly supervised localization and learning with generic knowledge. International Journal of Computer Vision, 100(3), 275–293.

Diba, A., Sharma, V., Pazandeh, A., Pirsiavash, H., & Van Gool, L. (2017). Weakly supervised cascaded convolutional networks. In CVPR.

Everingham, M., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2), 303–338.

Everingham, M., Zisserman, A., Williams, C. K., Van Gool, L., Allan, M., Bishop, C. M., Chapelle, O., Dalal, N., Deselaers, T., Dorkó, G., et al. (2007). The pascal visual object classes challenge 2007 (voc2007) results.

Girshick, R. (2015). Fast r-cnn. In ICCV.

Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR.

Gokberk Cinbis, R., Verbeek, J., & Schmid, C. (2014). Multi-fold mil training for weakly supervised object localization. In CVPR.

Han, J., Quan, R., Zhang, D., & Nie, F. (2018a). Robust object co-segmentation using background prior. IEEE Transactions on Image Processing, 27(4), 1639–1651.

Han, J., Zhang, D., Cheng, G., Liu, N., & Xu, D. (2018b). Advanced deep-learning techniques for salient and category-specific object detection: A survey. IEEE Signal Processing Magazine, 35(1), 84–100.

Han, L., Zhang, D., Huang, D., Chang, X., Ren, J., Luo, S., & Han, J. (2017). Self-paced mixture of regressions. In IJCAI.

Jiang, L., Meng, D., Mitamura, T., & Hauptmann, A. G. (2014a). Easy samples first: Self-paced reranking for zero-example multimedia search. In ACM-MM.

Jiang, L., Meng, D., Yu, S.-I., Lan, Z., Shan, S., & Hauptmann, A. (2014b). Self-paced learning with diversity. In NIPS.

Jiang, L., Meng, D., Zhao, Q., Shan, S., & Hauptmann, A. G. (2015). Self-paced curriculum learning. In AAAI.

Jie, Z., Wei, Y., Jin, X., Feng, J., & Liu, W. (2017). Deep self-taught learning for weakly supervised object localization. In CVPR.

Kantorov, V., Oquab, M., Cho, M., & Laptev, I. (2016). Contextlocnet: Context-aware deep network models for weakly supervised localization. In ECCV.

Khan, F., Mutlu, B., & Zhu, X. (2011). How do humans teach: On curriculum learning and teaching dimension. In NIPS.

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In NIPS.

Kumar, M. P., Packer, B., & Koller, D. (2010). Self-paced learning for latent variable models. In NIPS.

Kumar Singh, K., Xiao, F., & Jae Lee, Y. (2016). Track and transfer: Watching videos to simulate strong human supervision for weakly-supervised object detection. In CVPR.

Li, D., Huang, J.-B., Li, Y., Wang, S., & Yang, M.-H. (2016). Weakly supervised object localization with progressive domain adaptation. In CVPR.

Lin, T.-Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C. L., & Dollár, P. (2014). Microsoft coco: Common objects in context. arXiv preprint arXiv:1405.0312.

Meng, D., Zhao, Q., & Jiang, L. (2017). Theoretical understanding of self-paced learning. Information Sciences, 414, 319–328.

Pandey, M., & Lazebnik, S. (2011). Scene recognition and weakly supervised object localization with deformable part-based models. In ICCV.

Ren, W., Huang, K., Tao, D., & Tan, T. (2016). Weakly supervised large scale object localization with multiple instance learning and bag splitting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2), 405–416.

Russakovsky, O., Lin, Y., Yu, K., & Fei-Fei, L. (2012). Object-centric spatial pooling for image classification. In ECCV.

Shi, M., & Ferrari, V. (2016). Weakly supervised object localization using size estimates. In ECCV.

Shi, Z., Hospedales, T. M., & Xiang, T. (2015). Bayesian joint modelling for object localisation in weakly labelled images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(10), 1959–1972.

Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Siva, P., Russell, C., & Xiang, T. (2012). In defence of negative mining for annotating weakly labelled data. In ECCV.


Siva, P., Russell, C., Xiang, T., & Agapito, L. (2013). Looking beyond the image: Unsupervised learning for object saliency and detection. In CVPR.

Siva, P., & Xiang, T. (2011). Weakly supervised object detector learning with model drift detection. In ICCV.

Song, H. O., Girshick, R., Jegelka, S., Mairal, J., Harchaoui, Z., & Darrell, T. (2014a). On learning to localize objects with minimal supervision. arXiv preprint arXiv:1403.1024.

Song, H. O., Lee, Y. J., Jegelka, S., & Darrell, T. (2014b). Weakly-supervised discovery of visual pattern configurations. In NIPS.

Spitkovsky, V. I., Alshawi, H., & Jurafsky, D. (2009). Baby steps: How less is more in unsupervised dependency parsing. NIPS: Grammar Induction, Representation of Language and Language Learning.

Supancic, J. S., & Ramanan, D. (2013). Self-paced learning for long-term tracking. In CVPR.

Tang, Y., Yang, Y.-B., & Gao, Y. (2012). Self-paced dictionary learning for image classification. In ACM-MM.

Uijlings, J. R., Van De Sande, K. E., Gevers, T., & Smeulders, A. W. (2013). Selective search for object recognition. International Journal of Computer Vision, 104(2), 154–171.

Wang, L., Hua, G., Sukthankar, R., Xue, J., & Zheng, N. (2014a). Video object discovery and co-segmentation with extremely weak supervision. In ECCV.

Wang, C., Ren, W., Huang, K., & Tan, T. (2014b). Weakly supervised object localization with latent category learning. In ECCV.

Yang, X., Song, Q., & Wang, Y. (2007). A weighted support vector machine for data classification. International Journal of Pattern Recognition and Artificial Intelligence, 21(05), 961–976.

Yao, X., Han, J., Zhang, D., & Nie, F. (2017). Revisiting co-saliency detection: A novel approach based on two-stage multi-view spectral rotation co-clustering. IEEE Transactions on Image Processing, 26(7), 3196–3209.

Zhang, D., Fu, H., Han, J., Borji, A., & Li, X. (2018). A review of co-saliency detection algorithms: Fundamentals, applications, and challenges. ACM Transactions on Intelligent Systems and Technology, 9(4), 38.

Zhang, D., Meng, D., & Han, J. (2017a). Co-saliency detection via a self-paced multiple-instance learning framework. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(5), 865–878.

Zhang, D., Meng, D., Zhao, L., & Han, J. (2016). Bridging saliency detection to weakly supervised object detection based on self-paced curriculum learning. In IJCAI.

Zhang, D., Yang, L., Meng, D., Xu, D., & Han, J. (2017b). Spftn: A self-paced fine-tuning network for segmenting objects in weakly labelled videos. In CVPR.

Zitnick, C. L., & Dollár, P. (2014). Edge boxes: Locating object proposals from edges. In ECCV.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
