
Object Recognition with and without Objects

Zhuotun Zhu, Lingxi Xie, Alan Yuille
Johns Hopkins University, Baltimore, MD, USA
{zhuotun, 198808xc, alan.l.yuille}@gmail.com

Abstract
While recent deep neural networks have achieved promising performance on object recognition, they rely implicitly on the visual contents of the whole image. In this paper, we train deep neural networks on the foreground (object) and background (context) regions of images, respectively. Considering human recognition in the same situations, networks trained on the pure background without objects achieve highly reasonable recognition performance that beats humans by a large margin when only context is given. However, humans still outperform networks when only the pure object is available, which indicates that networks and human beings have different mechanisms for understanding an image. Furthermore, we straightforwardly combine multiple trained networks to explore the different visual cues learned by different networks. Experiments show that useful visual hints can be explicitly learned separately and then combined to achieve higher performance, which verifies the advantages of the proposed framework.

1 Introduction
Object recognition is a long-lasting battle in computer vision, which aims to categorize an image according to its visual contents. In recent years, we have witnessed an evolution in this research field. Thanks to the availability of large-scale image datasets [Deng et al., 2009] and powerful computational resources, it has become possible to train very deep convolutional neural networks (CNNs) [Krizhevsky et al., 2012], which are much more effective than the conventional Bag-of-Visual-Words (BoVW) model [Csurka et al., 2004].

It is known that an image contains both foreground and background visual contents. However, most object recognition algorithms focus on recognizing the visual patterns only on the foreground region [Zeiler and Fergus, 2014]. Although it has been shown that background (context) information also helps recognition [Simonyan and Zisserman, 2014], it still remains unclear whether a deep network can be trained individually to learn visual information only from the background region. In addition, we are interested in exploring different visual patterns by training neural networks on the foreground and background regions separately for object recognition, which has received little attention before.

Figure 1: Procedures of dataset generation. First, we denote the original set as OrigSet, which is divided into two subsets: one with ground-truth bounding boxes (W/ BBX) and one without (W/O BBX). The subset with labelled bounding box(es) is further processed by setting the regions inside all ground-truth boxes to 0's to compose BGSet, while cropping the regions out produces FGSet. Finally, the images without bounding boxes are added to FGSet to construct HybridSet. Note that some images in FGSet contain black (all-0) regions: these images are labelled with multiple objects of the same class and are cropped to the smallest rectangular frame enclosing all object bounding boxes, in order to keep as little background information as possible in FGSet. Best viewed in color.

In this work, we investigate the above problems by explicitly training multiple networks for object recognition. We first construct datasets from ILSVRC2012 [Russakovsky et al., 2015], i.e., one foreground set and one background set, by taking advantage of the ground-truth bounding box(es) provided in both training and testing cases. After dataset construction, we train deep networks individually to learn foreground (object) and background (context) information, respectively. We find that, even when trained only on pure background contexts, a deep network can still converge and make reasonable predictions (14.4% top-1 and nearly 30% top-5 classification accuracy on the background validation set). For comparison, we are further interested in human recognition performance on the constructed datasets. Deep neural networks outperform non-expert humans in fine-grained recognition, and humans sometimes make errors because they cannot memorize all categories of the dataset [Russakovsky et al., 2015].


In this case, to compare the recognition ability of humans and deep networks more fairly, we follow [Huh et al., 2016] and merge all 1,000 fine-grained categories of the original ILSVRC2012, resulting in a 127-class recognition problem while keeping the number of training/testing images unchanged. We find that human beings tend to pay more attention to the object, while networks put more emphasis on context than humans do for classification. By visualizing the patterns captured by the background net, we find that some visual patterns are not available in the foreground net. Therefore, we apply networks to the foreground and background regions respectively, using the given ground-truth bounding box(es) or, when these are unavailable, extracted object proposals. We find that a linear combination of multiple neural networks gives higher performance.

To summarize, our main contributions are threefold: 1) We demonstrate that learning foreground and background visual contents separately is beneficial for object recognition. Training a network on pure background, although weird and challenging, is technically feasible and captures highly useful visual information. 2) We conduct human recognition experiments on either pure background or pure foreground regions and find that human beings outperform networks on pure foreground while being beaten by networks on pure background, which implies that networks and humans understand an image through different mechanisms. 3) We straightforwardly combine multiple neural networks to explore the effectiveness of the different learned visual clues under two conditions, with and without ground-truth bounding box(es), which gives promising improvements over the baseline deep neural networks.

2 Related Work
Object recognition is fundamental in the computer vision field; it aims to understand the semantic meaning of an image by analyzing its visual contents. Recently, researchers have extended the traditional settings [Lazebnik et al., 2006] to fine-grained [Wah et al., 2011] [Nilsback and Zisserman, 2008] [Lin et al., 2015] and large-scale [Xiao et al., 2010] [Griffin et al., 2007] tasks. Before the explosive development of deep learning, the dominant BoVW model [Csurka et al., 2004] represented every single image with a high-dimensional vector. It is typically composed of three consecutive steps, i.e., descriptor extraction [Lowe, 2004] [Dalal and Triggs, 2005], feature encoding [Wang et al., 2010] [Perronnin et al., 2010] and feature summarization [Lazebnik et al., 2006].

The milestone Convolutional Neural Network (CNN) is treated as a hierarchical model for large-scale visual recognition. Neural networks had already been proved effective for simple recognition tasks [LeCun et al., 1990]. More recently, the availability of large-scale training data (e.g., ImageNet [Deng et al., 2009]) and powerful computational resources such as GPUs have made it practical to train deep neural networks [Krizhevsky et al., 2012] [Zhu et al., 2016] [Fang et al., 2015] [He et al., 2016b], which significantly outperform the conventional models. Deep features have also proved very successful on vision tasks such as object discovery [Wang et al., 2015b] and object recognition [Xie et al., 2017]. A CNN is composed of numerous stacked layers, in which responses from the previous layer are convolved and then passed through a differentiable non-linear activation [Nair and Hinton, 2010]. Recently, several efficient methods were proposed to help CNNs converge faster and prevent over-fitting [Krizhevsky et al., 2012]. It is believed that deeper networks produce better recognition results [Szegedy et al., 2015] [Simonyan and Zisserman, 2014], but they also require engineering tricks to be trained well [Ioffe and Szegedy, 2015] [He et al., 2016a].

Very few techniques for background modeling [Bewley and Upcroft, 2017] have been developed for object recognition, despite the huge success of deep learning methods on various vision tasks. [Shelhamer et al., 2016] proposed fully convolutional networks (FCN) for semantic segmentation, which were further trained on foreground and background regions defined by shape masks; they found that learning a specifically designed background model is not vital. For face matching, [Sanderson and Lovell, 2009] developed methods on cropped-out faces only, to alleviate possible correlations between faces and their backgrounds. [Han et al., 2015] modeled the background in order to detect salient objects against it. [Doersch et al., 2014] showed that using an object patch to predict its context as supervisory information can help discover object clusters, which is consistent with our motivation to utilize pure context for visual recognition. To the best of our knowledge, we are the first to explicitly learn both foreground and background models and then combine them for the benefit of object recognition.

Recently, researchers have paid more attention to human experiments on object recognition. Zhou et al. [Zhou et al., 2015] invited Amazon Mechanical Turk (AMT) workers to identify the concept in segmented images containing objects. They found that a CNN trained for scene classification automatically discovers meaningful object patches. In our experiments, by contrast, we are particularly interested in the different emphases of human beings and networks in the recognition task.

Last but not least, visualization of CNN activations is an effective way to understand the mechanism of CNNs. In [Zeiler and Fergus, 2014], a de-convolutional operation was proposed to capture visual patterns on different layers of a trained network. [Simonyan and Zisserman, 2014] and [Cao et al., 2015] show that different sets of neurons are activated when a network is used for detecting different visual patterns. In this work, we use a much simpler visualization method inspired by [Wang et al., 2015a].

3 Training Networks

Our goal is to explore the possibility and effectiveness of training networks on foreground and background regions, respectively. Here, foreground and background regions are defined by the annotated ground-truth bounding box(es) of each image. All experiments are conducted on datasets constructed from ILSVRC2012.


Dataset     Image Description                    # Training Images   # Testing Images   Testing Accuracy (top-1, top-5)
OrigSet     Original Image                       1,281,167           50,000             58.19%, 80.96%
FGSet       Foreground Image                     544,539             50,000             60.82%, 83.43%
BGSet       Background Image                     289,031             50,000             14.41%, 29.62%
HybridSet   Original Image or Foreground Image   1,281,167           50,000             61.29%, 83.85%

Table 1: The configuration of the different image datasets derived from ILSVRC2012. The last column gives the testing performance of the AlexNet trained on the corresponding dataset, in terms of top-1 and top-5 classification accuracy; e.g., BGNet gives 14.41% top-1 and 29.62% top-5 accuracy on the testing images of BGSet.

3.1 Data Preparation
The ILSVRC2012 dataset [Russakovsky et al., 2015] contains about 1.3M training and 50K validation images. Throughout this paper, we refer to the original dataset as OrigSet and regard the validation images as our testing set. Within OrigSet, 544,539 training images and all 50,000 testing images are labeled with at least one ground-truth bounding box. For each image, only one type of object is annotated, according to its ground-truth class label.

We construct three variants of training sets and two variants of testing sets from OrigSet, as detailed below. An illustrative example of data construction is shown in Figure 1, and the configuration of the different image datasets is summarized in Table 1.

• The foreground dataset (FGSet) is composed of all images with at least one available ground-truth bounding box. For each image, we first compute the smallest rectangular frame that includes all object bounding boxes, and then crop the image inside this frame to be used as training/testing data. Note that if an image has multiple object bounding boxes belonging to the same class, we set all background regions inside the frame to 0's to keep as little context as possible in FGSet. In total, FGSet contains 544,539 training images and 50,000 testing images. Since the annotation is at the bounding-box level, images in FGSet may still contain some background information.

• The construction of the background dataset (BGSet) consists of two stages. First, for each image with at least one ground-truth bounding box available, the regions inside every ground-truth bounding box are set to 0's. It can happen that almost all pixels of an image are set to 0's if its object covers nearly the entire region. Therefore, during training we discard samples with less than 50% of their background pixels preserved, i.e., those whose foreground frame covers more than half of the entire image, so as to avoid such less meaningful background contents (see Figure 1). During testing, however, we keep all processed images. In the end, 289,031 training images and 50,000 testing images are preserved (a construction sketch is given after this list).

• To increase the amount of training data for foreground classification, we also construct a hybrid dataset, abbreviated as HybridSet. The HybridSet is composed of all images of the original training set. If at least one ground-truth bounding box is available, we pre-process the image as described for FGSet; otherwise, we simply keep the image unchanged. As bounding-box annotation is available for every testing case, HybridSet and FGSet contain the same testing data. Training with HybridSet can be understood as a semi-supervised learning process.
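To make the construction rules above concrete, the following NumPy sketch processes one annotated training image into its FGSet and BGSet counterparts. It is a minimal illustration under our own conventions: make_fg_bg and the (x1, y1, x2, y2) box format are hypothetical names, not taken from any released code.

    import numpy as np

    def make_fg_bg(image, boxes):
        """image: H x W x 3 uint8 array; boxes: list of (x1, y1, x2, y2)
        ground-truth boxes, all belonging to the same class.
        Returns (fg_crop, bg_image); bg_image is None when less than 50% of
        the pixels survive as background (the training-time filter)."""
        h, w = image.shape[:2]

        # FGSet: crop the smallest rectangle enclosing all boxes.
        x1 = min(b[0] for b in boxes); y1 = min(b[1] for b in boxes)
        x2 = max(b[2] for b in boxes); y2 = max(b[3] for b in boxes)
        fg_crop = image[y1:y2, x1:x2].copy()

        # With multiple boxes, zero the background left inside the frame.
        if len(boxes) > 1:
            keep = np.zeros((h, w), dtype=bool)
            for bx1, by1, bx2, by2 in boxes:
                keep[by1:by2, bx1:bx2] = True
            fg_crop[~keep[y1:y2, x1:x2]] = 0

        # BGSet: zero the inside of every ground-truth box.
        bg_image = image.copy()
        for bx1, by1, bx2, by2 in boxes:
            bg_image[by1:by2, bx1:bx2] = 0

        # Discard training samples whose enclosing frame covers more than
        # half of the image, i.e., less than 50% background preserved.
        if (x2 - x1) * (y2 - y1) > 0.5 * h * w:
            bg_image = None
        return fg_crop, bg_image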

3.2 Training and Testing
We train the milestone AlexNet [Krizhevsky et al., 2012] using the CAFFE library [Jia et al., 2014] on each of the training sets described in Sec 3.1.

The base learning rate is set to 0.01 and reduced by a factor of 10 every 100,000 iterations. The momentum is set to 0.9 and the weight decay parameter to 0.0005. A total of 450,000 iterations is conducted, which corresponds to around 90 training epochs on the original dataset. Note that both FGSet and BGSet contain fewer images than OrigSet and HybridSet, which leads to a larger number of training epochs for the same number of iterations. In these cases, we raise the dropout ratio to 0.7 to avoid over-fitting. We refer to the network trained on OrigSet as OrigNet, and similar abbreviations apply to the other cases, i.e., FGNet, BGNet and HybridNet.
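For reference, the step schedule above can be written out explicitly. The batch size of 256 below is not stated in the paper; it is an assumption implied by 450,000 iterations corresponding to roughly 90 epochs over 1,281,167 images (and it matches the standard AlexNet setting).

    BASE_LR, GAMMA, STEP_SIZE = 0.01, 0.1, 100_000   # values from Sec 3.2
    BATCH_SIZE = 256                                  # assumed, see above

    def learning_rate(iteration):
        """Step policy: divide the base learning rate by 10 every 100,000 iterations."""
        return BASE_LR * GAMMA ** (iteration // STEP_SIZE)

    epochs = 450_000 * BATCH_SIZE / 1_281_167         # ~89.9 epochs on OrigSet
    print(learning_rate(0), round(learning_rate(250_000), 6), round(epochs, 1))  # 0.01 0.0001 89.9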

During testing, we report results using the common data augmentation of averaging 10 patches, i.e., 5 crops and their 5 flips. After all forward passes are done, the averaged output of the final (fc-8) layer is used for prediction. We adopt the MatConvNet [Vedaldi and Lenc, 2015] platform for performance evaluation.
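A minimal NumPy sketch of this 10-view testing protocol is shown below: the four corner crops and the center crop, plus their horizontal flips, with the fc-8 outputs averaged. The forward_fc8 callable stands in for a forward pass of the trained network; the 256/227 sizes are the usual AlexNet settings rather than values stated in the paper.

    import numpy as np

    CROP = 227  # AlexNet input resolution

    def ten_crop(image):
        """image: 256 x 256 x 3 array -> list of ten 227 x 227 x 3 views."""
        h, w = image.shape[:2]
        offsets = [(0, 0), (0, w - CROP), (h - CROP, 0),
                   (h - CROP, w - CROP), ((h - CROP) // 2, (w - CROP) // 2)]
        crops = [image[y:y + CROP, x:x + CROP] for y, x in offsets]
        return crops + [c[:, ::-1] for c in crops]        # horizontal flips

    def predict(image, forward_fc8):
        """Average the final-layer (fc-8) responses over the ten views."""
        fc8 = np.stack([forward_fc8(view) for view in ten_crop(image)])
        return int(np.argmax(fc8.mean(axis=0)))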

4 Experiments
The testing accuracy of AlexNet trained on each dataset is given in the last column of Table 1. We find that BGNet produces reasonable classification results: 14.41% top-1 and 29.62% top-5 accuracy (while random guessing gives 0.1% and 0.5%, respectively), which is somewhat surprising considering that it makes classification decisions only from background contents, without any foreground objects given. This demonstrates that deep neural networks are capable of learning pure context to infer objects even when they are fully occluded. Not surprisingly, HybridNet gives better performance than FGNet due to the larger amount of available training data.

4.1 Human Recognition
As stated before, to reduce the chance that human beings misclassify images simply because of the large number of classes (up to 1,000) in the original ILSVRC2012, we follow [Huh et al., 2016] and merge all the fine-grained categories into a 127-class recognition problem, keeping the number of training/testing images unchanged.


Dataset       AlexNet          Human
OrigSet       58.19%, 80.96%   −, 94.90%*
BGSet         14.41%, 29.62%   −, −
OrigSet-127   73.16%, 93.28%   −, −
FGSet-127     75.32%, 93.87%   81.25%, 95.83%
BGSet-127     41.65%, 73.79%   18.36%, 39.84%

Table 2: Classification accuracy (top-1, top-5) on five sets, by deep neural networks and by humans, respectively. The entry marked * is the human top-5 performance reported by [Russakovsky et al., 2015].

Network     OrigSet          FGSet            BGSet
OrigNet     58.19%, 80.96%   50.73%, 74.11%   3.83%, 9.11%
FGNet       33.42%, 53.72%   60.82%, 83.43%   1.44%, 4.53%
BGNet       4.26%, 10.73%    1.69%, 5.34%     14.41%, 29.62%
HybridNet   52.89%, 76.61%   61.29%, 83.85%   3.48%, 9.05%

Table 3: Cross-evaluation accuracy (top-1, top-5) of the four networks on the three testing sets. Note that the testing set of HybridSet is identical to that of FGSet.

To distinguish the merged 127-class datasets from the previous datasets, we refer to them as OrigSet-127, FGSet-127 and BGSet-127, respectively. We then invite volunteers who are familiar with the merged 127 classes to perform the recognition task on BGSet-127 and FGSet-127. The volunteers are given 256 images covering all 127 classes, and each image takes around two minutes for the top-5 decisions. We do not evaluate humans on OrigSet-127, since we believe humans can perform as well on this set as on OrigSet. Human performance on OrigSet (marked by * in Table 2) is reported by [Russakovsky et al., 2015].

Table 2 gives the testing recognition performance of human beings and of the trained AlexNet on the different datasets. It is well known that humans are good at recognizing natural images [Russakovsky et al., 2015]; e.g., on OrigSet, human labelers achieve much higher performance than AlexNet. We also find that human beings surpass the networks on foreground (object-level) recognition, by 5.93% and 1.96% in terms of top-1 and top-5 accuracy. Surprisingly, AlexNet beats human labelers by a large margin on the background dataset BGSet-127, with 127% and 85% relative improvements, from 18.36% to 41.65% and from 39.84% to 73.79% in top-1 and top-5 accuracy, respectively. The networks are thus capable of exploiting background hints for recognition much better than human beings. On the contrary, humans classify images mainly based on the visual contents of the foreground objects.

4.2 Cross Evaluation
To study the difference in visual patterns learned by the different networks, we perform a cross evaluation, i.e., applying each trained network to each testing set. Results are summarized in Table 3.

We find that the transferring ability of each network is limited, since a model cannot obtain satisfying performance when the training and testing data follow different distributions. For example, using FGNet to predict OrigSet leads to a 27.40% absolute drop (45.05% relative) in top-1 accuracy, while using OrigNet to predict FGSet leads to a 7.46% drop (12.82% relative) in top-1 accuracy. We conjecture that FGNet stores very little information on contexts and is thus confused by the background contents of OrigSet. On the other hand, OrigNet has the ability to recognize contexts, but this ability is wasted on FGSet.

[Figure 2 shows two plots of the per-class-averaged accuracy (top-1 accuracy on the left, top-5 accuracy on the right) against the ratio of the bounding box w.r.t. the whole image, for OrigNet, FGNet, BGNet and HybridNet.]

Figure 2: Classification accuracy with respect to the foreground ratio of the testing images. The number at, say, 0.3 represents the testing accuracy on the set of all images with a foreground ratio no greater than 30%. Best viewed in color.

4.3 Diagnosis
We conduct diagnostic experiments to study the properties of the different networks and to fully understand their behaviors. Specifically, we report the classification accuracy of each network with respect to the foreground ratio of the testing images.

We split each testing dataset into 10 subsets, each of which contains all images with a foreground ratio no greater than a fixed value. Results are shown in Figure 2. BGNet achieves higher classification accuracy on images with a relatively small foreground ratio, while the other three networks prefer a large object ratio, since in those cases the foreground information is primarily what is learned for recognition. Furthermore, when the foreground ratio grows large, e.g., greater than 80%, the performance gap among OrigNet, FGNet and HybridNet becomes smaller.
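The split used in Figure 2 can be reproduced with a few lines. Here the foreground ratio of an image is the area of the smallest frame enclosing its ground-truth boxes divided by the image area; the function name and the cumulative-threshold convention follow our reading of the figure rather than released code.

    import numpy as np

    def accuracy_by_ratio(ratios, correct, labels,
                          thresholds=np.arange(0.1, 1.01, 0.1)):
        """ratios: per-image foreground ratio; correct: per-image 0/1 top-1 hits;
        labels: per-image class ids. Returns, for each threshold t, the accuracy
        averaged over classes, computed on images with ratio <= t."""
        ratios, correct, labels = map(np.asarray, (ratios, correct, labels))
        curve = []
        for t in thresholds:
            keep = ratios <= t
            per_class = [correct[keep & (labels == c)].mean()
                         for c in np.unique(labels[keep])]
            curve.append(float(np.mean(per_class)) if per_class else float("nan"))
        return curve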


Network             Guided           Unguided
OrigNet             58.19%, 80.96%   58.19%, 80.96%
BGNet               14.41%, 29.62%   8.30%, 20.60%
FGNet               60.82%, 83.43%   40.71%, 64.12%
HybridNet           61.29%, 83.85%   45.58%, 70.22%
FGNet+BGNet         61.75%, 83.88%   41.83%, 65.32%
HybridNet+BGNet     62.52%, 84.53%   48.08%, 72.69%
HybridNet+OrigNet   65.63%, 86.69%   60.36%, 82.47%

Table 4: Classification accuracy (top-1, top-5) of different network combinations. Note that we feed the entire image into OrigNet regardless of whether the ground-truth bounding box(es) are given, to keep the testing phase consistent with the training of OrigNet; therefore the reported results of OrigNet are identical under the guided and unguided conditions. To integrate the results from several networks, we compute a weighted sum of the responses on the fc-8 layer.

4.4 Visualization

In this part, we visualize the networks to see how different networks learn different visual patterns. We adopt a very straightforward visualization method [Wang et al., 2015a], which takes a trained network and reference images as input.

We visualize the most significant responses of the neurons on the conv-5 layer. The conv-5 layer is composed of 256 filter response maps, each of which has 13 × 13 spatial positions. After all 50,000 reference images are processed, we obtain 13² × 50,000 responses for each of the 256 filters. We pick the neurons with the highest responses and trace back to obtain their receptive fields on the input image. In this way, we can discover the visual patterns that best describe the concept each filter has learned. For diversity, we choose at most one patch (the one with the highest response score) from each reference image.
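A sketch of this procedure is given below: for each of the 256 conv-5 filters we take at most one position per reference image (the strongest one), rank images by that response, and map the winning 13 × 13 position back to an approximate patch in the input. The stride of 16 and receptive field of 163 pixels are the standard AlexNet conv-5 values, assumed here rather than quoted from the paper.

    import numpy as np

    STRIDE, RF = 16, 163   # assumed AlexNet conv-5 geometry

    def top_patches(conv5, k=4):
        """conv5: (N, 256, 13, 13) responses over N reference images.
        Returns, for each filter, k (image_index, box) pairs with the highest
        responses, using at most one patch per image for diversity."""
        n = conv5.shape[0]
        flat = conv5.reshape(n, 256, -1)
        best_val = flat.max(axis=2)                 # (N, 256) strongest response
        best_pos = flat.argmax(axis=2)              # flat index into the 13 x 13 map
        patches = []
        for f in range(256):
            top_imgs = np.argsort(-best_val[:, f])[:k]
            boxes = []
            for i in top_imgs:
                y, x = divmod(int(best_pos[i, f]), 13)
                # approximate receptive field of this conv-5 position
                boxes.append((int(i), (x * STRIDE, y * STRIDE,
                                       x * STRIDE + RF, y * STRIDE + RF)))
            patches.append(boxes)
        return patches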

Figure 3 shows the visualization results using FGNet on FGSet, BGNet on BGSet and OrigNet on OrigSet, respectively. We observe quite different visual patterns learned by these networks. The visual patterns learned by FGNet are often very specific to certain object categories, such as a patch of a dog face (filter 5) or the front of a shop (filter 11). These visual patterns correspond to visual attributes that are vital for recognition. In contrast, each visual concept learned by BGNet tends to appear in many different object categories, for instance, the outdoor-scene patch (filter 8) shared by the jetty, viaduct, space shuttle, etc. Such visual patterns are often found in the context, which plays an assisting role in object recognition. As for OrigNet, the learned patterns can be either specific objects or scenes shared across categories.

To summarize, FGNet and BGNet learn different visual patterns that can be combined to assist visual recognition. In Section 5 we quantitatively demonstrate the effectiveness of combining this information for better recognition performance.

5 Combination
We first show that recognition accuracy can be significantly boosted by using ground-truth bounding box(es) at the testing stage. Next, with the help of the EdgeBox algorithm [Zitnick and Dollar, 2014] to generate accurate object proposals, we improve recognition performance without requiring ground-truth annotations. We refer to these as guided and unguided combination, respectively.

5.1 Guided vs. Unguided Combination
We start by describing the guided and unguided manners of model combination. For simplicity, we adopt a linear combination over different models, i.e., forwarding several networks and computing a weighted sum of the responses on the fc-8 layer.

If the ground-truth bounding box is provided (the guided condition), we use it to divide the testing image into foreground and background regions. We then feed the foreground regions into FGNet or HybridNet and the background regions into BGNet, and fuse the neuron responses at the final stage.
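A minimal sketch of the guided fusion is given below, assuming hypothetical callables fc8_fg and fc8_bg that return the fc-8 responses of a foreground network (FGNet or HybridNet) and of BGNet; the fusion weights are illustrative, since the paper specifies a weighted sum but not the weights themselves.

    import numpy as np

    def guided_predict(image, box, fc8_fg, fc8_bg, w_fg=1.0, w_bg=0.5):
        """Split a test image by its ground-truth box, forward each region
        through the matching network, and fuse the fc-8 responses linearly."""
        x1, y1, x2, y2 = box
        foreground = image[y1:y2, x1:x2]                   # -> FGNet / HybridNet
        background = image.copy()
        background[y1:y2, x1:x2] = 0                       # -> BGNet
        fused = w_fg * fc8_fg(foreground) + w_bg * fc8_bg(background)
        return int(np.argmax(fused))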

Furthermore, we also explore combining multiple networks in an unguided manner. As we will see in Sec 5.2, a reliable bounding box helps a lot in object recognition. Motivated by this, we use an efficient and effective algorithm, EdgeBox, to generate a set of candidate bounding-box proposals for each testing image, and then feed the foreground and background regions of the top proposals into the neural networks as described above.

To begin with, we verify that the EdgeBox proposals capture the ground-truth objects well. After extracting the top-k proposals with EdgeBox, we count a ground-truth object as detected if at least one proposal has an IoU of no less than 0.7 with it. The resulting cumulative distribution function (CDF) is plotted in Figure 4. Considering efficiency as well as accuracy, we choose the top-100 proposals, which give a recall of around 81%, to feed the foreground and background regions into the trained networks. After obtaining the 100 outputs of each network, we average the fc-8 responses for classification.
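The recall statistic behind Figure 4 amounts to the following computation; iou and recall_at_k are our own helper names, and gt_boxes/proposals are assumed to hold one ground-truth frame and a ranked EdgeBox list per validation image.

    import numpy as np

    def iou(a, b):
        """Intersection over Union of two (x1, y1, x2, y2) boxes."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / float(area(a) + area(b) - inter)

    def recall_at_k(gt_boxes, proposals, k=100, threshold=0.7):
        """Fraction of images whose ground truth is matched by any of the
        top-k proposals with IoU >= threshold (about 81% at k = 100)."""
        hits = [any(iou(gt, p) >= threshold for p in props[:k])
                for gt, props in zip(gt_boxes, proposals)]
        return float(np.mean(hits))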

5.2 Combination Results and Discussion
Results of the different combinations are summarized in Table 4. Under either the guided or the unguided setting, combining multiple networks boosts recognition performance, which verifies that the different visual patterns captured by different networks complement each other for object recognition.

Taking a closer look at the accuracy gains under the unguided condition: the combination HybridNet+BGNet outperforms HybridNet by 2.50% and 2.47% in terms of top-1 and top-5 accuracy, which are noticeable gains. FGNet+BGNet improves classification accuracy by 1.12% and 1.20% over FGNet, which is promising. Surprisingly, the combination of HybridNet with OrigNet still improves over OrigNet by 2.17% and 1.51%. We hypothesize that the combination is capable of implicitly inferring where the objects are, since the visual patterns of HybridNet are learned from images with object spatial information.


[Figure 3 shows, for several conv-5 filters (e.g., filters 5 and 11 of FGNet, filters 3 and 8 of BGNet, filters 17 and 21 of OrigNet), the image patches with the highest responses, annotated with their categories.]

Figure 3: Patch visualization of FGNet on FGSet (left), BGNet on BGSet (middle) and OrigNet on OrigSet (right). Each row corresponds to one filter on the conv-5 layer, and each patch is selected from 13² × 50,000 candidates as the one with the highest response on that kernel. Best viewed in color.

[Figure 4 plots the percentage of detected ground truth (recall) against the number of top-k proposals.]

Figure 4: EdgeBox statistics on the ILSVRC2012 validation set, showing the cumulative distribution function of detected ground-truth boxes with respect to the top-k proposals. Here, we set the Intersection over Union (IoU) threshold of the EdgeBox algorithm to 0.7.

One may conjecture that the performance improvement comes from an ensemble effect, which is not necessarily the case considering that: 1) the object proposals are not accurate enough; and 2) data augmentation (5 crops and 5 flips) is already applied to OrigNet, so the improvement is complementary to data augmentation. Moreover, we quantitatively verify that the improvements do not come from simple data augmentation by reporting the results of OrigNet averaged over 100 densely sampled patches (50 crops and the corresponding 50 flips, 227 × 227 × 3, referred to as OrigNet100) instead of the default (5 crops and 5 flips) setting. The top-1 and top-5 accuracies of OrigNet100 are 58.08% and 81.05%, very close to the original 58.19% and 80.96%, which suggests that the effect of data augmentation with 100 patches is negligible. By contrast, HybridNet+OrigNet100 reaches 60.80% and 82.59%, significantly higher than OrigNet100 alone, which reveals that HybridNet brings in benefits that cannot be achieved by data augmentation. These improvements are particularly promising considering that the networks do not know where the objects are under the unguided condition. Notice that the results under the unguided condition cannot surpass those under the guided condition, arguably because the top-100 proposals are not accurate enough to capture the exact ground-truth regions, so BGNet cannot make high-confidence predictions.

The guided way of testing, which provides an accurate separation of foreground from background, works better than the unguided way by a large margin, as expected, and consistent improvements are found after combination with BGNet. It is also worth noting that the combination of HybridNet with OrigNet improves the OrigNet baseline by a significant margin of 7.44% and 5.73%. These large gains are reasonable given the networks' ability to infer object locations, having been trained with accurate bounding box(es).

6 Conclusions and Future Work
In this work, we first demonstrate the surprising finding that neural networks can predict object categories quite well even when the object is not present. This motivates us to study human recognition performance on the foreground (with objects) and the background (without objects). We show on the 127-class version of ILSVRC2012 that human beings beat neural networks on foreground object recognition, while performing much worse when predicting the object category from the background alone. We then show that explicitly combining the visual patterns learned by different networks helps the recognition task. We argue that more emphasis should be placed on the role of context in object detection and recognition.

In the future, we will investigate an end-to-end training approach that explicitly separates and then combines the foreground and background information, so as to exploit the visual contents to the full extent. For instance, inspired by joint learning strategies such as Faster R-CNN [Ren et al., 2015], we can design a structure that predicts object proposals in an intermediate stage, learns the foreground and background regions derived from the proposals with two separate sub-networks, and then takes both foreground and background features into consideration.

Acknowledgments
This work is supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DoI/IBC) contract number D16PC00007. We thank the anonymous reviewers and the JHU CCVL members for their valuable and constructive suggestions, which have improved this work.


References
[Bewley and Upcroft, 2017] A. Bewley and B. Upcroft. Background appearance modeling with applications to visual object detection in an open-pit mine. JFR, 2017.
[Cao et al., 2015] C. Cao, X. Liu, Y. Yang, Y. Yu, J. Wang, Z. Wang, Y. Huang, L. Wang, C. Huang, and W. Xu. Look and Think Twice: Capturing Top-Down Visual Attention with Feedback Convolutional Neural Networks. ICCV, 2015.
[Csurka et al., 2004] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray. Visual Categorization with Bags of Keypoints. Workshop on ECCV, 2004.
[Dalal and Triggs, 2005] N. Dalal and B. Triggs. Histograms of Oriented Gradients for Human Detection. CVPR, 2005.
[Deng et al., 2009] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. CVPR, 2009.
[Doersch et al., 2014] C. Doersch, A. Gupta, and A.A. Efros. Context as Supervisory Signal: Discovering Objects with Predictable Context. ECCV, 2014.
[Fang et al., 2015] Y. Fang, J. Xie, G. Dai, M. Wang, F. Zhu, T. Xu, and E. Wong. 3D Deep Shape Descriptor. CVPR, 2015.
[Griffin et al., 2007] G. Griffin, A. Holub, and P. Perona. Caltech-256 Object Category Dataset. Technical Report CNS-TR-2007-001, Caltech, 2007.
[Han et al., 2015] J. Han, D. Zhang, X. Hu, L. Guo, J. Ren, and F. Wu. Background Prior-Based Salient Object Detection via Deep Reconstruction Residual. TCSVT, 2015.
[He et al., 2016a] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. CVPR, 2016.
[He et al., 2016b] K. He, X. Zhang, S. Ren, and J. Sun. Identity Mappings in Deep Residual Networks. ECCV, 2016.
[Huh et al., 2016] M. Huh, P. Agrawal, and A.A. Efros. What Makes ImageNet Good for Transfer Learning? arXiv:1608.08614, 2016.
[Ioffe and Szegedy, 2015] S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML, 2015.
[Jia et al., 2014] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. CAFFE: Convolutional Architecture for Fast Feature Embedding. ACM MM, 2014.
[Krizhevsky et al., 2012] A. Krizhevsky, I. Sutskever, and G.E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS, 2012.
[Lazebnik et al., 2006] S. Lazebnik, C. Schmid, and J. Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. CVPR, 2006.
[LeCun et al., 1990] Y. LeCun, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, and L.D. Jackel. Handwritten Digit Recognition with a Back-Propagation Network. NIPS, 1990.
[Lin et al., 2015] T. Lin, A. RoyChowdhury, and S. Maji. Bilinear CNN Models for Fine-Grained Visual Recognition. ICCV, 2015.
[Lowe, 2004] D.G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV, 2004.
[Nair and Hinton, 2010] V. Nair and G.E. Hinton. Rectified Linear Units Improve Restricted Boltzmann Machines. ICML, 2010.
[Nilsback and Zisserman, 2008] M.E. Nilsback and A. Zisserman. Automated Flower Classification over a Large Number of Classes. ICVGIP, 2008.
[Perronnin et al., 2010] F. Perronnin, J. Sanchez, and T. Mensink. Improving the Fisher Kernel for Large-Scale Image Classification. ECCV, 2010.
[Ren et al., 2015] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS, 2015.
[Russakovsky et al., 2015] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A.C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, pages 1–42, 2015.
[Sanderson and Lovell, 2009] C. Sanderson and B.C. Lovell. Multi-Region Probabilistic Histograms for Robust and Scalable Identity Inference. ICB, 2009.
[Shelhamer et al., 2016] E. Shelhamer, J. Long, and T. Darrell. Fully Convolutional Networks for Semantic Segmentation. TPAMI, 2016.
[Simonyan and Zisserman, 2014] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR, 2014.
[Szegedy et al., 2015] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going Deeper with Convolutions. CVPR, 2015.
[Vedaldi and Lenc, 2015] A. Vedaldi and K. Lenc. MatConvNet: Convolutional Neural Networks for MATLAB. ACM MM, 2015.
[Wah et al., 2011] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, Caltech, 2011.
[Wang et al., 2010] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-Constrained Linear Coding for Image Classification. CVPR, 2010.
[Wang et al., 2015a] J. Wang, Z. Zhang, V. Premachandran, and A. Yuille. Discovering Internal Representations from Object-CNNs Using Population Encoding. arXiv:1511.06855, 2015.
[Wang et al., 2015b] X. Wang, Z. Zhu, C. Yao, and X. Bai. Relaxed Multiple-Instance SVM with Application to Object Discovery. ICCV, 2015.
[Xiao et al., 2010] J. Xiao, J. Hays, K.A. Ehinger, A. Oliva, and A. Torralba. SUN Database: Large-Scale Scene Recognition from Abbey to Zoo. CVPR, 2010.
[Xie et al., 2017] L. Xie, J. Wang, W. Lin, B. Zhang, and Q. Tian. Towards Reversal-Invariant Image Representation. IJCV, 2017.
[Zeiler and Fergus, 2014] M.D. Zeiler and R. Fergus. Visualizing and Understanding Convolutional Networks. ECCV, 2014.
[Zhou et al., 2015] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Object Detectors Emerge in Deep Scene CNNs. ICLR, 2015.
[Zhu et al., 2016] Z. Zhu, X. Wang, S. Bai, C. Yao, and X. Bai. Deep Learning Representation Using Autoencoder for 3D Shape Retrieval. Neurocomputing, 2016.
[Zitnick and Dollar, 2014] C.L. Zitnick and P. Dollar. Edge Boxes: Locating Object Proposals from Edges. ECCV, 2014.
