
Compositional Convolutional Neural Networks: A Deep Architecture with Innate Robustness to Partial Occlusion

Anonymous CVPR submission

Paper ID 1120

Abstract

Recent work has shown that deep convolutional neural networks (DCNNs) do not generalize well under partial occlusion. Inspired by the success of compositional models at classifying partially occluded objects, we propose to integrate compositional models and DCNNs into a unified deep model with innate robustness to partial occlusion. We term this architecture Compositional Convolutional Neural Network. In particular, we propose to replace the fully connected classification head of a DCNN with a differentiable compositional model. The generative nature of the compositional model enables it to localize occluders and subsequently focus on the non-occluded parts of the object. We conduct classification experiments on artificially occluded images as well as real images of partially occluded objects from the MS-COCO dataset. The results show that DCNNs do not classify occluded objects robustly, even when trained with data that is strongly augmented with partial occlusions. Our proposed model outperforms standard DCNNs by a large margin at classifying partially occluded objects, even when it has not been exposed to occluded objects during training. Additional experiments demonstrate that CompositionalNets can also localize the occluders accurately, despite being trained with class labels only.

1. Introduction

Advances in the architecture design of deep convolutional neural networks (DCNNs) [14, 18, 8] have increased the performance of computer vision systems at image classification enormously. However, recent findings [31, 11] have shown that such deep models are significantly less robust at classifying artificially occluded objects compared to humans. Furthermore, our experiments show that DCNNs do not classify real images of partially occluded objects robustly. Thus, our findings and those of related works [31, 11] point out a fundamental limitation of DCNNs in terms of generalization under partial occlusion, which needs to be addressed.

Figure 1: Partially occluded cars from the MS-COCO dataset [17] that are misclassified by a standard DCNN but correctly classified by the proposed CompositionalNet. Intuitively, a CompositionalNet can localize the occluders (occlusion scores on the right) and subsequently focus on the non-occluded parts of the object to classify the image.

One approach to overcome this limitation is to use data augmentation in terms of partial occlusion [4, 28]. However, our experimental results show that after training with augmented data the performance of DCNNs at classifying partially occluded objects still remains substantially worse compared to the classification of non-occluded objects.

A number of works showed that compositional models can robustly classify partially occluded 2D patterns [7, 10, 23, 30]. Kortylewski et al. [11] proposed dictionary-based compositional models, a generative model of neural feature activations that can classify images of partially occluded 3D objects more robustly than DCNNs. However, their results also showed that their model is significantly less discriminative at classifying non-occluded objects compared to DCNNs.

In this work, we propose to integrate compositional models and DCNNs into a unified deep model with innate robustness to partial occlusion. In particular, we propose to replace the fully-connected classification head of a DCNN with a compositional layer that is regularized to be fully generative in terms of the neural feature activations of the last convolutional layer. The generative property of the compositional layer enables the network to localize occluders in an image and subsequently focus on the non-occluded parts of the object in order to classify the image robustly. We term this novel deep architecture Compositional Convolutional Neural Network (CompositionalNet). Figure 1 illustrates the robustness of CompositionalNets at classifying partially occluded objects, while also being able to localize occluders in an image. In particular, it shows several images of cars that are occluded by other objects. Next to these images, we show occlusion scores that illustrate the position of occluders as estimated by the CompositionalNet. Note how the occluders are accurately localized in the occlusion maps despite having highly complex shapes and appearances.

Our extensive experiments demonstrate that the proposed CompositionalNet outperforms related approaches by a large margin at classifying partially occluded objects, even when it has not been exposed to occluded objects during training. When trained with data augmentation in terms of partial occlusion the performance increases further. In addition, we perform qualitative and quantitative experiments that demonstrate the ability of CompositionalNets to localize occluders accurately, despite being trained with class labels only. We make several important contributions in this paper:

1. We propose a differentiable compositional model that is generative in terms of the feature activations of a DCNN. This enables us to integrate compositional models and deep networks into compositional convolutional neural networks, a unified deep model with innate robustness to partial occlusion.

2. While previous works [30, 11, 27, 31] evaluate robustness to partial occlusion on artificially occluded images only, we also evaluate on real images of partially occluded objects from the MS-COCO dataset. We demonstrate that CompositionalNets achieve state-of-the-art results at classifying partially occluded objects under artificial and real occlusion.

3. To the best of our knowledge, we are the first to study the task of localizing occluders in an image, and we show that CompositionalNets outperform dictionary-based compositional models [11] substantially.

2. Related Work

Classification under partial occlusion. Recent work [31, 11] has shown that current deep architectures are significantly less robust to partial occlusion compared to humans. Fawzi and Frossard [5] showed that DCNNs are vulnerable to partial occlusion simulated by masking small patches of the input image. Related works [4, 28] have proposed to augment the training data with partial occlusion by masking out patches from the image during training. However, our experimental results in Section 4 show that such data augmentation approaches have only limited effects on the robustness of a DCNN to partial occlusion. A possible explanation is the difficulty of simulating occlusion due to the large variability of occluders in terms of appearance and shape. Xiao et al. [27] proposed TDAPNet, a deep network with an attention mechanism that masks out occluded features in lower layers to increase the robustness of the classification against occlusion. Our results show that this model does not perform well on images with real occlusion. In contrast to deep learning approaches, generative compositional models [9, 32, 6, 2, 13] have been shown to be inherently robust to partial occlusion when augmented with a robust occlusion model [10]. Such models have been successfully applied for detecting partially occluded object parts [23, 30] and for recognizing 2D patterns under partial occlusion [7, 12].

Combining compositional models and DCNNs. Liao et al. [16] proposed to integrate compositionality into DCNNs by regularizing the feature representations of DCNNs to cluster during learning. Their qualitative results show that the resulting feature clusters resemble part-like detectors. Zhang et al. [29] demonstrated that part detectors emerge in DCNNs by restricting the activations in feature maps to have a localized distribution. However, these approaches have not been shown to enhance the robustness of deep models to partial occlusion. Related works proposed to regularize the convolution kernels to be sparse [20], or to force feature activations to be disentangled for different objects [19]. As the compositional model is not explicit but rather implicitly encoded within the parameters of the DCNNs, the resulting models remain black-box DCNNs that are not robust to partial occlusion. A number of works [15, 21, 22] use differentiable graphical models to integrate part-whole compositions into DCNNs. However, these models are purely discriminative and thus are also deep networks with no internal mechanism to account for partial occlusion. Kortylewski et al. [11] proposed to learn generative dictionary-based compositional models from the features of a DCNN. They use their compositional model as a "backup" to an independently trained DCNN when the DCNN's classification score falls below a certain threshold.

In this work, we propose to integrate generative compositional models and DCNNs into a unified model that is inherently robust to partial occlusion. In particular, we propose to replace the fully connected classification head with a differentiable compositional model. We train the model parameters with backpropagation, while regularizing the compositional model to be generative in terms of the neural feature activations of the last convolution layer. Our proposed model significantly outperforms related approaches at classifying partially occluded objects while also being able to localize occluders accurately.

3. Compositional Convolutional Neural Nets

In Section 3.1, we introduce a fully generative compositional model and discuss how it can be integrated with DCNNs in an end-to-end system in Section 3.2.

3.1. Fully Generative Compositional Models

We denote a feature map $F^l \in \mathbb{R}^{H \times W \times D}$ as the output of a layer $l$ in a DCNN, with $D$ being the number of channels. A feature vector $f^l_p \in \mathbb{R}^D$ is the vector of features in $F^l$ at position $p$ on the 2D lattice $\mathcal{P}$ of the feature map. In the remainder of this section we omit the superscript $l$ for notational clarity because it is fixed a priori.

We propose a differentiable generative compositional model of the feature activations $p(F|y)$ for an object class $y$. This is different from dictionary-based compositional models [11], which learn a model $p(B|y)$, where $B$ is a non-differentiable binary approximation of $F$. In contrast, we model the real-valued feature activations $p(F|y)$ as a mixture of von Mises-Fisher (vMF) distributions:

$$p(F|\theta_y) = \prod_p p(f_p|A_{p,y}, \Lambda), \quad (1)$$

$$p(f_p|A_{p,y}, \Lambda) = \sum_k \alpha_{p,k,y}\, p(f_p|\lambda_k), \quad (2)$$

where $\theta_y = \{A_y, \Lambda\}$ are the model parameters and $A_y = \{A_{p,y}\}$ are the parameters of the mixture models at every position $p \in \mathcal{P}$ on the 2D lattice of the feature map $F$. In particular, $A_{p,y} = \{\alpha_{p,0,y}, \dots, \alpha_{p,K,y} \,|\, \sum_{k=0}^{K} \alpha_{p,k,y} = 1\}$ are the mixture coefficients, $K$ is the number of mixture components and $\Lambda = \{\lambda_k = \{\sigma_k, \mu_k\} \,|\, k = 1, \dots, K\}$ are the parameters of the vMF distribution:

$$p(f_p|\lambda_k) = \frac{e^{\sigma_k \mu_k^T f_p}}{Z(\sigma_k)}, \quad \|f_p\| = 1, \ \|\mu_k\| = 1, \quad (3)$$

where $Z(\sigma_k)$ is the normalization constant. The parameters of the vMF distribution $\Lambda$ can be learned by iterating between vMF clustering of the feature vectors of all training images and maximum likelihood parameter estimation [1] until convergence. After training, the vMF cluster centers $\{\mu_k\}$ will resemble feature activation patterns that frequently occur in the training data. Interestingly, feature vectors that are similar to one of the vMF cluster centers are often induced by image patches that are similar in appearance and often even share semantic meanings (see Supplementary A). This property was also observed in a number of related works that used clustering in the neural feature space [24, 16, 23].
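The alternation between clustering and maximum likelihood estimation described above can be sketched as a spherical-k-means-style loop. This is a minimal sketch, not the authors' implementation; `features`, `num_clusters` and `n_iters` are assumed names:

```python
import torch
import torch.nn.functional as nnf


def learn_vmf_centers(features, num_clusters=512, n_iters=20):
    """features: (N, D) tensor of feature vectors f_p pooled over positions and training images."""
    features = nnf.normalize(features, dim=1)                # enforce ||f_p|| = 1
    idx = torch.randperm(features.shape[0])[:num_clusters]
    mu = features[idx].clone()                               # initial centers, ||mu_k|| = 1
    for _ in range(n_iters):
        assign = (features @ mu.t()).argmax(dim=1)           # hard assignment by cosine similarity
        for k in range(num_clusters):
            members = features[assign == k]
            if len(members) > 0:                             # keep the old center if the cluster is empty
                mu[k] = nnf.normalize(members.mean(dim=0), dim=0)   # ML estimate of the mean direction
    return mu
```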

The mixture coefficients $\alpha_{p,k,y}$ can also be learned with maximum likelihood estimation from the training images. They describe the expected activation of a cluster center $\mu_k$ at a position $p$ in a feature map $F$ for a class $y$. Note that the spatial information from the image is preserved in the feature maps. Hence, our proposed vMF model (Equation 1) intuitively describes the expected spatial activation pattern of parts in an image for a given class $y$ - e.g., where the tires of a car are expected to be located in an image. In Section 3.2, we discuss how the maximum likelihood estimation of the parameters $\theta_y$ can be integrated into a loss function and optimized with backpropagation.

Mixture of compositional models. The model in Equation 1 assumes that the 3D pose of an object is approximately constant in images. This is a common assumption of generative models that represent objects in image space. We can represent 3D objects with a generalized model using mixtures of compositional models as proposed in [11]:

$$p(F|\Theta_y) = \sum_m \nu^m p(F|\theta^m_y), \quad (4)$$

with $\mathcal{V} = \{\nu^m \in \{0,1\}, \sum_m \nu^m = 1\}$ and $\Theta_y = \{\theta^m_y, m = 1, \dots, M\}$. Here $M$ is the number of mixtures of compositional models and $\nu^m$ is a binary assignment variable that indicates which mixture component is active. Intuitively, each mixture component $m$ will represent a different viewpoint of an object (see Supplementary B). The parameters of the mixture components $\{A^m_y\}$ need to be learned in an EM-type manner by iterating between estimating the assignment variables $\mathcal{V}$ and maximum likelihood estimation of $\{A^m_y\}$. We discuss how this process can be performed in a neural network in Section 3.2.
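Because the assignment variables $\nu^m$ are one-hot, Equation 4 effectively selects the viewpoint component that explains the feature map best. A minimal sketch of this selection step, with placeholder per-mixture log-likelihoods:

```python
import torch

# Assumed per-mixture log-likelihoods log p(F | theta_y^m) for M = 4 viewpoint components.
log_lik_per_mixture = torch.tensor([-310.2, -295.7, -301.4, -322.9])
m_star = int(log_lik_per_mixture.argmax())        # active component, i.e. nu_{m*} = 1
log_p_F_given_y = log_lik_per_mixture[m_star]     # model log-likelihood for class y
```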

Occlusion modeling. Following the approach presented in [10], compositional models can be augmented with an occlusion model. The intuition behind an occlusion model is that at each position $p$ in the image either the object model $p(f_p|A^m_{p,y}, \Lambda)$ or an occluder model $p(f_p|\beta, \Lambda)$ is active:

$$p(F|\theta^m_y, \beta) = \prod_p p(f_p, z^m_p{=}0)^{1-z^m_p}\, p(f_p, z^m_p{=}1)^{z^m_p}, \quad (5)$$

$$p(f_p, z^m_p{=}1) = p(f_p|\beta, \Lambda)\, p(z^m_p{=}1), \quad (6)$$

$$p(f_p, z^m_p{=}0) = p(f_p|A^m_{p,y}, \Lambda)\, (1 - p(z^m_p{=}1)). \quad (7)$$

The binary variables $Z^m = \{z^m_p \in \{0,1\} \,|\, p \in \mathcal{P}\}$ indicate if the object is occluded at position $p$ for mixture component $m$. The occlusion prior $p(z^m_p{=}1)$ is fixed a priori. Related works [10, 11] use a single occluder model. We instead use a mixture of several occluder models that are learned in an unsupervised manner:
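A small sketch of how Equations 5-7 act per position: whichever of the object and occluder branches has the higher likelihood is selected, which implicitly infers the occlusion variable $z_p$. The likelihood maps below are placeholder tensors, not trained quantities:

```python
import torch

H, W = 14, 14
obj_lik = torch.rand(H, W)                 # p(f_p | A_{p,y}^m, Lambda) * (1 - p(z_p=1)), assumed values
occ_lik = torch.rand(H, W)                 # p(f_p | beta, Lambda) * p(z_p=1), assumed values
z = (occ_lik > obj_lik).float()            # inferred occlusion map Z^m
# log of Eq. 5 with the inferred z: occluded positions use the occluder branch
log_lik = ((1 - z) * obj_lik.log() + z * occ_lik.log()).sum()
```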

$$p(f_p|\beta, \Lambda) = \prod_n p(f_p|\beta_n, \Lambda)^{\tau_n} \quad (8)$$

$$= \prod_n \Big(\sum_k \beta_{n,k}\, p(f_p|\sigma_k, \mu_k)\Big)^{\tau_n}, \quad (9)$$


Figure 2: Feed-forward inference with a CompositionalNet. A DCNN backbone is used to extract the feature map $F$, followed by a convolution with the vMF kernels $\{\mu_k\}$ and a non-linear vMF activation function $\mathcal{N}(\cdot)$. The resulting vMF likelihood $L$ is used to compute the occlusion likelihood $O$ using the occluder kernels $\{\beta_n\}$. Furthermore, $L$ is used to compute the mixture likelihoods $\{E^m_y\}$ using the mixture models $\{A^m_y\}$. $O$ and $\{E^m_y\}$ compete in explaining $L$ (red box) and are combined to compute an occlusion-robust score $\{s^m_y\}$. The binary occlusion maps $\{Z^m_y\}$ indicate which positions in $L$ are occluded. The final class score is computed as $s_y = \max_m s^m_y$ and the occlusion map $Z_y$ is selected accordingly.

where $\{\tau_n \in \{0,1\}, \sum_n \tau_n = 1\}$ indicates which occluder model explains the data best. The parameters of the occluder models $\beta_n$ are learned from clustered features of random natural images that do not contain any object of interest (see Supplementary C). Note that the model parameters $\beta$ are independent of the position $p$ in the feature map and thus the model has no spatial structure. Hence, the mixture coefficients $\beta_{n,k}$ intuitively describe the expected activation of $\mu_k$ anywhere in natural images.

Inference as feed-forward neural network. The computational graph of our proposed fully generative compositional model is a directed acyclic graph. Hence, we can perform inference in a single forward pass, as illustrated in Figure 2.

We use a standard DCNN backbone to extract a feature representation $F = \psi(I, \omega) \in \mathbb{R}^{H \times W \times D}$ from the input image $I$, where $\omega$ are the parameters of the feature extractor. The vMF likelihood function $p(f_p|\lambda_k)$ (Equation 3) is composed of two operations: an inner product $i_{p,k} = \mu_k^T f_p$ and a non-linear transformation $\mathcal{N} = \exp(\sigma_k i_{p,k}) / Z(\sigma_k)$. Since $\mu_k$ is independent of the position $p$, computing $i_{p,k}$ is equivalent to a $1 \times 1$ convolution of $F$ with $\mu_k$. Hence, the vMF likelihood can be computed by:

$$L = \{\mathcal{N}(F \ast \mu_k) \,|\, k = 1, \dots, K\} \in \mathbb{R}^{H \times W \times K} \quad (10)$$
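Equation 10 can be realized with standard tensor operations. A minimal sketch in PyTorch layout; the feature tensor, the $\mu_k$ kernels and the $\sigma$ value are illustrative stand-ins rather than trained parameters:

```python
import torch
import torch.nn.functional as nnf

D, K, H, W = 512, 512, 14, 14
feat = nnf.normalize(torch.randn(1, D, H, W), dim=1)   # backbone features psi(I, omega), ||f_p|| = 1
mu = nnf.normalize(torch.randn(K, D), dim=1)           # vMF kernels {mu_k}
sigma = 30.0                                           # fixed vMF variance (cf. Section 4)

ip = nnf.conv2d(feat, mu.view(K, D, 1, 1))             # inner products i_{p,k} as a 1x1 convolution
L = torch.exp(sigma * ip)                              # vMF likelihoods, up to the constant 1/Z(sigma)
```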

(Figure 2, yellow tensor). The mixture likelihoods $p(f_p|A^m_{p,y}, \Lambda)$ (Equation 2) are computed for every position $p$ as a dot product between the mixture coefficients $A^m_{p,y}$ and the corresponding vector $l_p \in \mathbb{R}^K$ from the likelihood tensor:

$$E^m_y = \{l_p^T A^m_{p,y} \,|\, \forall p \in \mathcal{P}\} \in \mathbb{R}^{H \times W}, \quad (11)$$

(Figure 2, blue planes). Similarly, the occlusion likelihood can be computed as $O = \{\max_n l_p^T \beta_n \,|\, \forall p \in \mathcal{P}\} \in \mathbb{R}^{H \times W}$ (Figure 2, red plane). Together, the occlusion likelihood $O$ and the mixture likelihoods $\{E^m_y\}$ are used to estimate the overall likelihood of the individual mixtures as $s^m_y = p(F|\theta^m_y, \beta) = \sum_p \max(E^m_{p,y}, O_p)$. The final model likelihood is computed as $s_y = p(F|\Theta_y) = \max_m s^m_y$ and the final occlusion map is selected accordingly as $Z_y = Z^m_y \in \mathbb{R}^{H \times W}$, where $m = \operatorname{argmax}_m s^m_y$.
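The remaining steps of the forward pass (Equation 11, the occlusion likelihood $O$ and the scores $s^m_y$, $s_y$) amount to per-position dot products, a per-position max and a spatial sum. A minimal sketch with randomly initialized stand-ins for the trained parameters $A^m_{p,y}$ and $\beta_n$:

```python
import torch

K, H, W, M, N = 512, 14, 14, 4, 5
L = torch.rand(K, H, W)                        # vMF likelihood tensor from Eq. 10
A = torch.rand(M, K, H, W).softmax(dim=1)      # mixture coefficients A_{p,y}^m (sum over k equals 1)
beta = torch.rand(N, K).softmax(dim=1)         # occluder coefficients beta_n

E = (A * L.unsqueeze(0)).sum(dim=1)                           # mixture likelihoods E_y^m, shape (M, H, W)
O = torch.einsum('nk,khw->nhw', beta, L).max(dim=0).values    # occlusion likelihood O, shape (H, W)
s_m = torch.maximum(E, O.unsqueeze(0)).sum(dim=(1, 2))        # s_y^m = sum_p max(E_{p,y}^m, O_p)
s_y = s_m.max()                                               # final class score s_y
Z = (O.unsqueeze(0) > E).float()                              # occlusion maps Z_y^m
```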

3.2. End-to-end Training of CompositionalNets

We integrate our compositional model with DCNNs into Compositional Convolutional Neural Networks (CompositionalNets) by replacing the classical fully connected classification head with a compositional model head, as illustrated in Figure 2. The model is fully differentiable and can be trained end-to-end using backpropagation. The trainable parameters of a CompositionalNet are $T = \{\omega, \Lambda, A_y\}$. We optimize those parameters jointly using stochastic gradient descent. The loss function is composed of four terms:

$$\mathcal{L}(y, y', F, T) = \mathcal{L}_{class}(y, y') + \gamma_1 \mathcal{L}_{weight}(\omega) \quad (12)$$
$$+ \gamma_2 \mathcal{L}_{vmf}(F, \Lambda) + \gamma_3 \mathcal{L}_{mix}(F, A_y). \quad (13)$$

$\mathcal{L}_{class}(y, y')$ is the cross-entropy loss between the network output $y'$ and the true class label $y$. $\mathcal{L}_{weight} = \|\omega\|_2^2$ is a weight regularization on the DCNN parameters. $\mathcal{L}_{vmf}$ and $\mathcal{L}_{mix}$ regularize the parameters of the compositional model to have maximal likelihood for the features in $F$. $\{\gamma_1, \gamma_2, \gamma_3\}$ control the trade-off between the loss terms.
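A minimal sketch of how the four loss terms could be wired together; the $\gamma$ values follow Section 4, while `logits`, `omega_params` and the two regularizer values are assumed inputs:

```python
import torch
import torch.nn.functional as nnf

def compositionalnet_loss(logits, y, omega_params, loss_vmf, loss_mix,
                          gamma1=0.1, gamma2=5.0, gamma3=1.0):
    loss_class = nnf.cross_entropy(logits, y)                 # L_class(y, y')
    loss_weight = sum((w ** 2).sum() for w in omega_params)   # ||omega||_2^2
    return loss_class + gamma1 * loss_weight + gamma2 * loss_vmf + gamma3 * loss_mix
```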

The vMF cluster centers $\mu_k$ are learned by maximizing the vMF likelihoods (Equation 3) for the feature vectors $f_p$ in the training images. We keep the vMF variance $\sigma_k$ constant, which also reduces the normalization term $Z(\sigma_k)$ to a constant. We assume a hard assignment of the feature vectors $f_p$ to the vMF clusters during training. Hence, the free energy to be minimized for maximizing the vMF likelihood [25] is:

$$\mathcal{L}_{vmf}(F, \Lambda) = -\sum_p \max_k \log p(f_p|\mu_k) \quad (14)$$
$$= C \sum_p \min_k \mu_k^T f_p, \quad (15)$$
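A minimal sketch of the intent of Equation 14 under fixed $\sigma_k$ and hard assignments: each feature vector only contributes through its best-matching cluster center, so minimizing the loss pulls that center towards the feature. Constants are dropped since they only rescale gradients; `feat` and `mu` are assumed tensors:

```python
import torch
import torch.nn.functional as nnf

def vmf_loss(feat, mu):
    """feat: (P, D) feature vectors f_p; mu: (K, D) vMF cluster centers mu_k."""
    sim = nnf.normalize(feat, dim=1) @ nnf.normalize(mu, dim=1).t()   # mu_k^T f_p, shape (P, K)
    return -sim.max(dim=1).values.sum()   # -sum_p max_k mu_k^T f_p (log-likelihood up to constants)
```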

where $C$ is a constant. Intuitively, this loss encourages the cluster centers $\mu_k$ to be similar to the feature vectors $f_p$.

In order to learn the mixture coefficients $A^m_y$ we need to maximize the model likelihood (Equation 4). We can avoid an iterative EM-type learning procedure by making use of the fact that the mixture assignment $\nu^m$ and the occlusion variables $z_p$ have been inferred in the forward inference process. Furthermore, the parameters of the occluder model are learned a priori and then fixed. Hence, the energy to be minimized for learning the mixture coefficients is:

$$\mathcal{L}_{mix}(F, A_y) = -\sum_p (1 - z_p^{\uparrow}) \log\Big[\sum_k \alpha^{m\uparrow}_{p,k,y}\, p(f_p|\lambda_k)\Big] \quad (16)$$

Here, $z_p^{\uparrow}$ and $m^{\uparrow}$ denote the variables that were inferred in the forward process (Figure 2).
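A minimal sketch of Equation 16: only positions inferred as non-occluded contribute, and only the mixture component selected in the forward pass is used (the shapes and the small epsilon are assumptions):

```python
import torch

def mixture_loss(L, A_y, z, m_star, eps=1e-8):
    """L: (K, H, W) vMF likelihoods; A_y: (M, K, H, W) mixture coefficients of class y;
    z: (H, W) occlusion map inferred in the forward pass; m_star: selected mixture index."""
    lik = (A_y[m_star] * L).sum(dim=0)                  # sum_k alpha_{p,k,y}^{m} p(f_p | lambda_k)
    return -((1.0 - z) * torch.log(lik + eps)).sum()    # non-occluded positions only
```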

4. Experiments

We perform experiments at the tasks of classifying partially occluded objects and at occluder localization.

Datasets. For evaluation we use the Occluded-Vehicles dataset as proposed in [24] and extended in [11]. The dataset consists of images and corresponding segmentations of vehicles from the PASCAL3D+ dataset [26] that were synthetically occluded with four different types of occluders: segmented objects as well as patches with constant white color, random noise and textures (see Figure 5 for examples). The amount of partial occlusion of the object varies in four different levels: 0% (L0), 20-40% (L1), 40-60% (L2), 60-80% (L3).

Figure 3: Images from the Occluded-COCO-Vehicles dataset. Each row shows samples of one object class with increasing amount of partial occlusion: 20-40% (Level-1), 40-60% (Level-2), 60-80% (Level-3).

While it is reasonable to evaluate occlusion robustness by testing on artificially generated occlusions, it is necessary to study the performance of algorithms under realistic occlusion as well. Therefore, we introduce a dataset with images of real occlusions which we term Occluded-COCO-Vehicles. It consists of the same classes as the Occluded-Vehicles dataset. The images were generated by cropping out objects from the MS-COCO [17] dataset based on their bounding box. The objects are categorized into the four occlusion levels defined by the Occluded-Vehicles dataset based on the amount of the object that is visible in the image (using the segmentation masks available in both datasets). The number of test images per occlusion level are: 2036 (L0), 768 (L1), 306 (L2), 73 (L3). For training purposes, we define a separate training dataset of 2036 images from level L0. Figure 3 illustrates some example images from this dataset.
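A small sketch of how a crop could be assigned to one of the four occlusion levels from the visible fraction computed with the segmentation masks; the exact thresholds below, in particular treating less than 20% occlusion as L0, are an assumption for illustration:

```python
def occlusion_level(visible_area, full_object_area):
    """Bin an object crop into L0-L3 based on the fraction of the object that is occluded."""
    occluded = 1.0 - visible_area / full_object_area
    if occluded < 0.2:
        return 'L0'     # approximately non-occluded
    elif occluded < 0.4:
        return 'L1'     # 20-40%
    elif occluded < 0.6:
        return 'L2'     # 40-60%
    else:
        return 'L3'     # 60-80%
```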

Training setup. CompositionalNets are trained from the feature activations of a VGG-16 [18] model that is pre-trained on ImageNet [3]. We initialize the compositional model parameters $\{\mu_k\}, \{A_y\}$ using clustering as described in Section 3.1 and set the vMF variance to $\sigma_k = 30, \forall k \in \{1, \dots, K\}$. We train the three sets of model parameters $\{\omega, \{\mu_k\}, \{A_y\}\}$ using backpropagation. In order to reduce the memory consumption during training, we perform coordinate gradient descent and hence learn one set of variables for one epoch while keeping the other two fixed. We learn the parameters of $n = 5$ occluder models $\{\beta_1, \dots, \beta_n\}$ in an unsupervised manner as described in Section 3.1 and keep them fixed throughout the experiments. We set the number of mixture components to $M = 4$. The mixing weights of the loss are chosen to be: $\gamma_1 = 0.1$, $\gamma_2 = 5$, $\gamma_3 = 1$. We train for 60 epochs using stochastic gradient descent with momentum $r = 0.9$ and a learning rate of $lr = 0.01$. The code, models and data used for our experiments are publicly available¹.
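The coordinate gradient descent schedule described above could be sketched as follows; `model`, its attribute names and `train_one_epoch` are hypothetical handles, not the released code, and the hyperparameters are the ones listed in the paragraph:

```python
import torch

param_groups = {
    'omega': list(model.backbone.parameters()),   # DCNN parameters omega
    'mu': [model.vmf_kernels],                    # vMF kernels {mu_k}
    'A': [model.mixture_coeffs],                  # mixture coefficients {A_y}
}

for epoch in range(60):
    active = ['omega', 'mu', 'A'][epoch % 3]      # learn one parameter set per epoch, keep the others fixed
    optimizer = torch.optim.SGD(param_groups[active], lr=0.01, momentum=0.9)
    train_one_epoch(model, optimizer, gammas=(0.1, 5.0, 1.0))   # assumed training helper
```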

4.1. Classification under Partial Occlusion

PASCAL3D+. In Table 1 we compare our CompositionalNets to a VGG-16 network that was pre-trained on ImageNet and fine-tuned with the respective training data.

¹Website will be written here.


PASCAL3D+ Vehicles Classification under Occlusion

Occ. Area      L0: 0%   L1: 20-40%              L2: 40-60%              L3: 60-80%              Mean
Occ. Type      -        w     n     t     o     w     n     t     o     w     n     t     o
VGG            99.2     96.9  97.0  96.5  93.8  92.0  90.3  89.9  79.6  67.9  62.1  59.5  62.2  83.6
CoD [11]       92.1     92.7  92.3  91.7  92.3  87.4  89.5  88.7  90.6  70.2  80.3  76.9  87.1  87.1
VGG+CoD [11]   98.3     96.8  95.9  96.2  94.4  91.2  91.8  91.3  91.4  71.6  80.7  77.3  87.2  89.5
TDAPNet [27]   99.3     98.4  98.9  98.5  97.4  96.1  97.5  96.6  91.6  82.1  88.1  82.7  79.8  92.8
CompNet-p4     97.4     96.7  96.0  95.9  95.5  95.8  94.3  93.8  92.5  86.3  84.4  82.1  88.1  92.2
CompNet-p5     99.3     98.4  98.6  98.4  96.9  98.2  98.3  97.3  88.1  90.1  89.1  83.0  72.8  93.0
CompNet-Multi  99.3     98.6  98.6  98.8  97.9  98.4  98.4  97.8  94.6  91.7  90.7  86.7  88.4  95.4

Table 1: Classification results for vehicles of PASCAL3D+ with different levels of artificial occlusion (0%, 20-40%, 40-60%, 60-80% of the object are occluded) and different types of occlusion (w = white boxes, n = noise boxes, t = textured boxes, o = natural objects). CompositionalNets outperform related approaches significantly.

MS-COCO Vehicles Classification under Occlusion

Train Data: PASCAL3D+
Occ. Area    L0    L1    L2    L3    Avg
VGG          97.8  86.8  79.1  60.3  81.0
CoD          91.8  82.7  83.3  76.7  83.6
VGG+CoD      98.0  88.7  80.7  69.9  84.3
TDAPNet      98.0  88.5  85.0  74.0  86.4
CompNet-p4   96.6  91.8  85.6  76.7  87.7
CompNet-p5   98.2  89.1  84.3  78.1  87.5
CompNet-Mul  98.5  93.8  87.6  79.5  89.9

Train Data: MS-COCO
Occ. Area    L0    L1    L2    L3    Avg
VGG          99.1  88.7  78.8  63.0  82.4
CoD          -     -     -     -     -
VGG+CoD      -     -     -     -     -
TDAPNet      99.4  88.8  87.9  69.9  86.5
CompNet-p4   97.7  92.2  86.6  82.2  89.7
CompNet-p5   99.1  92.5  87.3  82.2  90.3
CompNet-Mul  99.4  95.3  90.9  86.3  93.0

Train Data: MS-COCO + CutOut
Occ. Area    L0    L1    L2    L3    Avg
VGG          99.3  90.9  87.5  75.3  88.3
CoD          -     -     -     -     -
VGG+CoD      -     -     -     -     -
TDAPNet      99.3  90.1  88.9  71.2  87.4
CompNet-p4   97.8  91.9  87.6  79.5  89.2
CompNet-p5   99.3  93.2  87.6  84.9  91.3
CompNet-Mul  99.4  95.2  90.5  86.3  92.9

Train Data: MS-COCO + CutPaste
Occ. Area    L0    L1    L2    L3    Avg
VGG          99.3  92.3  89.9  80.8  90.6
CoD          -     -     -     -     -
VGG+CoD      -     -     -     -     -
TDAPNet      98.1  89.2  90.5  79.5  89.3
CompNet-p4   98.3  93.8  88.6  84.9  91.4
CompNet-p5   99.4  93.9  90.6  90.4  93.5
CompNet-Mul  99.4  95.8  91.8  90.4  94.4

Table 2: Classification results for vehicles of MS-COCO with different levels of real occlusion (L0: 0%, L1: 20-40%, L2: 40-60%, L3: 60-80% of the object are occluded). The training data consists of images from PASCAL3D+, MS-COCO, as well as data from MS-COCO that was augmented with CutOut and CutPaste. CompositionalNets outperform related approaches in all test cases.

Furthermore, we compare to a dictionary-based compositional model (CoD) and a combination of both models (VGG+CoD) as reported in [11]. We also list the results of TDAPNet as reported in [27]. We report results of CompositionalNets learned from the pool4 and pool5 layers of the VGG-16 network respectively (CompNet-p4 & CompNet-p5), as well as a multi-layer CompositionalNet (CompNet-Multi) that is trained by combining the output of CompNet-p4 and CompNet-p5. In this setup, all models are trained with non-occluded images (L0), while at test time the models are exposed to images with different amounts of partial occlusion (L0-L3).

We observe that CompNet-p4 and CompNet-p5 outperform VGG-16, CoD, as well as the combination of both significantly. Note how the CompositionalNets are much more discriminative at level L0 compared to dictionary-based compositional models. While CompNet-p4 and CompNet-p5 perform on par with TDAPNet, CompNet-Multi outperforms TDAPNet significantly. We also observe that CompNet-p5 outperforms CompNet-p4 for low occlusions (L0 & L1) and for stronger occlusions if the occluders are rectangular masks. However, CompNet-p4 outperforms CompNet-p5 at strong occlusion (L2 & L3) when the occluders are objects. As argued by Xiao et al. [27], this could be attributed to the fact that occluders with more fine-grained shapes disturb the features in higher layers more severely.

MS-COCO. Table 2 shows classification results under a realistic occlusion scenario by testing on the Occluded-COCO-Vehicles dataset. The models in the first part of the table are trained on non-occluded images of the PASCAL3D+ data and evaluated on the MS-COCO data. While the performance drops for all models in this transfer learning setting, CompositionalNets still outperform the other approaches significantly. Note that the combination of the DCNN with the dictionary-based compositional model (VGG+CoD) has a high performance at low occlusion L0 & L1 but lower performance for L2 & L3 compared to CoD only.

The second part of the table (MS-COCO) shows the classification performance after fine-tuning on the L0 training set of the Occluded-COCO-Vehicles dataset. VGG-16 achieves a similar performance as for the artificial object occluders in Table 1. After fine-tuning, TDAPNet improves at level L0 and decreases on average for the levels L1-L3. Overall, it does not significantly benefit from fine-tuning with non-occluded images. The performance of the CompositionalNet increases substantially (p4: 3%, p5: 2.8%, multi: 3.1%) after fine-tuning.

The third and fourth parts of Table 2 (MS-COCO-CutOut & MS-COCO-CutPaste) show classification results after training with strong data augmentation in terms of partial occlusion. In particular, we use CutOut [4] regularization by masking out random square patches of size 70 pixels. Furthermore, we propose a stronger data augmentation method, CutPaste, which artificially occludes the training images in the Occluded-COCO-Vehicles dataset with all four types of artificial occluders used in the Occluded-Vehicles dataset. While data augmentation increases the performance of the VGG network, the model still suffers from strong occlusions and falls below the CompNet-Multi model that was only trained on non-occluded images. TDAPNet does not benefit from data augmentation as much as the VGG network. For CompositionalNets the performance increases further when trained with augmented data. Overall, CutOut augmentation does not have a large effect on the generalization performance of CompositionalNets, while the proposed CutPaste augmentation proves to be stronger. In particular, the CompNet-p5 architecture benefits strongly, possibly because the network learns to extract more reliable higher-level features under occlusion.

In summary, the classification experiments clearly highlight the robustness of CompositionalNets at classifying partially occluded objects, while also being highly discriminative when objects are not occluded. Overall, CompositionalNets significantly outperform dictionary-based compositional models and other neural network architectures at image classification under artificial as well as real occlusion in all three tested setups - at transfer learning between datasets, when trained with non-occluded images, and when trained with strongly augmented data.

4.2. Occlusion Localization

While it is important to classify partially occluded objects robustly, models should also be able to localize occluders in images accurately. This improves the explainability of the classification process and enables future research, e.g., for parsing scenes with mutually occluding objects. Therefore, we propose to test the ability of CompositionalNets and dictionary-based compositional models at occluder localization. We compute the occlusion score as the log-likelihood ratio between the occluder model and the object model: $\log\big(p(f_p|z^m_p{=}1) / p(f_p|z^m_p{=}0)\big)$, where $m = \operatorname{argmax}_m p(F|\theta^m_y)$ is the model that fits the data best.
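A minimal sketch of this occlusion score as a per-position log-likelihood ratio; the two likelihood maps are assumed to come from the best-fitting mixture $m$:

```python
import torch

def occlusion_score(occ_lik, obj_lik, eps=1e-8):
    """occ_lik, obj_lik: (H, W) maps of p(f_p | z_p^m=1) and p(f_p | z_p^m=0)."""
    return torch.log((occ_lik + eps) / (obj_lik + eps))   # positive where the occluder model wins
```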

Figure 4: ROC curves for occlusion localization with dictionary-based compositional models and the proposed CompositionalNets, averaged over all levels of partial occlusion (L1-L3). CompositionalNets significantly outperform dictionary-based compositional models.

Quantitative results. We study occluder localization quantitatively on the Occluded-Vehicles dataset using the ground truth segmentation masks of the occluders and the objects. Figure 4 shows the ROC curves of CompositionalNets (solid lines) and dictionary-based compositional models (dashed lines) when using the occlusion score for classifying each pixel as occluder or object at all occlusion levels L1-L3. The ROC curves show that for both models it is more difficult to localize textured occluders compared to white and noisy occluders. Furthermore, it is more difficult to localize natural object occluders compared to textured boxes, likely because of their fine-grained irregular shape. Overall, CompositionalNets significantly outperform dictionary-based compositional models. At a false acceptance rate of 0.2, the performance gain of CompositionalNets is: 12% (white), 19% (noise), 6% (texture) and 8% (objects).

Qualitative results. Figure 5 qualitatively compares the occluder localization abilities of dictionary-based compositional models and CompositionalNets. In particular, it shows images of real and artificial occlusions and the corresponding occlusion scores for all positions $p$ of the feature map $F$. Both models are learned from the pool4 feature maps of a VGG-16 network. We show more example images in Supplementary D. Note that we visualize the positive values of the occlusion score after median filtering for illustration purposes (see Supplementary E for unfiltered results). We observe that CompositionalNets can localize occluders significantly better compared to the dictionary-based compositional model for real as well as artificial occluders. In particular, it seems that dictionary-based compositional models often detect false positive occlusions. Note how the artificial occluders with white and noise texture are better localized by both models compared to the other occluder types.

Figure 5: Visualization of occlusion localization results. Each result consists of three images: the input image, the occlusion scores of a dictionary-based compositional model [11], and the occlusion scores of our proposed CompositionalNet. Note how our model can localize occluders with higher accuracy across different objects and occluder types, for real as well as for artificial occlusion.

In summary, our qualitative and quantitative occluder localization experiments clearly show that CompositionalNets can localize occluders more accurately compared to dictionary-based compositional models. Furthermore, we observe that localizing occluders with variable texture and shape is highly difficult, which could be addressed by developing advanced occlusion models.

5. Conclusion

In this work, we studied the problem of classifying partially occluded objects and localizing occluders in images. We found that a standard DCNN does not classify real images of partially occluded objects robustly, even when it has been exposed to severe occlusion during training. We proposed to resolve this problem by integrating compositional models and DCNNs into a unified model. In this context,

we made the following contributions:

Compositional Convolutional Neural Networks. We introduce CompositionalNets, a novel deep architecture with innate robustness to partial occlusion. In particular, we propose to replace the fully connected classification head of a DCNN backbone with a differentiable generative compositional model.

Robustness to partial occlusion. We demonstrate that CompositionalNets can classify partially occluded objects more robustly compared to a standard DCNN and other related approaches, while retaining high discriminative performance for non-occluded images. Furthermore, we show that CompositionalNets can also localize occluders in images accurately, despite being trained with class labels only.


References

[1] Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh, and Suvrit Sra. Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research, 6(Sep):1345-1382, 2005.

[2] Jifeng Dai, Yi Hong, Wenze Hu, Song-Chun Zhu, and Ying Nian Wu. Unsupervised learning of dictionaries of hierarchical compositional models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2505-2512, 2014.

[3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248-255. IEEE, 2009.

[4] Terrance DeVries and Graham W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.

[5] Alhussein Fawzi and Pascal Frossard. Measuring the effect of nuisance variables on classifiers. Technical report, 2016.

[6] Sanja Fidler, Marko Boben, and Ales Leonardis. Learning a hierarchical compositional shape vocabulary for multi-class object representation. arXiv preprint arXiv:1408.5516, 2014.

[7] Dileep George, Wolfgang Lehrach, Ken Kansky, Miguel Lazaro-Gredilla, Christopher Laan, Bhaskara Marthi, Xinghua Lou, Zhaoshi Meng, Yi Liu, Huayan Wang, et al. A generative vision model that trains with high data efficiency and breaks text-based CAPTCHAs. Science, 358(6368):eaag2612, 2017.

[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.

[9] Ya Jin and Stuart Geman. Context and hierarchy in a probabilistic image model. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 2, pages 2145-2152. IEEE, 2006.

[10] Adam Kortylewski. Model-based image analysis for forensic shoe print recognition. PhD thesis, University of Basel, 2017.

[11] Adam Kortylewski, Qing Liu, Huiyu Wang, Zhishuai Zhang, and Alan Yuille. Combining compositional models and deep networks for robust object classification under occlusion. arXiv preprint arXiv:1905.11826, 2019.

[12] Adam Kortylewski and Thomas Vetter. Probabilistic compositional active basis models for robust pattern recognition. In British Machine Vision Conference, 2016.

[13] Adam Kortylewski, Aleksander Wieczorek, Mario Wieser, Clemens Blumer, Sonali Parbhoo, Andreas Morel-Forster, Volker Roth, and Thomas Vetter. Greedy structure learning of hierarchical compositional models. arXiv preprint arXiv:1701.06171, 2017.

[14] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.

[15] Xilai Li, Xi Song, and Tianfu Wu. AOGNets: Compositional grammatical architectures for deep learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6220-6230, 2019.

[16] Renjie Liao, Alex Schwing, Richard Zemel, and Raquel Urtasun. Learning deep parsimonious representations. In Advances in Neural Information Processing Systems, pages 5076-5084, 2016.

[17] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740-755. Springer, 2014.

[18] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[19] Austin Stone, Huayan Wang, Michael Stark, Yi Liu, D. Scott Phoenix, and Dileep George. Teaching compositionality to CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5058-5067, 2017.

[20] Domen Tabernik, Matej Kristan, Jeremy L. Wyatt, and Ales Leonardis. Towards deep compositional networks. In 2016 23rd International Conference on Pattern Recognition (ICPR), pages 3470-3475. IEEE, 2016.

[21] Wei Tang, Pei Yu, and Ying Wu. Deeply learned compositional models for human pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 190-206, 2018.

[22] Wei Tang, Pei Yu, Jiahuan Zhou, and Ying Wu. Towards a unified compositional model for visual pattern modeling. In Proceedings of the IEEE International Conference on Computer Vision, pages 2784-2793, 2017.

[23] Jianyu Wang, Cihang Xie, Zhishuai Zhang, Jun Zhu, Lingxi Xie, and Alan Yuille. Detecting semantic parts on partially occluded objects. In British Machine Vision Conference, 2017.

[24] Jianyu Wang, Zhishuai Zhang, Cihang Xie, Vittal Premachandran, and Alan Yuille. Unsupervised learning of object semantic parts from internal states of CNNs by population encoding. arXiv preprint arXiv:1511.06855, 2015.

[25] Jianyu Wang, Zhishuai Zhang, Cihang Xie, Yuyin Zhou, Vittal Premachandran, Jun Zhu, Lingxi Xie, and Alan Yuille. Visual concepts and compositional voting. arXiv preprint arXiv:1711.04451, 2017.

[26] Yu Xiang, Roozbeh Mottaghi, and Silvio Savarese. Beyond PASCAL: A benchmark for 3D object detection in the wild. In IEEE Winter Conference on Applications of Computer Vision, pages 75-82. IEEE, 2014.


[27] Mingqing Xiao, Adam Kortylewski, Ruihai Wu, Siyuan Qiao, Wei Shen, and Alan Yuille. TDAPNet: Prototype network with recurrent top-down attention for robust object classification under partial occlusion. arXiv preprint arXiv:1909.03879, 2019.

[28] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. CutMix: Regularization strategy to train strong classifiers with localizable features. arXiv preprint arXiv:1905.04899, 2019.

[29] Quanshi Zhang, Ying Nian Wu, and Song-Chun Zhu. Interpretable convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8827-8836, 2018.

[30] Zhishuai Zhang, Cihang Xie, Jianyu Wang, Lingxi Xie, and Alan L. Yuille. DeepVoting: A robust and explainable deep network for semantic part detection under partial occlusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1372-1380, 2018.

[31] Hongru Zhu, Peng Tang, Jeongho Park, Soojin Park, and Alan Yuille. Robustness of object recognition under extreme occlusion in humans and computational models. CogSci Conference, 2019.

[32] Long Leo Zhu, Chenxi Lin, Haoda Huang, Yuanhao Chen, and Alan Yuille. Unsupervised structure learning: Hierarchical recursive composition, suspicious coincidence and competitive exclusion. In Computer Vision - ECCV 2008, pages 759-773. Springer, 2008.
