
Object-Level Context Modeling For Scene Classification with Context-CNN

Syed Ashar Javed^1* and Anil Kumar Nelakanti^2*
^1 IIIT Hyderabad, ^2 Amazon

{[email protected],[email protected]}

1. Introduction

The task of classifying a scene requires assimilating complex, inter-connected information about the objects in it and the context surrounding their presence. Although deep CNN-based models provide a decent baseline for scene classification, vanilla CNNs are, by design, not suited to capturing contextual knowledge such as the complex interactions of objects in a scene. More sophisticated approaches from the recent literature either involve multiple networks with large numbers of parameters trained for weeks, or models whose components are learned separately, which limits the effectiveness of the complete system because these components must then be fused.

In this work, we propose the Context-CNN model, which encodes object-level context using object proposals and LSTM units on top of a CNN that extracts deep image features. This architecture attempts to bridge the semantic gap in scenes by modeling object-object and scene-object relationships within an easily implementable, end-to-end trained system.

Our model builds on work from before the deep learning era, in which context was explicitly modeled in the form of semantic context (object co-occurrence), spatial context and scale context [2]. Unlike those approaches, our model can take into account the semantic context of a set of objects rather than a pair, does not involve separate classifier-probability and context-probability terms that are difficult to fuse, and is learned end-to-end. We benchmark the model on the LSUN dataset [3], which contains 10 million images across 10 categories. The Context-CNN model achieves an accuracy of 89.03% on the validation set, which makes it one of the top performing models on this dataset, and it uses only 2% of the dataset to converge to this score. We also compare our base network with variations of our model designed to verify the source of the performance gain over vanilla CNNs. Additionally, we analyse the CNN and LSTM features and perform experiments to highlight the context modeling capacity and the discriminative capacity of the model.

For more experiments, see the complete paper here: https://arxiv.org/pdf/1705.04358.pdf

Figure 1. Context-CNN model architecture

2. Context-CNN model

Our model (see Figure 1) uses a pre-trained VGG16 network to extract CNN features. The input images are fixed at 512 × 512 and the last convolutional layer produces feature maps of size 32 × 32. Bounding boxes are extracted using edge boxes [4], and the feature maps of these object boxes are passed through an RoI pooling layer [1] to generate a fixed-size 7 × 7 output per feature map. These object vectors are fed to two stacked layers of LSTM units, one object per time step, in decreasing order of proposal confidence score. The outputs of all time steps are concatenated to build the final feature vector, which is fed into the dense layers and then through a softmax layer for prediction. A shortened functional form of the LSTM unit can be summarised as:

(c_t, h_t) = LSTM(x_t, h_{t-1}, c_{t-1}, W)    (1)

Thus, with each passing time step, the LSTM reads in an individual object feature vector and updates its memory. This memory helps the model capture scene context by relating the objects occurring in a given scene and distinguishing it from other scenes. The discriminative capacity of the network improves as the LSTM receives more information with increasing time steps.
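To make the pipeline concrete, the following is a minimal PyTorch-style sketch of this forward pass, not the authors' implementation: the hidden size, the number of proposals per image and the classifier widths are illustrative assumptions, and the edge-box proposals and their confidence scores are assumed to be computed outside the model.

```python
import torch
import torch.nn as nn
import torchvision
from torchvision.ops import roi_pool


class ContextCNN(nn.Module):
    """Rough sketch of the Context-CNN forward pass described above."""

    def __init__(self, num_classes=10, num_boxes=10, hidden=512):
        super().__init__()
        vgg = torchvision.models.vgg16(pretrained=True)
        # Drop the final max-pool so a 512x512 input yields 32x32 feature maps,
        # as stated in the text.
        self.backbone = nn.Sequential(*list(vgg.features.children())[:-1])
        # Two stacked LSTM layers read one RoI-pooled object vector per time step.
        self.lstm = nn.LSTM(input_size=512 * 7 * 7, hidden_size=hidden,
                            num_layers=2, batch_first=True)
        # Outputs of all time steps are concatenated before the dense layers.
        self.classifier = nn.Sequential(
            nn.Linear(num_boxes * hidden, 1024), nn.ReLU(),
            nn.Linear(1024, num_classes))
        self.num_boxes = num_boxes

    def forward(self, image, boxes, scores):
        # image: (1, 3, 512, 512); boxes: (K, 4) edge-box proposals in image
        # coordinates; scores: (K,) proposal confidences, K >= num_boxes assumed.
        fmap = self.backbone(image)                              # (1, 512, 32, 32)
        order = scores.argsort(descending=True)[: self.num_boxes]
        rois = roi_pool(fmap, [boxes[order]], output_size=(7, 7),
                        spatial_scale=32.0 / 512.0)              # (T, 512, 7, 7)
        seq = rois.flatten(1).unsqueeze(0)                       # (1, T, 512*7*7)
        out, _ = self.lstm(seq)                                  # (1, T, hidden)
        return self.classifier(out.flatten(1))                   # logits; softmax in the loss
```

Feeding the pooled object vectors as a sequence, highest-confidence proposal first, is what lets the recurrent memory accumulate object-level context before the concatenated states are classified.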

A part of this work was done while the authors were at Cube26 data science lab, New Delhi.


Figure 2. Analysis through obscuration: the object bounding boxes are systematically blacked out one by one before the image is passed through the model, and the resulting softmax distribution is compared with that of the base model. The blacked-out bounding box that most adversely affects the softmax activation of the correct class is shown.

Figure 3. Model comparison: (a) is the base Context-CNN model. (b) shows the first variation, with the output of the LSTM taken only from the last time step. (c) shows the second variation, with the LSTM units replaced by dense units. (d) is a VGG16 network.

3. Experiments & results

We train and test our model on the LSUN dataset. The best performing variant of our model achieves an accuracy of 89.03%, which is among the best results for this dataset.

Method                        Accuracy (%)
SIAT MMLAB                    91.61
Google                        91.20
SJTU-ReadSense (ensemble)     90.43
TEG Rangers (ensemble)        88.70
Our model                     89.03

Table 1. Evaluation on the LSUN dataset

We further compare the base model against three other variations, as shown in Figure 3. The 1st variation highlights the importance of the information obtained from the high-confidence-score objects fed in at earlier time steps. The 2nd variation highlights the difference in performance between fully connected units and LSTM units. The 3rd is simply a VGG16 model for comparison. Note that even though both VGG16 and Context-CNN share the same convolution layers, our model outperforms a VGG16 network by 5.6% with 8 million fewer parameters.

4. Analysis and visualisation

We visualise features obtained from the CNN and compare them with features obtained from various time steps of the LSTM using t-SNE (see Figure 4).
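As a rough illustration of how such a comparison can be produced, the sketch below runs scikit-learn's t-SNE over per-box feature vectors and colours the embeddings by scene class; it is not the authors' plotting code, and the random stand-in arrays only mark where the RoI-pooled CNN vectors and the LSTM outputs at selected time steps would go.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt


def plot_tsne_panels(feature_sets, labels, titles):
    # feature_sets: list of (N, D) arrays, e.g. RoI-pooled CNN vectors and
    # LSTM outputs collected at time steps 1, 5 and 10; labels: (N,) class IDs.
    fig, axes = plt.subplots(1, len(feature_sets), figsize=(4 * len(feature_sets), 4))
    for ax, feats, title in zip(np.atleast_1d(axes), feature_sets, titles):
        emb = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(feats)
        ax.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=4)
        ax.set_title(title)
    plt.show()


# Stand-in random features; real vectors would be collected from the model.
panels = [np.random.randn(500, 64).astype(np.float32) for _ in range(4)]
plot_tsne_panels(panels, np.random.randint(0, 10, 500),
                 ["RoI features", "LSTM t=1", "LSTM t=5", "LSTM t=10"])
```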

Figure 4. t-SNE visualisation: In (a), each data point is a CNN feature vector of a single bounding box obtained from the RoI pooling layer. (b), (c) and (d) show the output feature vectors from the 1st, 5th and 10th time steps of the LSTM, respectively. (See Figure 2 for the names of all classes and their IDs.) The plot clearly shows how the discriminative ability of the features of the object bounding boxes changes across the CNN and the various time steps of the LSTM.

Model Variation                      Accuracy (%)
Context-CNN base model               89.03
Context-CNN with last time step      87.34
Context-CNN with LSTM replaced       85.47
VGG16                                83.41

Table 2. Model comparison

We also use occlusion to evaluate the significance of objects in the scene (Figure 2). The importance of a bounding box is measured by the reduction in the softmax score of the correct class when the bounding box is obscured and the corresponding object occluded. The most significant bounding box is the one that leads to the maximum reduction in the softmax score.
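A minimal sketch of this obscuration procedure is given below; the interface follows the hypothetical ContextCNN class sketched earlier and is not the authors' code.

```python
import torch


def most_significant_box(model, image, boxes, scores, true_class):
    """Black out each proposal in turn and measure the drop in the softmax
    score of the correct class; the box causing the largest drop is returned."""
    with torch.no_grad():
        base = torch.softmax(model(image, boxes, scores), dim=1)[0, true_class]
        drops = []
        for box in boxes:
            x1, y1, x2, y2 = box.long().tolist()
            occluded = image.clone()
            occluded[:, :, y1:y2, x1:x2] = 0.0   # obscure the object region
            score = torch.softmax(model(occluded, boxes, scores), dim=1)[0, true_class]
            drops.append((base - score).item())
    return boxes[drops.index(max(drops))]
```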

References

[1] R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.

[2] D. Parikh, C. L. Zitnick, and T. Chen. From appearance to context-based recognition: Dense labeling in small images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008.

[3] F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.

[4] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In ECCV, 2014.