Do Deep Features Generalize From Everyday Objects to Remote


  • Do Deep Features Generalize from Everyday Objects to Remote Sensing and Aerial Scenes Domains?

    Otavio A. B. Penatti
    Advanced Technologies Group, SAMSUNG Research Institute
    Campinas, SP, 13097-160, [email protected]

    Keiller Nogueira, Jefersson A. dos Santos
    Department of Computer Science, Universidade Federal de Minas Gerais
    Belo Horizonte, MG, 31270-010, Brazil



    In this paper, we evaluate the generalization power of deep features (ConvNets) in two new scenarios: aerial and remote sensing image classification. We experimentally evaluate ConvNets trained to recognize everyday objects on the classification of aerial and remote sensing images. ConvNets obtained the best results for aerial images, while for remote sensing they performed well but were outperformed by low-level color descriptors, such as BIC. We also present a correlation analysis, showing the potential for combining/fusing different ConvNets with other descriptors or even for combining multiple ConvNets. A preliminary set of experiments fusing ConvNets obtains state-of-the-art results for the well-known UCMerced dataset.

    1. Introduction

    The recent impressive results of deep learning methods in computer vision applications brought fresh air to the research and industrial communities. We have observed real improvements in several applications, such as image classification, object and scene recognition, face recognition, image retrieval, and many others.

    Deep learning for computer vision is usually associated with the learning of features using an architecture of connected layers and neural networks. These architectures are usually called Convolutional (Neural) Networks or ConvNets. Before deep learning attracted the attention of the community in recent years, the most common feature descriptors were shallow and involved no machine learning during feature extraction. Common visual descriptors, which are still interesting alternatives for feature extraction, are mid-level representations (bags of visual words, BoVW) and global low-level color and texture descriptors (e.g., GIST [24], color histograms, and BIC [9]). BoVW descriptors are somewhat of a step in the direction of feature learning, as the visual codebook is usually learned for the dataset of interest. Global descriptors, however, have a pre-defined algorithm for extracting the image feature vector, independently of the dataset to be processed. They tend to be less precise, but they are usually fast to compute.
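    The contrast between the two descriptor families can be sketched in a few lines of Python. This is a minimal illustration only, not the implementation of any descriptor evaluated in the paper: the function names `color_histogram` and `bovw_histogram`, the RGB quantization scheme, and the hard nearest-word assignment are all assumptions made for the example. The global descriptor depends only on the image itself, whereas the BoVW descriptor needs a codebook that would typically be learned (for instance by clustering) from the dataset of interest and is simply passed in here.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each channel in 0..255


def color_histogram(pixels: Sequence[Pixel], bins_per_channel: int = 4) -> List[float]:
    """Global descriptor: a fixed procedure that never looks at other images."""
    if not pixels:
        raise ValueError("image has no pixels")
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        # Quantize each channel into bins_per_channel uniform bins.
        rb = r * bins_per_channel // 256
        gb = g * bins_per_channel // 256
        bb = b * bins_per_channel // 256
        hist[(rb * bins_per_channel + gb) * bins_per_channel + bb] += 1.0
    return [count / len(pixels) for count in hist]


def bovw_histogram(
    local_features: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> List[float]:
    """Mid-level descriptor: depends on a codebook learned from a dataset.

    Each local feature is assigned to its nearest visual word (hard
    assignment, squared Euclidean distance); the histogram is normalized.
    """
    if not local_features or not codebook:
        raise ValueError("need at least one local feature and one visual word")
    hist = [0.0] * len(codebook)
    for feat in local_features:
        distances = [sum((f - w) ** 2 for f, w in zip(feat, word)) for word in codebook]
        hist[distances.index(min(distances))] += 1.0
    return [count / len(local_features) for count in hist]
```

    Swapping the codebook changes the BoVW output for the same image, while the color histogram stays the same regardless of the dataset.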

    ConvNets have shown astounding results even on datasets whose characteristics differ from those of the data on which they were trained, feeding the theory that deep features are able to generalize from one dataset to another. Successful ConvNets freely available in the literature are OverFeat and Caffe, which were originally trained to recognize the 1,000 object categories of ImageNet [28, 18]. OverFeat, for instance, has already been shown to work remarkably well in applications like flower categorization, human attribute detection, bird sub-categorization, and scene retrieval. In [28], Razavian et al. suggest that features obtained from deep learning should be the primary candidate in most visual recognition tasks.

    The use of deep learning for remote sensing is rapidly growing. A considerable number of works have appeared very recently proposing deep strategies for spatial and spectral feature learning. Even so, to the best of our knowledge, there is still no evaluation of pre-trained ConvNets in the aerial and remote sensing domains. Therefore, this paper adds two more domains in which pre-trained ConvNets, like OverFeat and Caffe, are evaluated and compared with existing image descriptors.

    In this paper, besides evaluating ConvNets in a different domain, we also perform an evaluation of several other image descriptors, including simple low-level descriptors and mid-level representations. The evaluation is based on the classification of aerial image scenes and of remote sensing images, the latter aiming at differentiating coffee and non-coffee crop tiles. We also conduct a correlation analysis in order to identify the most promising descriptors for selection/fusion. The correlation analysis also covers different ConvNets.
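    To make the idea of a correlation analysis concrete, the sketch below compares two classifiers by how often they succeed and fail on the same samples. It is an illustration under stated assumptions, not the measure used in the paper's experiments: this section does not name the measure, so the example uses the phi coefficient (Pearson correlation of two binary hit/miss vectors), and the function names `correctness` and `correlation` are hypothetical. A value near 1 means the two classifiers make the same mistakes, while lower values suggest complementary behavior and therefore more potential for fusion.

```python
from math import sqrt
from typing import List, Sequence


def correctness(predictions: Sequence[int], labels: Sequence[int]) -> List[int]:
    """Return 1 where the prediction matches the label, 0 elsewhere."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    return [1 if p == y else 0 for p, y in zip(predictions, labels)]


def correlation(hits_a: Sequence[int], hits_b: Sequence[int]) -> float:
    """Phi coefficient between two hit/miss vectors (assumed measure)."""
    if len(hits_a) != len(hits_b):
        raise ValueError("hit vectors must have the same length")
    both = sum(1 for a, b in zip(hits_a, hits_b) if a == 1 and b == 1)
    only_a = sum(1 for a, b in zip(hits_a, hits_b) if a == 1 and b == 0)
    only_b = sum(1 for a, b in zip(hits_a, hits_b) if a == 0 and b == 1)
    neither = sum(1 for a, b in zip(hits_a, hits_b) if a == 0 and b == 0)
    denominator = sqrt(
        (both + only_a) * (only_b + neither) * (both + only_b) * (only_a + neither)
    )
    if denominator == 0:
        # Undefined when a classifier is always right or always wrong.
        return 0.0
    return (both * neither - only_a * only_b) / denominator
```

    In practice the hit/miss vectors would come from classifiers trained on different descriptors and evaluated on the same validation samples.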

    We can summarize the main contributions of this paper as follows:


    • evaluation of the generalization power of ConvNets from everyday objects to the aerial and remote sensing domains,

    • comparative evaluation of global descriptors, BoVW descriptors, and ConvNets,

    • correlation analysis among different ConvNets and among different descriptors.

    On top of that, we performed preliminary experiments for fusing ConvNets and obtained state-of-the-art results for the classification of aerial images using the UCMerced dataset. For the remote sensing domain, we created a new dataset, which is publicly released.

    The remainder of this paper is organized as follows. Section 2 presents related work. The ConvNets and the descriptors evaluated in this paper are presented in Section 3. The experimental setup and datasets are presented in Section 4. In Section 5, we present and discuss the results obtained. Finally, Section 6 concludes the paper.

    2. Related Work

    Ever since the remote sensing (RS) community, motivated by access to high spatial resolution data, started using more than pixel information for classification, the study of algorithms for spatial information extraction has been a hot research topic. Although many descriptors have been proposed or successfully used for RS image processing [42, 11, 3], some applications require more specific description techniques. As an example, low-level descriptors that are very successful in computer vision applications do not yield suitable results for coffee crop classification, as shown in [12]. Still, the general conclusion is that ordinary descriptors can achieve suitable results in most applications. However, higher accuracy rates are yielded by combinations of complementary descriptors that exploit late fusion learning techniques. In this context, frameworks have been proposed for the selection of spatial descriptors in order to learn the best algorithms for each application [10, 7, 14, 35]. In [10], the authors analyzed the effectiveness and the correlation of different low-level descriptors at multiple segmentation scales. They also proposed a methodology to select a subset of complementary descriptors for combination. In [14], Faria et al. proposed a new method for selecting descriptors and pattern classifiers based on rank aggregation approaches. Cheriyadat [7] proposed a feature learning strategy based on Sparse Coding. The strategy learns features on well-known datasets from the literature and uses them for the detection of buildings in larger image sets. Tokarczyk et al. [35] proposed a boosting-based approach for the selection of low-level features for very-high resolution semantic classification.
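    As a rough illustration of what late fusion means, the sketch below combines the outputs of several classifiers, each assumed to be trained on a different descriptor, by plain majority voting. This is a deliberately simple stand-in and not the method of any of the cited frameworks, which rely on more elaborate selection and learning strategies; the function name `majority_vote` and the example labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def majority_vote(predictions_per_classifier: Sequence[Sequence[Hashable]]) -> List[Hashable]:
    """Fuse classifier outputs sample by sample.

    predictions_per_classifier[i][j] is the label that classifier i
    assigned to sample j. Ties go to the label seen first, that is,
    the one proposed by the earliest classifier in the list.
    """
    if not predictions_per_classifier:
        raise ValueError("need at least one classifier")
    n_samples = len(predictions_per_classifier[0])
    if any(len(p) != n_samples for p in predictions_per_classifier):
        raise ValueError("all classifiers must predict the same number of samples")
    fused = []
    for votes in zip(*predictions_per_classifier):
        fused.append(Counter(votes).most_common(1)[0][0])
    return fused


# Three hypothetical classifiers voting on three tiles.
fused_labels = majority_vote([
    ["coffee", "coffee", "other"],
    ["coffee", "other", "other"],
    ["other", "other", "coffee"],
])
```

    Voting helps most when the individual classifiers err on different samples, which is why complementarity between descriptors matters for fusion.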

    Artificial Neural Networks have been used for RS classification for a long time [2]. But, as in the computer vision community, their massive use is recent and chiefly motivated by the study of deep learning-based approaches that aim at the development of powerful application-oriented descriptors. Many works have been proposed to learn spatial feature descriptors [15, 17, 45]. Firat et al. [15] proposed a method based on ConvNets for object detection in high-resolution remote sensing images. Hung et al. [17] applied ConvNets to learn features and detect invasive weeds. Zhang et al. [45] proposed a deep feature learning strategy that exploits salience filtering as a pre-processing step. In [41], the authors presented an approach to learn features in Synthetic Aperture Radar (SAR) images. Moreover, the deep learning boom has been seen as a golden opportunity for developing effective hyperspectral and spatio-spectral feature descriptors [29, 23, 6, 36].

    In the computer vision community, the release of pre-trained ConvNets, like OverFeat [31] and Caffe [18], led to their evaluation in applications different from the ones they were trained for. In [28], for instance, a ConvNet trained for recognizing 1,000 object categories has shown very good results even in applications like bird sub-categorization, scene retrieval, human attribute detection, and others, which are considerably different from everyday object recognition. Those facts raised the issue of the generality of the features computed by ConvNets.

    In this paper, we go in this direction of evaluating pre-trained ConvNets in different domains. It is worth mentioning that, to the best of our knowledge, there is no other work in the literature that evaluates the feasibility of using deep features from general computer vision datasets in remote sensing applications. In addition, no other work in the literature has evaluated the complementarity of deep features aiming at fusion or classifier ensembles.

    3. Feature Descriptors

    In this section, we describe the ConvNets, low-level (global), and mid-level (BoVW) descriptors we have used. The descriptors were selected for evaluation mainly based on previous works [42, 11, 37, 26, 44, 12], in which they were evaluated for remote sensing image classification, texture and color image retrieval/classification, and web image retrieval. Besides the evaluation of ConvNets, we also selected a set of other types of descriptors. Our selection includes simple global descriptors, like descriptors based on color histograms and variations, and also descriptors based on bags of visual words (BoVW).

    3.1. Convolutional Networks

    In this section, we provide details about the ConvNets used in this work, which are OverFeat [31] and Caffe [18].

  • OverFeat [31] is a deep learning framework focused on ConvNets. It is implemented in C++ and trained with the Torch7 package¹. OverFeat was trained on t