A Multi-Domain Feature Learning Method for Visual Place Recognition

Peng Yin1,∗, Lingyun Xu1, Xueqian Li3, Chen Yin4, Yingli Li1, Rangaprasad Arun Srivatsan3, Lu Li3, Jianmin Ji2,∗, Yuqing He1

Abstract—Visual Place Recognition (VPR) is an important component in both computer vision and robotics applications, thanks to its ability to determine whether a place has been visited and, if so, where. A major challenge in VPR is to handle changes in environmental conditions, including weather, season, and illumination. Most VPR methods try to improve place recognition performance by ignoring the environmental factors, leading to decreased accuracy when environmental conditions change significantly, such as day versus night. To this end, we propose an end-to-end conditional visual place recognition method. Specifically, we introduce the multi-domain feature learning method (MDFL) to capture multiple attribute descriptions for a given place, and then use a feature detaching module to separate the environmental condition-related features from those that are not. The only label required within this feature learning pipeline is the environmental condition. Evaluation of the proposed method is conducted on the multi-season NORDLAND dataset and the multi-weather GTAV dataset. Experimental results show that our method improves feature robustness against varying environmental conditions.

I. INTRODUCTION

In the last decade, the robotics community has achieved numerous breakthroughs in vision-based simultaneous localization and mapping (SLAM) [1] that have enhanced the navigation abilities of unmanned ground vehicles (UGV) and unmanned aerial vehicles (UAV) in complex environments. Visual place recognition (VPR) [2], or loop closure detection (LCD), helps robots find loop closures in the SLAM framework and is an essential element for accurate mapping and localization. Although many methods have been proposed in recent years, VPR is still a challenging problem under varying environmental conditions. Traditional VPR approaches that use handcrafted features to learn place descriptors for local scene description often fail to extract valid features when encountering significant changes [3] in environmental conditions, such as changes in season, weather, and illumination, as well as viewpoint.

This paper was supported by the National Natural Science Foundation of China (No. 61573386, No. 91748130, U1608253) and Guangdong Province Science and Technology Plan projects (No. 2017B010110011).

P. Yin, L. Xu, Y. Li and Y. He are with the State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang, and the University of Chinese Academy of Sciences, Beijing. (yinpeng, xulingyun, liyingli, [email protected]) J. Ji is with the School of Computer Science and Technology, University of Science and Technology of China, Hefei, Anhui. ([email protected]) X. Li, R.A. Srivatsan and L. Li are with the Biorobotics Lab, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA. (xueqianl, arangapr, [email protected]) Y. Chen is with the School of Computer Science, Beijing University of Posts and Telecommunications, Beijing. ([email protected])

(Corresponding author: Peng Yin, Jianmin Ji)

Fig. 1: The pipeline of our proposed conditional visual place recognition method. In summary, there are three core modules: 1) a CapsuleNet-based [4] feature extraction module responsible for extracting condition-related and condition-invariant features from the raw visual inputs; 2) a condition-enhanced feature separation module to further separate the condition-related features within the joint feature distribution; 3) a trajectory searching mechanism for finding the best matches based on the feature differences of query trajectory features.


Ideally, a place recognition method should be able to capture condition-invariant features for robust loop closure detection, since the appearance of scene objects (e.g., roads, terrain, and houses) is often highly related to environmental conditions, and each object has its own appearance distribution under different conditions. To the best of our knowledge, there are few VPR methods that have explored how to improve place recognition performance against varying environmental conditions [5]. A major drawback of these methods is that the change in environmental conditions affects the local features, resulting in decreased VPR accuracy. In this paper, we propose a condition-directed visual place recognition method to address this issue. Our work consists of two parts: feature extraction and feature separation.

Firstly, in the feature extraction step, we utilize a CapsuleNet-based network [4] to extract multi-domain place features, as shown in Fig. 1. Traditional convolutional neural networks (CNNs) are efficient in object detection, regression, and segmentation, but as pointed out by Hinton, the inner connections between objects are easily lost with deep convolutional and max pooling operations. For instance, in face detection tasks, even if the facial parts (nose, eyes, mouth, lips) are in an incorrect layout, a traditional CNN may still consider the image to be a human face, since it contains all the necessary features of a human face. This problem also exists in place recognition tasks, since different places may contain similar objects but with different arrangements. CapsuleNet uses a dynamic routing method to cluster the shallow convolutional layer features in an unsupervised way. In this paper, we demonstrate another application of CapsuleNet, namely capturing feature distributions under specific conditions.

The main contributions of this work can be summarized as follows:

• We propose the use of a CapsuleNet-based feature extraction module, and show its robustness in conditional feature learning for the visual place recognition task.

• We propose a feature separation method for the visual place recognition task, where features are indirectly separated based on the relationship between condition-related and condition-invariant features from an information-theoretic view.

The outline of the paper is as follows: Section II introduces related work on visual place recognition methods. Section III describes our conditional visual place recognition method, which has two components: feature extraction and feature separation. In Section IV, we evaluate the proposed method on two challenging datasets: the NORDLAND [6] dataset, which contains the same trajectory under multiple season conditions, and a GTAV dataset, which is generated on the same trajectory under different weather conditions in a game simulator. Finally, we provide concluding remarks in Section V. The linked video1 provides a visualization of the results of our method.

II. RELATED WORK

Visual place recognition (VPR) methods have been well studied in the past several years, and can be classified into two categories: feature-based and appearance-based. In feature-based VPR, descriptive features are transformed into local place descriptors. Place recognition is then achieved by extracting the current place descriptors and searching for similar place indices in a bag of words. In contrast, appearance-based VPR uses feature descriptors extracted from the entire image and performs place recognition by assessing feature similarities. SeqSLAM [3] describes image similarities by directly using the sum of absolute differences (SAD) between frames, while the vector of locally aggregated descriptors (VLAD) [7] aggregates local invariant features into a single feature vector and uses the Euclidean distance between vectors to quantify image similarities.

Recently, many works have investigated CNN-based features for appearance-based VPR tasks. Sunderhauf et al. [8] first used a pre-trained VGG model to extract middle-layer CNN outputs as image descriptors in a sequence matching pipeline. However, a pre-trained network cannot be further trained for the place recognition task, since data labels are hard to define for VPR. Recently, Chen et al. [9] and Garg et al. [5] address condition-invariant VPR as an image classification task and rely on precise but expensive human labeling for semantic labels. Arandjelovic et al. [10] developed NetVLAD, a modified form of the VLAD features, with CNN networks to improve feature robustness.

1 https://youtu.be/dS028yXKNlw

The approach that comes closest to our method is the work of Porav et al. [11], where they learn invertible generators based on CycleGAN [12]. The original CycleGAN method can transform an image from one domain to another, but such a transformation is limited to only two domains. Thus, for a multi-domain place recognition task, the method of Porav et al. requires a transformation model between each pair of conditions. In contrast, our method can learn more than two conditions within the same structure.

III. PROPOSED METHOD

In this section, we investigate the details of the two core modules in our conditional visual place recognition method.

A. Feature Extraction

a) VLAD: VLAD is a feature encoding and pooling method, which encodes a set of local feature descriptors extracted from an image by using a clustering method such as K-means. For the feature extraction module, we extract multi-domain place features from the raw image by utilizing a CapsuleNet module. Let $q_{ik}$ be the strength of the association of data vector $x_i$ to the cluster $\mu_k$, such that $q_{ik} \geq 0$ and $\sum_{k=1}^{K} q_{ik} = 1$, where $K$ is the number of clusters. VLAD encodes the features $x$ by considering the residuals

$$v_k = \sum_{i=1}^{N} q_{ik}\,(x_i - \mu_k),$$

and the joint feature description $\{v_1, v_2, v_3, \ldots, v_K\}$, where $N$ is the number of local features.

Assuming we can extract $N$ lower-level feature descriptors (each denoted $x_i$) from the raw image, we can construct a new VLAD-like module with the following equation,

$$v_k = \sum_{i=1}^{N} Q_k(l_i)\, r(x_i, \mu_k), \qquad (1)$$

where $r(x_i, \mu_k)$ is the residual function measuring the similarity between $x_i$ and $\mu_k$, and $Q_k(l_i)$ is the weighting of capsule vector $l_i$ associated with the $k$th cluster center.

b) Modified CapsuleNet: In order to transform Eq. (1) into an end-to-end learning block, we consider two aspects:

1) constructing the residual function $r(x_i, \mu_k)$;
2) assigning the weights $Q_k(l_i)$.

With lower-level features extracted from the shallow convolution layer, we use an $N \times D$ property matrix to map lower-level features into higher-level features, where $N$ is the number of CNN units in the shallow convolution layer.

In order to integrate the lower-to-higher feature mapping within a single layer, the local lower-level feature $x_i$ should have a linear mapping layer to represent the residual function $r(x_i, \mu_k) = f(x_i, \mu_k) - \frac{1}{N}\sum_k f(x_i, \mu_k)$, where $f(x_i, \mu_k) = W_{ik} x_i + b_k$, and $W_{ik}$ and $b_k$ are the linear transformation weight and bias for the $k$th capsule center.

Furthermore, to estimate the weighting $Q_k(x_i)$ of the local capsule features, we apply a soft assignment defined as $Q_k(x_i) = \frac{\exp(b_{ik})}{\sum_{j}^{K} \exp(b_{ij})}$, where $b_{ik}$ is the probability that the $i$th local capsule feature belongs to the $k$th capsule cluster $c_k$. Therefore, Eq. (1) can be written in the following form,

$$v_k = \sum_{i=1}^{N} \frac{\exp(b_{ik})}{\sum_{j}^{K} \exp(b_{ij})} \left( f(x_i, \mu_k) - \frac{1}{N}\sum_k f(x_i, \mu_k) \right).$$

In order to learn the parameters $b_{ij}$, $W_{ik}$, and $c_k$, we apply the iterative dynamic routing mechanism described in [4]. For the output of $N_{object}$ higher-level features, we assume the last $D_C$ dimensions are assigned as the condition features, e.g., $D_C = 4$ in the case where the condition is season.
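To make the aggregation above concrete, here is a minimal NumPy sketch of the soft-assignment residual pooling of Eq. (1) with the linear residual function and softmax weighting described in this section. The array shapes, variable names, and the assumption that the routing logits $b_{ik}$ are given (in the paper they are refined by dynamic routing [4]) are illustrative choices, not the exact implementation.

```python
import numpy as np

def capsule_vlad_pool(x, W, b_lin, b_logits):
    """Soft-assignment residual pooling, a sketch of Eq. (1).

    x        : (N, Din)          lower-level capsule features x_i
    W        : (N, K, Dout, Din) per-feature, per-cluster weights W_ik
    b_lin    : (K, Dout)         per-cluster biases b_k
    b_logits : (N, K)            routing logits b_ik (assumed given here;
                                 refined by iterative dynamic routing in [4])
    returns  : (K, Dout)         aggregated capsule descriptors v_k
    """
    N = x.shape[0]

    # f(x_i, mu_k) = W_ik x_i + b_k
    f = np.einsum('nkoi,ni->nko', W, x) + b_lin[None, :, :]

    # residual r(x_i, mu_k) = f(x_i, mu_k) - (1/N) * sum_k f(x_i, mu_k)
    r = f - f.sum(axis=1, keepdims=True) / N

    # soft assignment Q_k(x_i): softmax of the routing logits over clusters
    q = np.exp(b_logits - b_logits.max(axis=1, keepdims=True))
    q = q / q.sum(axis=1, keepdims=True)

    # v_k = sum_i Q_k(x_i) * r(x_i, mu_k)
    return np.einsum('nk,nko->ko', q, r)
```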

B. Feature Separation

In the previous section, we described the feature extraction module $p_\theta$. In this section, we use an additional decoder module $q_\phi$, together with two reconstruction modules on the feature ($L_{Feature}$) and image ($L_{Image}$) domains, to achieve the feature separation. Naturally, the condition-invariant feature $Z_G$ and the condition-related feature $Z_C$ are highly correlated. Fig. 2 shows the relationship between the information in $Z_G$ and $Z_C$: $H(Z_G, Z_C)$ and $I(Z_G; Z_C)$ are the joint entropy and the mutual information respectively, while $H(Z_G|Z_C)$ and $H(Z_C|Z_G)$ are the conditional entropies.

Fig. 2: The relationship between the condition-related feature $Z_C$ and the condition-invariant feature $Z_G$ from an information-theoretic view.

From the view of information theory, feature separation can be achieved in the following ways:

• Decrease the conditional entropy $H(z|x)$: lower conditional entropy enforces a unique mapping from $x \in \chi$ to $z \in (Z_G, Z_C)$;

• Improve the geometric feature extraction capability: the more accurate the geometry we capture, the higher the LCD accuracy we can achieve;

• Reduce the mutual information $I(Z_G; Z_C|x)$: use environmental conditions to direct the feature extraction.

We add these three restrictions to our feature separation module.

Fig. 3: The framework of feature separation, which combines the feature extraction module given in the previous section; a classification module estimating the environmental conditions; a decoder module mapping the extracted feature back to the data domain; a discriminator module distinguishing generated data from raw data; and two reconstruction loss modules, on the data and feature domains respectively.

a) Conditional Entropy Reduction: $H(z|x)$ measures the uncertainty of the feature $z$ given the data sample $x$. The conditional entropy $H(z|x) = 0$ can be achieved if and only if $z$ is a deterministic mapping of $x$. Thus, reducing $H_{p_\theta}(z|x)$ can improve the uniqueness of the mapping from $x$ to $z$, where $p_\theta$ denotes the encoder module with parameters $\theta$. However, directly minimizing the conditional entropy $H_{p_\theta}(z|x)$ is intractable, since we cannot access the data-label pair $(x, z)$ directly. An alternative approach is to optimize an upper bound of $H_{p_\theta}(z|x)$, which can be obtained through the following derivation,

$$
\begin{aligned}
\min_{\theta,\phi} H_{p_\theta}(z|x) &\triangleq \min_{\theta,\phi} -\sum p_\theta(z|x) \log p_\theta(z|x) \\
&= \min_{\theta,\phi} -\sum p_\theta(z|x) \log q_\phi(z|x) - \sum p_\theta(z|x)\left[\log p_\theta(z|x) - \log q_\phi(z|x)\right] \\
&= \min_{\theta,\phi} H_{p_\theta(z|x)}\left[\log q_\phi(z|x)\right] - \mathrm{KL}\!\left(p_\theta(z|x)\,\|\,q_\phi(z|x)\right), \qquad (2)
\end{aligned}
$$

where KL is the Kullback-Leibler divergence, and $H_{p_\theta(z|x)}[\log q_\phi(z|x)]$ measures the uncertainty of the predicted feature given a sample $x$. Since we cannot extract features from $q_\phi$ directly, we add an additional feature encoder module after the decoder module (see Fig. 3). Eq. (2) can then be converted into

$$
\begin{aligned}
\min_{\theta,\phi} H(z|x) &\leq \min_{\theta,\phi} H_{p_\theta(z|x)}\left[\log q_\phi(z|x)\right] \\
&\triangleq \min_{\theta,\phi} H_{z \sim p_\theta(z|x),\, \hat{x} \sim q_\phi(x|z)}\left[\log p_\theta(\hat{z} = z \mid \hat{x})\right] \\
&= L_{Feature}(z, \hat{z}), \qquad (3)
\end{aligned}
$$

where $L_{Feature}$ is the feature reconstruction loss between the feature extracted from the raw data and that extracted from the reconstructed data. As can be seen in Eq. (3), the original $H_{p_\theta}(z|x)$ is replaced by its upper bound $L_{Feature}(z, \hat{z})$, and the upper bound is reduced only when the feature domain and data domain are perfectly matched.
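As a rough illustration, the bound in Eq. (3) can be turned into a training signal by re-encoding the decoded image and penalizing the distance to the original feature. The sketch below assumes PyTorch-style encoder/decoder modules and an L2 distance; these are assumptions for illustration, not the paper's exact loss implementation.

```python
import torch.nn.functional as F

def feature_reconstruction_loss(encoder, decoder, x):
    """L_Feature (sketch of Eq. (3)): compare the feature of the raw data with
    the feature re-extracted from the reconstructed data."""
    z = encoder(x)               # z ~ p_theta(z|x)
    x_rec = decoder(z)           # reconstruction ~ q_phi(x|z)
    z_rec = encoder(x_rec)       # re-encode the reconstruction
    return F.mse_loss(z_rec, z)  # assumed L2 distance between z and z_hat
```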

b) Feature Extraction Improvement: The conditional entropy reduction sub-module restricts the mapping uncertainty from the data domain to the feature domain; this restriction is highly related to the generalization ability of the encoder module. For the place recognition task, there will be highly diverse scenes in practice, yet we can only generate limited samples for network training. In theory, a GAN uses a decoder and a discriminator module to learn the underlying feature-to-data transformation from limited samples. Thus, we improve the data generalization ability by applying a GAN:

$$L_{GAN} = \min_{\phi} \max_{\omega} \; \mathbb{E}_{x}\left[\log D_\omega(x)\right] + \mathbb{E}_{\hat{x} \sim q_\phi(x|z)}\left[\log\left(1 - D_\omega(\hat{x})\right)\right]. \qquad (4)$$

As demonstrated by Goodfellow et al. [13], by iteratively updating the decoder module $q_\phi$ and the discriminator module $D_\omega$, the GAN pulls the generated data distribution closer to the real data distribution and improves the generalization ability of the decoder module $q_\phi$.
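For reference, Eq. (4) is the standard GAN objective; a minimal PyTorch-style sketch is given below, assuming the discriminator outputs probabilities and detaching the generated sample when updating the discriminator. The module names and the non-saturating generator surrogate are assumptions for this sketch.

```python
import torch
import torch.nn.functional as F

def gan_losses(discriminator, x_real, x_fake):
    """Sketch of the adversarial term in Eq. (4)."""
    ones = torch.ones(x_real.size(0), 1)
    zeros = torch.zeros(x_real.size(0), 1)

    # discriminator: maximize log D(x) + log(1 - D(x_fake))
    d_loss = F.binary_cross_entropy(discriminator(x_real), ones) + \
             F.binary_cross_entropy(discriminator(x_fake.detach()), zeros)

    # decoder / generator: fool the discriminator (non-saturating surrogate
    # for minimizing log(1 - D(x_fake)))
    g_loss = F.binary_cross_entropy(discriminator(x_fake), ones)
    return d_loss, g_loss
```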

c) Mutual Information Reduction: $I(z_G; z_C|x)$ is the mutual information, which can be expanded as

$$I(z_G; z_C|x) = H(z_G|x) + H(z_C|x) - H(z_G, z_C|x),$$

where reducing the mutual information is equivalent to reducing the right-hand side of the above equation. Since the conditional entropy satisfies $0 \leq H(z_G, z_C|x)$, we can obtain an upper bound of $I(z_G; z_C|x)$ by ignoring $H(z_G, z_C|x)$:

$$\min_\theta I(z_G; z_C|x) \leq \min_\theta \left( H(z_G|x) + H(z_C|x) \right).$$

For the condition-related features, we apply a softmax-based classification module $L_{Cond}$ to reduce the conditional entropy $H(z_C|x)$. Furthermore, we apply an L2 image reconstruction loss to further restrict the uncertainty $H(z_G|x)$ given a data sample $x$,

$$L_{Image} = \| x_{raw} - x_{reco} \|, \qquad (5)$$

where $x_{raw}$ and $x_{reco}$ are the raw image and the reconstructed one, respectively.

By combining Eqs. (3), (4), (5) and $L_{Cond}$, the joint loss function can be obtained as

$$L_{Joint} = L_{Feature} + L_{GAN} + L_{Cond} + L_{Image}. \qquad (6)$$
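Putting the pieces together, one training step of the separation network can combine the four terms of Eq. (6) roughly as sketched below. The module interfaces, the equal weighting of the terms, the cross-entropy used for $L_{Cond}$, and the way the last $D_C$ feature dimensions are sliced out are assumptions made for this sketch.

```python
import torch
import torch.nn.functional as F

def joint_loss(encoder, decoder, classifier, discriminator, x, cond_label, d_c=4):
    """Sketch of the joint objective in Eq. (6):
    L_Joint = L_Feature + L_GAN + L_Cond + L_Image."""
    z = encoder(x)                        # joint feature [z_G | z_C]
    z_c = z[:, -d_c:]                     # last D_C dims: condition-related part
    x_rec = decoder(z)                    # reconstructed image
    z_rec = encoder(x_rec)                # re-extracted feature

    l_feature = F.mse_loss(z_rec, z.detach())                 # Eq. (3)
    l_gan = F.binary_cross_entropy(discriminator(x_rec),      # generator side of Eq. (4)
                                   torch.ones(x.size(0), 1))
    l_cond = F.cross_entropy(classifier(z_c), cond_label)     # softmax condition classifier
    l_image = F.mse_loss(x_rec, x)                            # Eq. (5), L2-type term

    return l_feature + l_gan + l_cond + l_image
```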

IV. EXPERIMENT RESULTS

In this section, we analyze the performance of our method on two datasets and compare it with three other feature extraction methods for the visual place recognition task2.

a) Datasets: The datasets used here are the Nordland dataset [14] and the GTAV dataset [15]. The Nordland dataset was recorded from a train in Norway during four different seasons, and each sequence follows the same track. From each sequence we generate 17885 frames from the video at 12 Hz; the first 16885 frames of each sequence are used for training, and the last 1000 frames for testing. Note that we train on all four Nordland seasonal sequences, using the seasonal labels to find the condition-dependent/invariant features, and then test on the last 1000 frames of each sequence. In the training procedure, we randomly select frames and their corresponding status labels from the four sequences.

2 The experiments are conducted on a single NVIDIA 1080Ti card with 64 GB RAM on the Ubuntu 14.04 system. During testing, the framework takes around 2000 MB of GPU memory.

Fig. 4: Precision-recall curves of the various VPR methods on the two datasets: (a) Nordland, (b) GTAV. A method is considered better when its curve lies toward the upper-right corner. As shown, MDFL outperforms the other methods in most cases.

The second dataset, GTAV [15], contains trajectories on the same track under three different weather conditions (sunny, rainy and foggy). This dataset is more challenging than the Nordland dataset, since the viewpoints vary within the GTAV dataset. We generate more than 10,000 frames in each sequence; 9000 frames are used as training data, and the remaining 1000 frames for testing. For each dataset, all images are resized to 64 × 64 × 3 in RGB format. The loop closure detection mechanism follows the original SeqSLAM method: sequences of image features are matched instead of single images. For more details about the structure of SeqSLAM, we refer the reader to [3].

Fig. 5: AUC index of the various VPR methods, matching images from the two datasets under different conditions: (a) Nordland, (b) GTAV.

b) Accuracy Analysis: To investigate the place recognition accuracy, we compare our feature extraction method with three alternatives in sequential matching: (1) the original feature in SeqSLAM, which uses the sum of absolute differences as the local place description; (2) convolution-layer features from the VGG network, which is trained on a large-scale image classification dataset [16]; and (3) unsupervised features obtained from adversarial feature learning with generative adversarial networks [17]. A place is considered matched when the distance between the current frame and the target frame is within 10 frames. We evaluate the performance using the precision-recall curve (PR-curve), the area under the curve (AUC) index, the inference time, and the storage requirement.
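For clarity, the single-frame version of this evaluation can be sketched as follows: given per-frame descriptors from a query sequence and a reference sequence of the same route under a different condition, a retrieved match counts as correct when its index lies within 10 frames of the query index, and sweeping a distance threshold yields the PR-curve. The cosine distance and variable names are assumptions for this sketch; the actual pipeline performs SeqSLAM-style sequence matching rather than single-frame matching.

```python
import numpy as np

def pr_curve(query_feats, ref_feats, tolerance=10, num_thresholds=50):
    """Precision-recall sketch with a +/- tolerance-frame ground truth."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    r = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    dist = 1.0 - q @ r.T                       # cosine distance matrix

    best_ref = dist.argmin(axis=1)             # best reference frame per query
    best_dist = dist.min(axis=1)
    correct = np.abs(best_ref - np.arange(len(q))) <= tolerance

    precisions, recalls = [], []
    for thr in np.linspace(best_dist.min(), best_dist.max(), num_thresholds):
        accepted = best_dist <= thr            # matches accepted at this threshold
        tp = np.sum(accepted & correct)
        precisions.append(tp / max(accepted.sum(), 1))
        recalls.append(tp / len(q))
    return np.array(precisions), np.array(recalls)
```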

Fig. 4 and Fig. 5 show the precision-recall curves and the AUC index, respectively, for all the methods on the Nordland and GTAV datasets. In Fig. 5, the labels spr-sum, spr-fall, etc. refer to the performance using the same network and the same model with different testing sequences.

In general, all the methods perform better on the Nordland dataset than on the GTAV dataset, since the viewpoints are stable and the geometric changes are smooth due to the constant speed of the train. In contrast, the test sequences in the GTAV dataset exhibit significant viewpoint differences. Furthermore, the limited field of view and the multiple dynamic objects in GTAV introduce additional feature noise, which causes significant differences in scene appearance.

VGG features perform well under normal conditions, such as summer-winter in Nordland, but perform poorly under unusual conditions, which indicates poor generalizability of VGG features. BiGAN does not perform well on either dataset since it does not take the condition of the scene into account and considers all the images as a joint manifold. For example, the same place under different weather conditions will be encoded differently using BiGAN. Since SeqSLAM uses gray images to ignore the appearance changes under different environmental conditions, the image-based features in SeqSLAM are robust to changing conditions. However, the matching accuracy decreases greatly on the GTAV dataset, since raw image features are very sensitive to changing viewpoints.

In general, MDFL outperforms the above features in most cases on the NORDLAND and GTAV datasets and handles complex situations well, but it is not the best in some situations, such as spring-summer in NORDLAND and rain-summer in GTAV. One potential reason is that, in each dataset, we only consider one type of environmental condition (season or weather) and do not take illumination changes into account. Since illumination changes continuously, it is not easy to encode this type of environmental condition as a label for training.

Table I shows the average AUC results of the different methods on both datasets, and our MDFL method outperforms all the other methods. Note that for all the above results, the MDFL framework was trained only once, end-to-end; this is one of the advantages of our framework, as we do not need to retrain a separate network for each pair of conditions.

TABLE I: Average AUC index

Dataset    Caps    SeqSLAM   VGG16   BiGAN
GTAV       0.790   0.518     0.715   0.627
Nordland   0.912   0.876     0.804   0.345

Fig. 6: Image reconstruction using MDFL on the Nordland (a, b) and GTAV (c, d) datasets. For (a) and (b), MDFL extracts geometry features from the right-side image (labelled 0) and reconstructs them into four different seasons (1. spring, 2. summer, 3. fall, 4. winter) shown on the left side. For (c) and (d), MDFL extracts geometry features from the bottom-right image of each scene and reconstructs it into three images with different weather conditions (5. sunny, 6. rainy and 7. foggy).


c) Image Reconstruction: In order to check whether our framework has learned the geometry features irrespective of the conditions in the scene, we perform a scene reconstruction task using our method. As seen in Fig. 6, for each dataset we reconstruct images in different conditions from the same input frame. In the Nordland dataset, the reconstructed scene around a mountain (Fig. 6(a)) captures the different terrain styles across the seasons; the reconstructed scene around the lake (Fig. 6(b)) also generalizes the lake appearance to the winter condition. Still, the strong contrast between sky and road leads to reconstruction failures in other weather conditions, as shown in the case of the white road in Fig. 6(d).

d) Inference Time and Storage: The joint features combining both geometry and condition information are stored in a 64 × 16 matrix in char format, so the space required for a local place description is only 1 kB. Assuming that a robot is encoding at 1 Hz, the memory space required for one day of recording is only around 85 MB. This is suitable for long-term mobile robot navigation tasks, where storage is a critical concern. Furthermore, the average inference time for one frame on a GTX 1080Ti and a Jetson TX1 is around 3.5 ms and 30 ms, respectively. This allows the MDFL method to be applied in real time on mobile platforms.
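The storage figure above follows from simple arithmetic; the short check below reproduces it, assuming one byte per matrix entry.

```python
# 64 x 16 descriptor, one byte (char) per entry
bytes_per_place = 64 * 16                      # 1024 bytes = 1 kB per place
frames_per_day = 24 * 60 * 60                  # encoding at 1 Hz for one day
mb_per_day = bytes_per_place * frames_per_day / (1024 ** 2)
print(round(mb_per_day, 1))                    # ~84.4 MB, i.e. "around 85 MB"
```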

V. CONCLUSION

In this paper, we introduced a novel multi-domain feature learning method for the visual place recognition task. At the core of our framework lies the idea of extracting condition-invariant features for place recognition under various environmental conditions. A CapsuleNet-based module is used to capture multi-domain features from the raw image, and a feature separation module is applied to indirectly separate condition-related and condition-invariant features. Experiments on multi-season and multi-weather datasets demonstrate the robustness of our method. The major limitation of our method is that the shallow CapsuleNet-based module only clusters lower-level features and is not effective at capturing the semantic descriptions needed for place recognition. In future work, we will investigate a hierarchical CapsuleNet module to extract higher-level semantic features for place recognition.

REFERENCES

[1] H. Zhou, K. Ni, Q. Zhou, and T. Zhang, "An SFM algorithm with good convergence that addresses outliers for realizing mono-SLAM," Transactions on Industrial Informatics, vol. 12, no. 2, pp. 515–523, 2016.

[2] S. Lowry, N. Sunderhauf, P. Newman, J. J. Leonard, D. Cox, P. Corke, and M. J. Milford, "Visual place recognition: A survey," IEEE Transactions on Robotics, vol. 32, no. 1, pp. 1–19, Feb 2016.

[3] M. J. Milford and G. F. Wyeth, "SeqSLAM: Visual route-based navigation for sunny summer days and stormy winter nights," in IEEE International Conference on Robotics and Automation, May 2012, pp. 1643–1649.

[4] S. Sabour, N. Frosst, and G. E. Hinton, "Dynamic routing between capsules," in Advances in Neural Information Processing Systems, 2017, pp. 3859–3869.

[5] S. Garg, A. Jacobson, S. Kumar, and M. Milford, "Improving condition- and environment-invariant place recognition with semantic place categorization," in IROS. IEEE, 2017, pp. 6863–6870.

[6] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.

[7] Y. Wang, Y. Cen, R. Zhao, S. Kan, and S. Hu, "Fusion of multiple VLAD vectors based on different features for image retrieval," in 13th International Conference on Signal Processing. IEEE, 2016, pp. 742–746.

[8] N. Sunderhauf, F. Dayoub, S. Shirazi, B. Upcroft, and M. Milford, "On the performance of ConvNet features for place recognition," in IROS, 2015, pp. 4297–4304.

[9] Z. Chen, A. Jacobson, N. Sunderhauf, B. Upcroft, L. Liu, C. Shen, I. Reid, and M. Milford, "Deep learning features at scale for visual place recognition," in ICRA. IEEE, 2017, pp. 3223–3230.

[10] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, "NetVLAD: CNN architecture for weakly supervised place recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5297–5307.

[11] H. Porav, W. Maddern, and P. Newman, "Adversarial training for adverse conditions: Robust metric localisation using appearance transfer," arXiv preprint arXiv:1803.03341, 2018.

[12] Y. Lu, Y.-W. Tai, and C.-K. Tang, "Conditional CycleGAN for attribute guided face image generation," arXiv preprint arXiv:1705.09966, 2017.

[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.

[14] N. Sunderhauf, P. Neubert, and P. Protzel, "Are we there yet? Challenging SeqSLAM on a 3000 km journey across all four seasons," in Proc. of Workshop on Long-Term Autonomy, ICRA, 2013.

[15] P. Yin, Y. He, N. Liu, and J. Han, "Condition directed multi-domain adversarial learning for loop closure detection," arXiv preprint arXiv:1711.07657, 2017.

[16] M. A. E. Muhammed, A. A. Ahmed, and T. A. Khalid, "Benchmark analysis of popular ImageNet classification deep CNN architectures," in SmartTechCon, Aug 2017, pp. 902–907.

[17] J. Donahue, P. Krahenbuhl, and T. Darrell, "Adversarial feature learning," arXiv preprint arXiv:1605.09782, 2016.