
Life-Long Place Recognition by Shared Representative Appearance Learning

Fei Han∗, Xue Yang∗, Yiming Deng†, Mark Rentschler‡, Dejun Yang∗ and Hao Zhang∗
∗Division of Computer Science, EECS Department, Colorado School of Mines, Golden, CO 80401

†Department of Electrical Engineering, University of Colorado Denver, CO 80204
‡Department of Mechanical Engineering, University of Colorado Boulder, CO 80309

{fhan, xueyang}@mines.edu, [email protected], [email protected], {djyang, hzhang}@mines.edu

Abstract—Place recognition plays an important role in Simultaneous Localization and Mapping (SLAM) in many robotics applications. When long-term SLAM is performed in an outdoor environment, additional challenges can be encountered, including strong perceptual aliasing and appearance variations caused by changes of illumination, vegetation, weather, etc. These issues become especially critical when considering life-long place recognition. We introduce a novel Shared Representative Appearance Learning (SRAL) approach for long-term loop closure detection. Different from previous place recognition methods that use a single feature modality or multi-feature concatenation techniques, our SRAL approach can autonomously learn a set of shared features that are representative in all scene scenarios and integrate them together to represent the long-term appearance of scenes observed by a robot during vision-based navigation. By formulating SRAL as an optimization problem regularized by structured sparsity-inducing norms, we are able to model the interdependence of the feature modalities. An optimization algorithm is also developed to efficiently solve the formulated convex optimization problem, which holds a theoretical guarantee to find the global optimum. Extensive experiments are performed to assess the SRAL method using large-scale benchmark datasets, including the St Lucia, CMU-VL, and Nordland datasets. Experimental results have validated that SRAL outperforms existing place recognition methods and obtains superior performance on life-long place recognition.

I. INTRODUCTION

Visual place recognition over a long time period is one of the core capabilities needed by a modern intelligent robot to perform long-term visual Simultaneous Localization And Mapping (SLAM) [23], and has thus attracted increasing attention in the robotics community [10, 29, 20, 18, 34, 7]. Place recognition aims at identifying the locations previously visited by a mobile robot, which enables the robot to localize itself in an environment during navigation as well as to correct incremental pose drift during SLAM. However, visual place recognition is a challenging problem due to robot movement, occlusion, dynamic objects (e.g., cars and pedestrians), and other vision challenges such as illumination changes. Moreover, multiple locations may have similar appearance, which results in the so-called perceptual aliasing issue. Long-term place recognition introduces additional difficulties. For example, the same place during daytime and night, or across seasons, can look quite different, which we name the Long-term Appearance Change (LAC) problem.

In the past several years, life-long place recognition has been intensively investigated [23, 29, 35, 12, 28, 24], due to its

Fig. 1. Motivating examples of the algorithm based on learning the shared representative appearance for life-long place recognition. The same place can show great appearance variations in different weather, vegetation, and time (Nordland dataset). The proposed SRAL algorithm automatically identifies a set of heterogeneous features (e.g., color, GIST, HOG, and LBP) that are not only representative but also shared by different scenarios, so that image matching in these scenarios can be performed more accurately.

importance to long-term robot navigation. For example, FAB-MAP [6] is one of the first techniques to address life-long place recognition for loop-closure detection in SLAM; it detects revisited places by matching individual images according to their visual appearance, and was validated over 1000 km [7]. More recently, sequence-based image matching methods such as SeqSLAM [24], ABLE-M [2], and others [29, 12, 15] have demonstrated very promising performance on long-term place recognition. Previous sequence-based place recognition methods either utilize a single visual feature to encode place appearance, or combine two visual features simply through concatenation [34, 38, 39]. However, these methods, which rely on purely hand-crafted features, are not able to learn the importance of heterogeneous features and effectively integrate the features together for robust place recognition, especially across a long time span.

To construct a descriptive representation for life-long place recognition, we propose a novel algorithm, referred to as Shared Representative Appearance Learning (SRAL), which


automatically learns a set of heterogeneous visual features that are important in all scene scenarios, and fuses them together to represent the long-term appearance of scenes observed by a robot during vision-based navigation. Our SRAL algorithm models the key insight that visual features that are not only discriminative but also shared among all scene scenarios are best suited for long-term place recognition. If visual features are highly discriminative only for a specific scenario (e.g., winter), they are less useful for matching images from different scenarios, because they cannot encode the other scenarios. For the first time, our algorithm automatically learns multimodal shared representative features that are discriminative enough to represent the long-term visual appearance in all scene scenarios, even under strong perceptual aliasing and LAC issues (e.g., during daytime and night, and/or across multiple seasons). We formulate this task as a multi-task multimodal feature learning problem, and solve it through sparse convex optimization. The fused multimodal, long-term representation is then used with SeqSLAM [24], a sequence-based image matching method, to robustly perform life-long place recognition during daytime/night and across seasons.

Our contributions are threefold. First, we introduce a novel formulation that casts the construction of long-term appearance representations as a multitask, multimodal feature learning and fusion problem, with the objective of automatically identifying a set of heterogeneous visual features that are discriminative in all scene scenarios (e.g., daytime vs. night). Second, we propose a novel SRAL algorithm that addresses this problem through sparse convex optimization regularized by two structured sparsity-inducing norms, which analyze the interrelationships of multimodal features and learn shared representative features. Third, a new optimization method is developed to solve the formulated problem, which is theoretically guaranteed to find the optimal solution.

The rest of the paper is organized as follows. We describe related research on long-term place recognition in Section II. Section III introduces the novel SRAL algorithm for long-term representation learning. Experimental results are discussed in Section IV. Finally, we conclude the paper in Section V.

II. RELATED WORK

Existing SLAM techniques can be broadly classified into three categories, based on graph optimization, particle filters, and the Extended Kalman Filter [37]. Based on the relationship between visual inputs and mapping outputs, SLAM can also be categorized into 2D SLAM (using 2D images to produce 2D maps) [20, 15, 16], monocular SLAM (using 2D image sequences to generate 3D maps) [8, 27], and 3D SLAM (using 3D point clouds or RGB-D images to build 3D maps) [19, 33, 13]. Place recognition is an integral component for loop closing in all visual SLAM methods; it employs visual features to identify locations previously visited by robots. We provide an overview of the image matching methods and visual features used in life-long place recognition.

A. Image Matching for Long-Term Place Recognition

Most previous place recognition techniques are based on image-to-image matching and are broadly categorized into two groups: those based on pairwise similarity scoring and those based on nearest neighbor search. Methods based on pairwise similarity scores calculate a similarity score between the current scene image and each template based on a certain distance metric and select the template that has the maximum score [5, 11]. On the other hand, techniques based on nearest neighbor search typically build a search tree to efficiently identify the most similar template to the current scene. For example, FAB-MAP [7, 6] employs the Chow-Liu tree to locate the most similar template; RTAB-MAP [18, 19] uses the KD-tree to perform fast nearest neighbor search; and other search trees are also used in different place recognition methods for efficient image-to-image matching [18, 2]. However, due to the limited information provided by a single image, previous image-based matching methods usually suffer from the LAC and perceptual aliasing issues, and are not robust enough to perform life-long place recognition.

To integrate more information, several recent methods use a sequence of image frames to decrease the negative effects of LAC and perceptual aliasing and improve the accuracy of life-long place recognition [24, 2, 15, 14, 25]. Most sequence-based place recognition techniques are based on a similarity score computed from a pair of template and query image sequences. For example, SeqSLAM [24], one of the earliest sequence-based life-long place recognition methods, computes a sum of image-based similarity scores to match image sequences. The ABLE-M approach [2] concatenates binary features extracted from each image in a sequence into a representation used to match between sequences. Similarity scores are also used by other sequence-based image matching techniques for life-long place recognition in robot navigation [25, 14, 17, 15]: they calculate a sequence-based similarity score between the query and all template sequences to construct a similarity matrix, and then choose the template sequence with a statistically high score as a match. To model the temporal relationships of individual frames in a sequence, data association methods based on Hidden Markov Models (HMMs) [12], Conditional Random Fields (CRFs) [4], and graph flow [28] were also proposed to align a pair of template and query sequences.
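To make the general scheme concrete, the following Python fragment is a minimal sketch (not the implementation of any particular method above) that scores a query sequence against candidate template sequences by summing per-frame cosine similarities; the image descriptors and the frame alignment are assumed to be given.

```python
import numpy as np

def sequence_match_score(query_seq, template_seq):
    """Sum of frame-wise similarity scores along an aligned pair of
    sequences, in the spirit of sequence-based matching.
    Each row of query_seq / template_seq is one image descriptor."""
    assert len(query_seq) == len(template_seq)
    # Cosine similarity between corresponding frames.
    q = query_seq / np.linalg.norm(query_seq, axis=1, keepdims=True)
    t = template_seq / np.linalg.norm(template_seq, axis=1, keepdims=True)
    return float(np.sum(q * t))  # sum of per-frame similarities

def best_template(query_seq, templates):
    """Pick the template sequence with the highest sequence-based score;
    the scores over all templates form one column of a similarity matrix."""
    scores = [sequence_match_score(query_seq, t) for t in templates]
    return int(np.argmax(scores)), scores
```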

However, these previous methods, usually based on a single type of feature, are not capable of effectively fusing the rich information offered by multiple modalities of visual features. We address this problem by introducing a novel data fusion algorithm that can work with SeqSLAM and other sequence-based or image-based matching methods for life-long place recognition.

B. Visual Features for Place Representation

A large number of visual features have been proposed in previous long-term place recognition techniques to encode the scenes observed by robots during navigation; they can be broadly classified into two categories: local and global features.


Local features apply a detector to find points of interest (e.g., corners) in an image and a descriptor to encode the local information around each detected point of interest. The Bag-of-Words (BoW) model is then typically employed by place recognition methods to build a feature vector for each image. The Scale-Invariant Feature Transform (SIFT) feature is used along with a BoW model to detect revisited locations from 2D images [1]. FAB-MAP [6, 7] applies Speeded Up Robust Features (SURF) for visual place recognition. Both features are also used by the RTAB-MAP SLAM [18, 19]. The binary BRIEF and FAST features have been applied to build a BoW model for fast loop closure detection [9]. The ORB feature is also used for place recognition [26]. Local visual features are usually discriminative enough to differentiate each scene scenario (e.g., daytime vs. night, or different seasons). However, they are often not shared by different scenarios and are thus less representative for encoding all scene scenarios in long-term place recognition.

Global features extract information from the whole image, and a feature vector is often formed through concatenation or feature statistics (e.g., histograms). These global features can encode raw image pixels, shape signatures, and other information. GIST features [21], constructed from the responses of steerable filters at different orientations and scales, were employed to perform place recognition [34]. Local Difference Binary (LDB) features were applied to represent scenes by directly using the intensity and gradient differences of image grid cells to calculate a binary string [2]. SeqSLAM [24] used the sum of absolute differences between low-resolution images as global features to conduct sequence-based place recognition. Convolutional neural networks (CNNs) were also utilized to extract deep features to match image sequences [29]. Other global features were likewise proposed for life-long place recognition and loop-closure detection [25, 28, 31]. Global features can encode whole-image information, require no dictionary-based quantization, and have shown promising performance for long-term place recognition [23]. However, the critical problem of identifying the more discriminative global features and effectively fusing them together was not well studied in previous methods based on global features, and is addressed in this paper.

III. SHARED REPRESENTATIVE APPEARANCE LEARNING FOR LIFE-LONG PLACE RECOGNITION

To address the problem of life-long place recognition, we propose to learn and fuse a set of heterogeneous features that are representative and shared by multiple scenarios (e.g., night vs. daytime, or different seasons), which was not well investigated in previous research. We introduce the formulation of long-term feature learning and our novel SRAL algorithm from the perspective of sparse convex optimization. A new optimization solver is also implemented to solve the formulated problem, with a theoretical guarantee of convergence to the global optimal solution.

Notation. In this paper, matrices are written as boldface capital letters and vectors as boldface lowercase letters. Given a matrix $\mathbf{M} = \{m_{ij}\} \in \mathbb{R}^{n \times m}$, we refer to its $i$-th row and $j$-th column as $\mathbf{m}^i$ and $\mathbf{m}_j$, respectively. The $\ell_1$-norm of a vector $\mathbf{v} \in \mathbb{R}^n$ is defined as $\|\mathbf{v}\|_1 = \sum_{i=1}^{n} |v_i|$, and the $\ell_2$-norm of $\mathbf{v}$ is defined as $\|\mathbf{v}\|_2 = \sqrt{\mathbf{v}^\top \mathbf{v}}$. The $\ell_{2,1}$-norm of the matrix $\mathbf{M}$ is defined as:

$$\|\mathbf{M}\|_{2,1} = \sum_{i=1}^{n} \sqrt{\sum_{j=1}^{m} m_{ij}^2} = \sum_{i=1}^{n} \|\mathbf{m}^i\|_2, \quad (1)$$

and the Frobenius norm is defined as:

$$\|\mathbf{M}\|_F = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{m} m_{ij}^2}. \quad (2)$$
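As a quick numerical illustration of these definitions, a minimal numpy sketch:

```python
import numpy as np

def l21_norm(M):
    """l2,1-norm: sum of the l2-norms of the rows of M (Eq. 1)."""
    return np.sum(np.linalg.norm(M, axis=1))

def frobenius_norm(M):
    """Frobenius norm: square root of the sum of squared entries (Eq. 2)."""
    return np.sqrt(np.sum(M ** 2))

M = np.array([[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]])
print(l21_norm(M))               # 5 + 0 + 1 = 6
print(frobenius_norm(M))         # sqrt(9 + 16 + 1) = sqrt(26)
print(np.linalg.norm(M, 'fro'))  # same value via numpy's built-in
```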

A. Problem Formulation

Given a set of images acquired by a robot during navigation in multiple scenarios (e.g., daytime vs. night, or different seasons), the vectors of heterogeneous features extracted from this image set are denoted as $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_n] \in \mathbb{R}^{d \times n}$, and their corresponding scene scenarios are represented as $\mathbf{Y} = [\mathbf{y}_1, \ldots, \mathbf{y}_n]^\top \in \mathbb{R}^{n \times c}$, where $\mathbf{x}_i = [(\mathbf{x}_i^1)^\top, \ldots, (\mathbf{x}_i^m)^\top]^\top \in \mathbb{R}^d$ is a vector of heterogeneous features from the $i$-th image and consists of $m$ modalities such that $d = \sum_{j=1}^{m} d_j$, and $\mathbf{y}_i \in \mathbb{Z}^c$ is the indicator vector of scene scenarios, with $y_{ij}$ indicating whether the $i$-th image belongs to the $j$-th scene scenario. The task of shared representative appearance learning can then be cast as multitask feature learning and formulated as the following optimization problem:

$$\min_{\mathbf{W}} \|\mathbf{X}^\top \mathbf{W} - \mathbf{Y}\|_F^2 + \lambda \|\mathbf{W}\|_{2,1}, \quad (3)$$

where $\mathbf{W} \in \mathbb{R}^{d \times c}$ is a weight matrix, with $w_{ij}$ denoting the importance of the $i$-th visual feature with respect to the $j$-th scene scenario, and $\lambda$ is a trade-off parameter. The first term in Eq. (3), encoded by the squared Frobenius norm, is a loss function that models the squared error of using the weighted features to identify scene scenarios. The second term, based on the $\ell_{2,1}$-norm, is a regularization that enforces sparsity across the rows of $\mathbf{W}$ and a grouping effect among the weights within each row. The sparsity allows the method to find discriminative features with larger weights (non-discriminative features receive weights very close to 0); the grouping effect encourages discriminative features to have similar weights across all scene scenarios, since $\|\mathbf{m}^i\|_2$ in the $\ell_{2,1}$-norm of Eq. (1) attains its minimum when the weights of a visual feature are equal for all scenarios. This regularization term therefore models our insight of finding shared representative features to build a descriptive representation that can capture long-term appearance variations for accurate life-long place recognition.

Because different types of visual features typically capture different attributes of the scene scenarios (e.g., color, shape, etc.), when multiple modalities of features are used together, it is important to model the underlying structure introduced by the multiple modalities. To realize this capability, we define a new regularization term that enforces a grouping effect among the weights of features belonging to the same modality, as well as sparsity between modalities, to identify shared


Fig. 2. Illustration of the proposed structured sparsity-inducing norms. Given the feature weight matrix $\mathbf{W} = [\mathbf{W}^1; \mathbf{W}^2; \cdots; \mathbf{W}^m] \in \mathbb{R}^{d \times c}$, we model the interrelationship of the feature modalities using the $M$-norm regularization term, which allows our SRAL algorithm to identify the modalities shared across different scenarios; we also introduce the $\ell_{2,1}$-norm regularization to identify more representative shared features. This figure is best viewed in color.

discriminative modalities, as follows:

$$\|\mathbf{W}\|_M = \sum_{i=1}^{m} \sqrt{\sum_{p=1}^{d_i} \sum_{q=1}^{c} (w_{pq}^i)^2} = \sum_{i=1}^{m} \|\mathbf{W}^i\|_F, \quad (4)$$

where $\mathbf{W}^i \in \mathbb{R}^{d_i \times c}$ is the weight matrix for the $i$-th modality. The final problem formulation of our SRAL algorithm then becomes:

$$\min_{\mathbf{W}} \|\mathbf{X}^\top \mathbf{W} - \mathbf{Y}\|_F^2 + \lambda_1 \|\mathbf{W}\|_{2,1} + \lambda_2 \|\mathbf{W}\|_M, \quad (5)$$

which simultaneously considers the underlying structure of the feature modalities and identifies shared representative visual features to encode long-term appearance changes.
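For clarity, a minimal numpy sketch of evaluating this objective, assuming the modality dimensions $d_1, \ldots, d_m$ are known:

```python
import numpy as np

def m_norm(W, modal_dims):
    """Group norm of Eq. (4): sum of Frobenius norms of the per-modality
    row blocks W^i of W, where modal_dims = [d_1, ..., d_m]."""
    total, start = 0.0, 0
    for d_i in modal_dims:
        total += np.linalg.norm(W[start:start + d_i, :], 'fro')
        start += d_i
    return total

def sral_objective(X, Y, W, lam1, lam2, modal_dims):
    """Objective of Eq. (5): squared loss plus the two
    structured sparsity-inducing regularizers."""
    loss = np.linalg.norm(X.T @ W - Y, 'fro') ** 2
    l21 = np.sum(np.linalg.norm(W, axis=1))   # l2,1-norm of Eq. (1)
    return loss + lam1 * l21 + lam2 * m_norm(W, modal_dims)
```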

B. Place Recognition

The place recognition procedure of our SRAL method consists of two phases: a weight learning phase and a fusion phase. After solving the problem in Eq. (5) during the first phase (the solution is detailed in Section III-C), we obtain the optimal weight matrix $\mathbf{W}^* = [\mathbf{W}^{1*}; \mathbf{W}^{2*}; \cdots; \mathbf{W}^{m*}] \in \mathbb{R}^{d \times c}$. The optimal weight $\omega_i$ ($i = 1, \ldots, m$) of the $i$-th feature modality is then given by $\omega_i = \|\mathbf{W}^{i*}\|_F$.

In the second (fusion) phase, given a new observation with feature vector $\mathbf{x} = [\mathbf{x}^1; \mathbf{x}^2; \cdots; \mathbf{x}^m] \in \mathbb{R}^d$, we first calculate the matching score $s_i$ ($i = 1, \ldots, m$) between the query image and each existing template image using each feature modality separately. The final score is then obtained by the following fusion rule:

$$s = \sum_{i=1}^{m} \omega_i s_i, \quad (6)$$

where $\omega_i$ is the normalized weight of modality $i$. Finally, we determine whether two images match by comparing the score against a predefined threshold.
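A minimal sketch of this two-phase use of the learned weights; the per-modality scores $s_i$ are assumed to be produced by any image- or sequence-matching backend:

```python
import numpy as np

def modality_weights(W_opt, modal_dims):
    """Weight of each modality: omega_i = ||W^{i*}||_F, normalized to sum to 1."""
    w, start = [], 0
    for d_i in modal_dims:
        w.append(np.linalg.norm(W_opt[start:start + d_i, :], 'fro'))
        start += d_i
    w = np.asarray(w)
    return w / w.sum()

def fused_score(per_modal_scores, omega):
    """Eq. (6): final matching score as the weighted sum of per-modality scores.
    A pair of images matches if this score exceeds a predefined threshold."""
    return float(np.dot(omega, per_modal_scores))
```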

Our fusion method can be applied in both single-image-based SLAM and sequence-based SLAM. A significant advantage of our SRAL method is that we are able to learn representative feature modalities that are shared across the different environment conditions caused by changes of weather and/or illumination. This fusion method improves the robustness of feature matching for place recognition, which significantly benefits loop closure detection in life-long SLAM.

C. Optimization Algorithm and Analysis

Because the objective in Eq. (5) comprises two non-smooth regularization terms, the $\ell_{2,1}$-norm and the $M$-norm, it is difficult to solve in general. To this end, we implement a new iterative algorithm to solve the optimization problem in Eq. (5) with the non-smooth regularization terms.

Taking the derivative of the objective with respect to $\mathbf{w}_i$ ($1 \le i \le c$) and setting it to the zero vector, we obtain

$$\mathbf{X}\mathbf{X}^\top \mathbf{w}_i - \mathbf{X}\mathbf{y}_i + \lambda_1 \mathbf{D}\mathbf{w}_i + \lambda_2 \widetilde{\mathbf{D}}\mathbf{w}_i = \mathbf{0}, \quad (7)$$

where $\mathbf{D}$ is a diagonal matrix whose $i$-th diagonal element is $\frac{1}{2\|\mathbf{w}^i\|_2}$, and $\widetilde{\mathbf{D}}$ is a block-diagonal matrix whose $i$-th diagonal block is $\frac{1}{2\|\mathbf{W}^i\|_F}\mathbf{I}_i$, with $\mathbf{I}_i$ an identity matrix of size $d_i$ and $\mathbf{W}^i$ the $i$-th row segment of $\mathbf{W}$, containing the weights of the $i$-th feature modality. Thus we have

$$\mathbf{w}_i = \big(\mathbf{X}\mathbf{X}^\top + \lambda_1 \mathbf{D} + \lambda_2 \widetilde{\mathbf{D}}\big)^{-1} \mathbf{X}\mathbf{y}_i. \quad (8)$$

Note that $\mathbf{D}$ and $\widetilde{\mathbf{D}}$ depend on $\mathbf{W}$ and are thus also unknown variables. We propose an iterative algorithm to solve this problem, which is described in Algorithm 1.

In the following, we analyze the convergence and prove that Algorithm 1 converges to the global optimum. First, we present a lemma from Nie et al. [30]:

Lemma 1: For any vectors $\mathbf{v}$ and $\tilde{\mathbf{v}}$, the following inequality holds: $\|\mathbf{v}\|_2 - \frac{\|\mathbf{v}\|_2^2}{2\|\tilde{\mathbf{v}}\|_2} \le \|\tilde{\mathbf{v}}\|_2 - \frac{\|\tilde{\mathbf{v}}\|_2^2}{2\|\tilde{\mathbf{v}}\|_2}$.

From Lemma 1, we can easily obtain the following corollary.

Corollary 1: For any matrices $\mathbf{M}$ and $\widetilde{\mathbf{M}}$, the following inequality holds: $\|\mathbf{M}\|_F - \frac{\|\mathbf{M}\|_F^2}{2\|\widetilde{\mathbf{M}}\|_F} \le \|\widetilde{\mathbf{M}}\|_F - \frac{\|\widetilde{\mathbf{M}}\|_F^2}{2\|\widetilde{\mathbf{M}}\|_F}$.

Theorem 1: Algorithm 1 monotonically decreases the objective value of the problem in Eq. (5) in each iteration.

Proof: According to Steps 3 and 4 of Algorithm 1, we know that

$$\mathbf{W}(t+1) = \arg\min_{\mathbf{W}} \|\mathbf{X}^\top \mathbf{W} - \mathbf{Y}\|_F^2 + \lambda_1 \operatorname{Tr}\!\big(\mathbf{W}^\top \mathbf{D}(t+1)\mathbf{W}\big) + \lambda_2 \operatorname{Tr}\!\big(\mathbf{W}^\top \widetilde{\mathbf{D}}(t+1)\mathbf{W}\big). \quad (9)$$

Then we can derive that

$$\mathcal{J}(t+1) + \lambda_1 \operatorname{Tr}\!\big(\mathbf{W}^\top(t+1)\,\mathbf{D}(t+1)\,\mathbf{W}(t+1)\big) + \lambda_2 \operatorname{Tr}\!\big(\mathbf{W}^\top(t+1)\,\widetilde{\mathbf{D}}(t+1)\,\mathbf{W}(t+1)\big)$$
$$\le \mathcal{J}(t) + \lambda_1 \operatorname{Tr}\!\big(\mathbf{W}^\top(t)\,\mathbf{D}(t+1)\,\mathbf{W}(t)\big) + \lambda_2 \operatorname{Tr}\!\big(\mathbf{W}^\top(t)\,\widetilde{\mathbf{D}}(t+1)\,\mathbf{W}(t)\big),$$

where $\mathcal{J}(t) = \|\mathbf{X}^\top \mathbf{W}(t) - \mathbf{Y}\|_F^2$.


After substituting the definitions of $\mathbf{D}$ and $\widetilde{\mathbf{D}}$, we obtain

$$\mathcal{J}(t+1) + \lambda_1 \sum_{i=1}^{d} \frac{\|\mathbf{w}^i(t+1)\|_2^2}{2\|\mathbf{w}^i(t)\|_2} + \lambda_2 \sum_{i=1}^{m} \frac{\|\mathbf{W}^i(t+1)\|_F^2}{2\|\mathbf{W}^i(t)\|_F} \le \mathcal{J}(t) + \lambda_1 \sum_{i=1}^{d} \frac{\|\mathbf{w}^i(t)\|_2^2}{2\|\mathbf{w}^i(t)\|_2} + \lambda_2 \sum_{i=1}^{m} \frac{\|\mathbf{W}^i(t)\|_F^2}{2\|\mathbf{W}^i(t)\|_F}. \quad (10)$$

From Lemma 1 and Corollary 1, we can derive

$$\sum_{i=1}^{d} \|\mathbf{w}^i(t+1)\|_2 - \sum_{i=1}^{d} \frac{\|\mathbf{w}^i(t+1)\|_2^2}{2\|\mathbf{w}^i(t)\|_2} \le \sum_{i=1}^{d} \|\mathbf{w}^i(t)\|_2 - \sum_{i=1}^{d} \frac{\|\mathbf{w}^i(t)\|_2^2}{2\|\mathbf{w}^i(t)\|_2}, \quad (11)$$

and

$$\sum_{i=1}^{m} \|\mathbf{W}^i(t+1)\|_F - \sum_{i=1}^{m} \frac{\|\mathbf{W}^i(t+1)\|_F^2}{2\|\mathbf{W}^i(t)\|_F} \le \sum_{i=1}^{m} \|\mathbf{W}^i(t)\|_F - \sum_{i=1}^{m} \frac{\|\mathbf{W}^i(t)\|_F^2}{2\|\mathbf{W}^i(t)\|_F}. \quad (12)$$

Adding Eqs. (10)-(12) on both sides, we have

$$\mathcal{J}(t+1) + \lambda_1 \sum_{i=1}^{d} \|\mathbf{w}^i(t+1)\|_2 + \lambda_2 \sum_{i=1}^{m} \|\mathbf{W}^i(t+1)\|_F \le \mathcal{J}(t) + \lambda_1 \sum_{i=1}^{d} \|\mathbf{w}^i(t)\|_2 + \lambda_2 \sum_{i=1}^{m} \|\mathbf{W}^i(t)\|_F. \quad (13)$$

Therefore, Algorithm 1 monotonically decreases the objective value in each iteration.

Because the optimization problem in Eq. (5) is convex, the algorithm converges to the global optimal solution. Moreover, since the solution has a closed form in each iteration, Algorithm 1 converges very fast.

IV. EXPERIMENTAL RESULTS

A. Experiment Setup

We use three large-scale benchmark datasets to validate the performance of our SRAL method under different conditions and over various time spans. The dataset statistics are summarized in Table I. Four types of visual features were employed in our experiments for all of these datasets: color features [22] applied to 32 × 24 downsampled images, GIST features [21] applied to 320 × 240 downsampled images, HOG features [28] applied to 320 × 240 downsampled images, and LBP features [32] applied to 320 × 240 downsampled images. Instead of concatenating all these features together into one large vector for the scene representation, we learn the weight of each feature in order to select features that are robust across various environments, serving the long-term place recognition objective in SLAM loop closure detection.
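A hedged sketch of such a multimodal extraction pipeline is given below. It uses scikit-image's HOG and LBP implementations plus a simple color histogram; the GIST entry is only a rough Gabor-energy stand-in for the actual descriptor of [21], and all parameter choices (bin counts, cell sizes, filter frequencies) are illustrative assumptions rather than the exact settings used in the experiments.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.transform import resize
from skimage.feature import hog, local_binary_pattern
from skimage.filters import gabor

def extract_modalities(rgb):
    """Return a dict of per-modality feature vectors for one frame.
    Resolutions follow the paper; everything else is an assumption."""
    gray = rgb2gray(rgb)
    small = resize(rgb, (24, 32))    # 32 x 24 image for color features
    mid = resize(gray, (240, 320))   # 320 x 240 image for GIST/HOG/LBP

    # Color: per-channel intensity histograms of the downsampled image.
    color = np.concatenate(
        [np.histogram(small[..., c], bins=16, range=(0, 1))[0]
         for c in range(3)])

    # GIST stand-in: mean Gabor energy at a few orientations/frequencies
    # (NOT the real GIST descriptor, which needs a dedicated implementation).
    gist = np.array([np.abs(gabor(mid, frequency=f, theta=t)[0]).mean()
                     for f in (0.1, 0.25)
                     for t in np.linspace(0, np.pi, 4, endpoint=False)])

    # HOG over the 320 x 240 grayscale image.
    hog_vec = hog(mid, orientations=9, pixels_per_cell=(16, 16),
                  cells_per_block=(2, 2))

    # LBP histogram (uniform patterns, P + 2 = 10 possible codes).
    lbp = local_binary_pattern(mid, P=8, R=1, method='uniform')
    lbp_hist = np.histogram(lbp, bins=10, range=(0, 10))[0]

    return {'color': color, 'gist': gist, 'hog': hog_vec, 'lbp': lbp_hist}
```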

Algorithm 1: An efficient algorithm to solve the optimization problem in Eq. (5)

Input: $\mathbf{X} = [\mathbf{x}_1, \cdots, \mathbf{x}_n] \in \mathbb{R}^{d \times n}$ and $\mathbf{Y} = [\mathbf{y}_1, \cdots, \mathbf{y}_n]^\top \in \mathbb{R}^{n \times c}$

1: Let $t = 1$. Initialize $\mathbf{W}(t)$ by solving $\min_{\mathbf{W}} \|\mathbf{X}^\top \mathbf{W} - \mathbf{Y}\|_F^2$.
2: while not converged do
3:   Calculate the diagonal matrix $\mathbf{D}(t+1)$, where the $i$-th diagonal element is $\frac{1}{2\|\mathbf{w}^i(t)\|_2}$. Calculate the block-diagonal matrix $\widetilde{\mathbf{D}}(t+1)$, where the $i$-th diagonal block is $\frac{1}{2\|\mathbf{W}^i(t)\|_F}\mathbf{I}_i$.
4:   For each $\mathbf{w}_i$ ($1 \le i \le c$): $\mathbf{w}_i(t+1) = \big(\mathbf{X}\mathbf{X}^\top + \lambda_1 \mathbf{D}(t+1) + \lambda_2 \widetilde{\mathbf{D}}(t+1)\big)^{-1} \mathbf{X}\mathbf{y}_i$.
5:   $t = t + 1$.
6: end while

Output: $\mathbf{W} = \mathbf{W}(t) \in \mathbb{R}^{d \times c}$
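A direct numpy transcription of Algorithm 1 might look as follows; this is a sketch in which small epsilon terms are added to keep the divisions and linear solves well defined, a safeguard of ours rather than part of the algorithm as stated.

```python
import numpy as np

def sral_solver(X, Y, lam1, lam2, modal_dims, n_iter=50, eps=1e-8):
    """Iterative solver sketched from Algorithm 1.
    X: d x n feature matrix; Y: n x c scenario indicators;
    modal_dims: [d_1, ..., d_m] with sum(modal_dims) == d."""
    d, _ = X.shape
    XXt = X @ X.T
    XY = X @ Y
    # Step 1: initialize W from the unregularized least-squares problem.
    W = np.linalg.solve(XXt + eps * np.eye(d), XY)
    for _ in range(n_iter):
        # D: diagonal, i-th entry 1 / (2 ||w^i(t)||_2).
        row_norms = np.linalg.norm(W, axis=1)
        D = np.diag(1.0 / (2.0 * row_norms + eps))
        # D~: block diagonal, i-th block (1 / (2 ||W^i(t)||_F)) * I_{d_i}.
        diag_vals, start = [], 0
        for d_i in modal_dims:
            blk = np.linalg.norm(W[start:start + d_i, :], 'fro')
            diag_vals.extend([1.0 / (2.0 * blk + eps)] * d_i)
            start += d_i
        D_tilde = np.diag(diag_vals)
        # Closed-form update of Eq. (8), solved once for all c columns.
        W = np.linalg.solve(XXt + lam1 * D + lam2 * D_tilde, XY)
    return W
```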

TABLE I
STATISTICS AND SCENARIOS OF THE PUBLIC BENCHMARK DATASETS USED FOR VALIDATION OF OUR SRAL METHOD

Dataset         Sequence      Image Statistics                            Scenario
St Lucia [10]   10 × 12 km    10 × ~22,000 frames, 640×480 at 15 FPS      Different times of the day
CMU-VL [3]      5 × 8 km      5 × ~13,000 frames, 1024×768 at 15 FPS      Different months
Nordland [35]   4 × 728 km    4 × ~900,000 frames, 1920×1080 at 25 FPS    Different seasons


We implement both the proposed SRAL method and a simple multi-modal feature concatenation baseline for long-term place recognition tasks. Our implementation uses a mixture of unoptimized Matlab and C++ on a Linux laptop with an i7 3.0 GHz CPU, 16 GB of memory, and a 2 GB GPU. Similar to other state-of-the-art methods [29, 36], the implementation at this stage is not able to perform large-scale long-term loop closure detection in real time. A key limiting factor is that the runtime is proportional to the number of previously visited places. Utilizing memory management techniques [19], combined with an optimized implementation, can potentially overcome this challenge and achieve real-time performance. In these experiments, we qualitatively and quantitatively evaluate our algorithms and compare them with several state-of-the-art methods, including BRIEF-GIST [34] (using a single BRIEF feature), SeqSLAM [24] (using a single grayscale image feature), and others.

B. Results on the St Lucia Dataset (Various Times of the Day)

The St Lucia dataset [10] was collected by a single camera installed on a car in the suburban area of St Lucia, Australia, at various times over several days during a two-week period. Each data instance includes a video of 20-25 minutes. GPS data was also recorded and is used in the experiments as the ground truth for place recognition.


[Figure 3 panels: (a) matched images; (b) precision-recall curves comparing SRAL, Feature Concatenation, BRIEF-GIST, SeqSLAM, Color, and LBP; (c) learned feature weights: Color 0.6%, GIST 40%, HOG 59.1%, LBP 0.3%.]

Fig. 3. Experimental results on the St Lucia dataset. Fig. 3(a) presents an example showing the matched images recorded at 15:45 on 08/18/2009 and at 10:00 on 09/10/2009, respectively. Fig. 3(b) illustrates the precision-recall curves that indicate the performance of our SRAL algorithm, together with quantitative comparisons against some of the main state-of-the-art loop closure detection methods. Fig. 3(c) illustrates the weights of each feature modality learned automatically by Algorithm 1. The figures are best viewed in color.

The dataset contains several challenges, including appearance variations due to illumination changes at different times of the day, dynamic objects such as pedestrians and vehicles, and viewpoint variations due to slight route deviations. The dataset statistics are shown in Table I.

Loop closure detection results on the St Lucia dataset are illustrated in Fig. 3. The quantitative performance is evaluated using a standard precision-recall curve, as shown in Fig. 3(b). The high precision and recall values indicate that our SRAL method with the $\ell_{2,1}$-norm and $M$-norm obtains high performance and matches morning and afternoon video sequences well. We also implement the feature concatenation method, which simply concatenates all feature modalities into one large feature vector for the scene representation. Our SRAL method, in contrast, learns the weight of each feature modality, which enables a more powerful and robust scene representation. The precision-recall curves of these two methods in Fig. 3(b) show that our SRAL method outperforms the simple feature concatenation scheme, which underscores the importance of representative, shared feature modalities. To qualitatively evaluate the experimental results, an intuitive example of the sequence-based matching is presented in Fig. 3(a): we show the image (upper row of Fig. 3(a)) that has the maximum matching score for each query image (bottom row of Fig. 3(a)) within the sequence. These qualitative results demonstrate that the proposed SRAL algorithm works well in the presence of dynamic objects and other vision challenges, including camera motion and illumination changes.
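For reference, a minimal sketch of how such a precision-recall curve can be produced from fused matching scores and GPS-derived ground-truth labels, here via scikit-learn; the exact evaluation protocol of the paper may differ.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def evaluate_pr(score_matrix, gt_matrix):
    """score_matrix[i, j]: fused matching score between query i and template j.
    gt_matrix[i, j]: 1 if GPS ground truth labels the pair as the same place.
    The best template per query is scored against the ground truth while
    the acceptance threshold is swept."""
    best = np.argmax(score_matrix, axis=1)            # best template per query
    scores = score_matrix[np.arange(len(best)), best]
    labels = gt_matrix[np.arange(len(best)), best]
    precision, recall, thresholds = precision_recall_curve(labels, scores)
    return precision, recall, thresholds
```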

Comparisons with some of the main state-of-the-art methods are also presented graphically in Fig. 3(b). It can be observed that, for long-term loop closure detection, our SRAL algorithm outperforms the methods based on single features, including BRIEF-GIST [34], SeqSLAM [24], color-based SLAM [22], and LBP-based localization [32], due to the significant appearance variations of the same location at different times. As the matched images in Fig. 3(a) show, scene appearance varies greatly during different times of the day, which renders some features less representative for identifying the same place in different scenarios. After the feature modality learning procedure in our SRAL method, the weights of these less useful features are decreased, resulting in better final place recognition performance.

[Figure 4 panels: (a) matched sequences; (b) precision-recall curves comparing SRAL, Feature Concatenation, BRIEF-GIST, SeqSLAM, Color, and LBP; (c) learned feature weights: Color 9.79%, GIST 0.98%, HOG 88.25%, LBP 0.98%.]

Fig. 4. Experimental results on the CMU-VL dataset. Fig. 4(a) presents an example demonstrating the matched images and query images recorded in December and October, respectively. Fig. 4(b) illustrates the precision-recall curves that indicate the performance of our SRAL algorithm, together with quantitative comparisons against some of the main state-of-the-art loop closure detection methods. Fig. 4(c) illustrates the weights of each feature modality learned automatically by Algorithm 1. The figures are best viewed in color.

C. Results on the CMU-VL Dataset (Different Months)

The CMU Visual Localization (VL) dataset [3] was gathered using two cameras installed on a car that traveled the same route five times in the Pittsburgh area of the USA during different months, in varying climatological, environmental, and weather conditions. GPS information is also available and is used as the ground truth for algorithm evaluation. This dataset contains seasonal changes caused by vegetation, snow, and illumination variations, as well as urban scene changes due to construction


and dynamic objects. The visual data from the left camera is used in this set of experiments.

The qualitative and quantitative testing results obtained by our SRAL algorithm on the CMU-VL dataset are shown graphically in Fig. 4. The qualitative results in Fig. 4(a) show the images (upper row) with the maximum score for each query image (bottom row) in an observed sequence. It is clearly observed that our SRAL method can match scene images well and recognize the same locations across different months that exhibit significant weather, vegetation, and illumination changes. The quantitative results in Fig. 4(b) indicate that the SRAL method with the $M$-norm and $\ell_{2,1}$-norm regularizations obtains strong performance. To demonstrate the significance of our SRAL method with weighted feature fusion, we also implemented the baseline feature concatenation method and compared its performance. The comparison in Fig. 4(b) indicates that our SRAL method is again able to address the LAC problem and obtain better scene recognition results, the same phenomenon observed in the experiment on the St Lucia dataset.

Fig. 4(b) also compares our SRAL method with several previous loop closure detection approaches, supporting the same conclusion as in the St Lucia experiment: our SRAL approach significantly outperforms methods based on single-feature matching for long-term place recognition.

[Figure 5 panels: (a) matched sequences; (b) precision-recall curves comparing SRAL, Feature Concatenation, BRIEF-GIST, SeqSLAM, Color, and LBP; (c) learned feature weights: Color 2.49%, GIST 0.2%, HOG 96.68%, LBP 0.64%.]

Fig. 5. Experimental results on the Nordland dataset. Fig. 5(a) presents an example illustrating the matched images and query images recorded in winter and spring, respectively. Fig. 5(b) shows the precision-recall curves that indicate the performance of our SRAL algorithm, together with quantitative comparisons against some of the main state-of-the-art loop closure detection methods. Fig. 5(c) illustrates the weights of each feature modality learned automatically by Algorithm 1. The figures are best viewed in color.

D. Results on the Nordland Dataset (Different Seasons)

The Nordland dataset [35] contains visual data from a ten-hour train journey, recorded in four seasons (around 3000 km of travel in total) from the viewpoint of the train's front cart. GPS data was also collected and is employed as the ground truth for algorithm evaluation. Because the dataset was recorded across different seasons, the visual data contains strong scene appearance variations under different climatological, environmental, and illumination conditions. In addition, since multiple places in the wilderness along the trip exhibit similar appearance, this dataset contains strong perceptual aliasing. The visual information is also very limited in several locations, such as tunnels. These difficulties make the Nordland dataset one of the most challenging datasets for loop closure detection.

Fig. 5 presents the experimental results obtained by matching the spring data to the winter video in the Nordland dataset. Fig. 5(a) shows several example images from one of the matched sequences (upper row) together with the query images (bottom row) within the sequence, validating that our SRAL algorithm can accurately match images that exhibit dramatic appearance changes across seasons. Fig. 5(b) illustrates the quantitative results obtained by our SRAL algorithm and the comparison with previous techniques, showing the improvement of our SRAL method over existing methods. It can also be observed from Fig. 5(b) that our SRAL method outperforms the feature concatenation method, showing the significance of feature weight learning.

V. CONCLUSION

In this paper, we propose a novel SRAL algorithm that learns shared representative features and fuses the learned heterogeneous feature modalities together to address the LAC problem and accurately perform life-long place recognition. The shared representability of each feature modality is learned automatically, formulated as an optimization problem regularized by structured sparsity-inducing norms. A new optimization algorithm is also implemented to efficiently solve the formulated problem. To evaluate our SRAL algorithm, extensive experiments were performed on three large-scale benchmark datasets. The qualitative results validated that SRAL is able to robustly perform long-term place recognition under significant scene variations across various environments on different days, months, and seasons. In addition, the quantitative evaluation showed that the proposed algorithm outperforms previous techniques and obtains promising life-long place recognition performance.

REFERENCES

[1] Adrien Angeli, David Filliat, Stephane Doncieux, and Jean-Arcady Meyer. Fast and incremental method for loop-closure detection using bags of visual words. TRO, 24(5):1027–1037, 2008.

[2] R. Arroyo, P.F. Alcantarilla, L.M. Bergasa, and E. Romera. Towards life-long visual localization using an efficient matching of binary sequences from images. In ICRA, 2015.

[3] Hernan Badino, Daniel Huber, and Takeo Kanade. Real-time topometric localization. In ICRA, 2012.

[4] Cesar Cadena, Dorian Galvez-Lopez, Juan D Tardos, and Jose Neira. Robust place recognition with stereo sequences. TRO, 28(4):871–885, 2012.

[5] Cheng Chen and Han Wang. Appearance-based topological Bayesian inference for loop-closing detection in a cross-country environment. IJRR, 25(10):953–983, 2006.

[6] Mark Cummins and Paul Newman. FAB-MAP: Probabilistic localization and mapping in the space of appearance. IJRR, 27(6):647–665, 2008.

[7] Mark Cummins and Paul Newman. Highly scalable appearance-only SLAM - FAB-MAP 2.0. In RSS, 2009.

[8] Jakob Engel, Thomas Schops, and Daniel Cremers. LSD-SLAM: Large-scale direct monocular SLAM. In ECCV, 2014.

[9] Dorian Galvez-Lopez and Juan D Tardos. Bags of binary words for fast place recognition in image sequences. TRO, 28(5):1188–1197, 2012.

[10] Arren J Glover, William P Maddern, Michael J Milford, and Gordon F Wyeth. FAB-MAP + RatSLAM: Appearance-based SLAM for multiple times of day. In ICRA, 2010.

[11] Jens-Steffen Gutmann and Kurt Konolige. Incremental mapping of large cyclic environments. In ISCIRA, 1999.

[12] Paul Hansen and Brett Browning. Visual place recognition using HMM sequence matching. In IROS, 2014.

[13] Peter Henry, Michael Krainin, Evan Herbst, Xiaofeng Ren, and Dieter Fox. RGB-D mapping: Using Kinect-style depth cameras for dense 3D modeling of indoor environments. IJRR, 31(5):647–663, 2012.

[14] Kin Leong Ho and Paul Newman. Detecting loop closure with scene sequences. IJCV, 74(3):261–286, 2007.

[15] Edward Johns and Guang-Zhong Yang. Feature co-occurrence maps: Appearance-based localisation throughout the day. In ICRA, 2013.

[16] Michael Kaess, Hordur Johannsson, Richard Roberts, Viorela Ila, John J Leonard, and Frank Dellaert. iSAM2: Incremental smoothing and mapping using the Bayes tree. IJRR, 31(2):216–235, 2012.

[17] Manfred Klopschitz, Christopher Zach, Arnold Irschara, and Dieter Schmalstieg. Generalized detection and merging of loop closures for video sequences. In 3D Data Processing, Visualization, and Transmission, 2008.

[18] Mathieu Labbe and Francois Michaud. Appearance-based loop closure detection for online large-scale and long-term operation. TRO, 29(3):734–745, 2013.

[19] Mathieu Labbe and Francois Michaud. Online global loop closure detection for large-scale multi-session graph-based SLAM. In IROS, 2014.

[20] Yasir Latif, Cesar Cadena, and Jose Neira. Robust loop closing over time for pose graph SLAM. IJRR, pages 1611–1626, 2013.

[21] Yasir Latif, Guoquan Huang, John Leonard, and Jose Neira. An online sparsity-cognizant loop-closure algorithm for visual navigation. In RSS, 2014.

[22] Donghwa Lee, Hyongjin Kim, and Hyun Myung. 2D image feature-based real-time RGB-D 3D SLAM. In RITA, pages 485–492. 2013.

[23] Stephanie Lowry, Niko Sunderhauf, Paul Newman, John J Leonard, David Cox, Peter Corke, and Michael J Milford. Visual place recognition: A survey. TRO, 2015.

[24] Michael J Milford and Gordon F Wyeth. SeqSLAM: Visual route-based navigation for sunny summer days and stormy winter nights. In ICRA, 2012.

[25] Michael J Milford, Gordon F Wyeth, and DF Rasser. RatSLAM: A hippocampal model for simultaneous localization and mapping. In ICRA, 2004.

[26] Raul Mur-Artal and Juan D Tardos. Fast relocalisation and loop closing in keyframe-based SLAM. In ICRA, 2014.

[27] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. ORB-SLAM: A versatile and accurate monocular SLAM system. TRO, 31(5):1147–1163, 2015.

[28] Tayyab Naseer, Luciano Spinello, Wolfram Burgard, and Cyrill Stachniss. Robust visual robot localization across seasons using network flows. In AAAI, 2014.

[29] Tayyab Naseer, Michael Ruhnke, Cyrill Stachniss, Luciano Spinello, and Wolfram Burgard. Robust visual SLAM across seasons. In IROS, 2015.

[30] Feiping Nie, Heng Huang, Xiao Cai, and Chris H Ding. Efficient and robust feature selection via joint ℓ2,1-norms minimization. In NIPS, 2010.

[31] Edward Pepperell, Peter Corke, and Michael J Milford. All-environment visual place recognition with SMART. In ICRA, 2014.

[32] Yongliang Qiao, Cindy Cappelle, and Yassine Ruichek. Place recognition based visual localization using LBP feature and SVM. In Advances in Artificial Intelligence and Its Applications, pages 393–404. 2015.

[33] Jurgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In IROS, 2012.

[34] Niko Sunderhauf and Peter Protzel. BRIEF-Gist - Closing the loop by simple means. In IROS, 2011.

[35] Niko Sunderhauf, Peer Neubert, and Peter Protzel. Are we there yet? Challenging SeqSLAM on a 3000 km journey across all four seasons. In ICRAW, 2013.

[36] Niko Sunderhauf, Sareh Shirazi, Adam Jacobson, Feras Dayoub, Edward Pepperell, Ben Upcroft, and Michael Milford. Place recognition with ConvNet landmarks: Viewpoint-robust, condition-robust, training-free. In RSS, 2015.

[37] Sebastian Thrun and John J Leonard. Simultaneous localization and mapping. In Springer Handbook of Robotics, pages 871–889. 2008.

[38] Xue Yang, Fei Han, Hua Wang, and Hao Zhang. Enforcing template representability and temporal consistency for adaptive sparse tracking. In IJCAI, 2016.

[39] Hao Zhang, Fei Han, and Hua Wang. Robust multimodal sequence-based loop closure detection via structured sparsity. In RSS, 2016.