

  • EAO-SLAM: Monocular Semi-Dense Object SLAM Based on Ensemble Data Association

    Yanmin Wu1, Yunzhou Zhang1,2, Delong Zhu3, Yonghui Feng2, Sonya Coleman4 and Dermot Kerr4

    Abstract— Object-level data association and pose estimation play a fundamental role in semantic SLAM, yet both remain unsolved due to the lack of robust and accurate algorithms. In this work, we propose an ensemble data association strategy that integrates parametric and nonparametric statistical tests. By exploiting the nature of different statistics, our method effectively aggregates the information of different measurements and thus significantly improves the robustness and accuracy of data association. We then present an accurate object pose estimation framework, in which an outlier-robust centroid and scale estimation algorithm and an object pose initialization algorithm are developed to improve the optimality of the pose estimation results. Furthermore, we build a SLAM system that can generate semi-dense or lightweight object-oriented maps with a monocular camera. Extensive experiments are conducted on three publicly available datasets and a real scenario. The results show that our approach significantly outperforms state-of-the-art techniques in accuracy and robustness. The source code is available at https://github.com/yanmin-wu/EAO-SLAM.

    I. INTRODUCTION

    Conventional visual SLAM systems have achieved significant success in robot localization and mapping tasks. In recent years, more effort has been devoted to making SLAM serve robot navigation, object manipulation, and environment representation. Semantic SLAM is a promising technique for enabling such applications and has received much attention from the community [1]. In addition to the conventional functions, semantic SLAM also focuses on a detailed expression of the environment, e.g., labeling map elements or objects of interest, to support different high-level applications.

    Object SLAM is a typical application of semantic SLAM, whose goal is to estimate more robust and accurate camera poses by leveraging the semantic information of in-frame objects [2]–[4]. In this work, we further extend the content of object SLAM by enabling it to build lightweight and object-oriented maps, as demonstrated in Fig. 1, in which the objects are represented by cubes or quadrics with their locations, orientations, and scales accurately registered.

    Fig. 1: A lightweight and object-oriented semantic map.

    This work was supported by the National Natural Science Foundation of China (No. 61973066, 61471110), the Equipment Pre-research Foundation (61403120111), the Foundation of Key Laboratory of Aerospace System Simulation (6142002301), the Foundation of Key Laboratory of Equipment Reliability (61420030302), the Natural Science Foundation of Liaoning (No. 20180520040), and the Fundamental Research Funds for the Central Universities (N172608005, N182608004).

    1Yanmin Wu is with the Faculty of Robot Science and Engineering, Northeastern University, Shenyang, China.

    2Yunzhou Zhang and Yonghui Feng are with the College of Information Science and Engineering, Northeastern University, Shenyang 110819, China (Corresponding author: Yunzhou Zhang, Email: zhangyunzhou@mail.neu.edu.cn).

    3Delong Zhu is with the Department of Electronic Engineering, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong SAR, China.

    4Sonya Coleman and Dermot Kerr are with the School of Computing and Intelligent Systems, Ulster University, N. Ireland, UK.

    The challenges of object SLAM are twofold: 1) Existing data association methods [5]–[7] are not robust or accurate enough to handle complex environments that contain multiple object instances, and there are no practical solutions that systematically address this problem. 2) Object pose estimation is not accurate, especially for monocular object SLAM. Although some improvements have been achieved in recent studies [8]–[10], they typically depend on strict assumptions that are hard to fulfill in real-world applications.

    In this paper, we propose EAO-SLAM, a monocular object SLAM system, to address the data association and pose estimation problems. First, we integrate parametric and nonparametric statistical tests with the traditional IoU-based method to conduct model ensembling for data association. Compared with conventional methods, our approach fully exploits the nature of different statistics, e.g., Gaussian and non-Gaussian measurements, and hence exhibits significant advantages in association robustness. For object pose estimation, we propose a centroid and scale estimation algorithm and an object pose initialization approach based on the isolation forest (iForest). The proposed methods are robust to outliers and exhibit high accuracy, which significantly facilitates the joint pose optimization process.
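
    To make the idea of an iForest-based, outlier-robust centroid concrete, the following is a minimal sketch and not the authors' implementation. It assumes the object is observed as a list of 3D points, scores each point with a small isolation forest written from scratch, and averages only the points whose anomaly score falls below a threshold. The function names, the 0.6 threshold, the tree count, and the subsample size are all illustrative assumptions.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def _c(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def _build(points, rng, depth, max_depth):
    """Grow one isolation tree with random axis-aligned splits."""
    if depth >= max_depth or len(points) <= 1:
        return ("leaf", len(points))
    axis = rng.randrange(len(points[0]))
    lo = min(p[axis] for p in points)
    hi = max(p[axis] for p in points)
    if lo == hi:  # cannot split further along this axis
        return ("leaf", len(points))
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[axis] < split]
    right = [p for p in points if p[axis] >= split]
    return ("node", axis, split,
            _build(left, rng, depth + 1, max_depth),
            _build(right, rng, depth + 1, max_depth))

def _path_length(point, tree):
    """Depth at which `point` lands in a leaf, plus the leaf-size correction."""
    depth = 0
    while tree[0] == "node":
        _, axis, split, left, right = tree
        tree = left if point[axis] < split else right
        depth += 1
    return depth + _c(tree[1])

def anomaly_scores(points, n_trees=100, sample_size=64, seed=0):
    """Isolation-forest score in (0, 1]; values near 1 indicate likely outliers."""
    size = min(sample_size, len(points))
    if size < 2:
        return [0.0] * len(points)
    rng = random.Random(seed)
    max_depth = math.ceil(math.log2(size))
    trees = [_build(rng.sample(points, size), rng, 0, max_depth)
             for _ in range(n_trees)]
    norm = n_trees * _c(size)
    return [2.0 ** (-sum(_path_length(p, t) for t in trees) / norm)
            for p in points]

def robust_centroid(points, threshold=0.6, **forest_args):
    """Mean of the points whose score is below `threshold` (assumed cutoff)."""
    scores = anomaly_scores(points, **forest_args)
    inliers = [p for p, s in zip(points, scores) if s < threshold] or list(points)
    dim = len(points[0])
    return tuple(sum(p[d] for p in inliers) / len(inliers) for d in range(dim))
```

    A point that lies far from the rest is separated by the first few random splits, so its average path length is short and its score is high; discarding such points before averaging keeps a single bad triangulation from dragging the centroid away from the object.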

    The contributions of this paper are summarized as follows:

    • We propose an ensemble data association strategy that can effectively aggregate different measurements of the objects to improve association accuracy.

    • We propose an object pose estimation framework based on iForest, which is robust to outliers and can accurately estimate the locations, poses, and scales of objects.

    • Based on the proposed method, we implement EAO-SLAM to build lightweight and object-oriented maps.

    • We conduct comprehensive experiments and verify the effectiveness of our proposed methods on publicly available datasets and a real scenario. The source code of this work is also released.

    arXiv:2004.12730v2 [cs.RO] 29 Jul 2020

    II. RELATED WORK

    A. Data Association

    Data association is an indispensable ingredient of semantic SLAM; it determines whether an object observed in the current frame corresponds to an existing object in the map. Bowman et al. [5] use a probabilistic method to model the data association process and leverage the EM algorithm to find correspondences between observed landmarks. Subsequent studies [7], [11] further extend the idea to associate dynamic objects or conduct semantic dense reconstruction. These methods can achieve high association accuracy, but can only process a limited number of object instances. Their efficiency also remains to be improved due to the expensive EM optimization process [12]. Object tracking is another commonly used approach to data association. Li et al. [13] propose to project 3D cubes onto the image plane and then leverage the Hungarian tracking algorithm to conduct association using the projected 2D bounding boxes. Tracking-based methods achieve high runtime efficiency, but can easily generate incorrect priors in complex environments, yielding incorrect association results.
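
    The tracking-style association described above, where projected boxes of map objects are matched to detected boxes, can be sketched as follows. This is an illustration under stated assumptions rather than the method of [13]: boxes are `(x1, y1, x2, y2)` tuples, the matching score is IoU, and the optimal one-to-one assignment is found by exhaustive search over permutations as a stand-in for the Hungarian algorithm (both return a maximum-score assignment, but the exhaustive search is only practical for a handful of boxes). The function names and the `min_iou` gate are hypothetical.

```python
from itertools import permutations

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def associate(projected, detected, min_iou=0.3):
    """Match projected map-object boxes to detections by maximizing total IoU.

    Returns (projected_index, detected_index) pairs; pairs whose IoU is
    below `min_iou` are dropped, leaving those boxes unmatched.
    """
    # Always enumerate assignments of the smaller set into the larger one.
    swap = len(projected) > len(detected)
    small, large = (detected, projected) if swap else (projected, detected)
    scores = [[iou(s, l) for l in large] for s in small]

    best, best_total = (), -1.0
    for perm in permutations(range(len(large)), len(small)):
        total = sum(scores[i][j] for i, j in enumerate(perm))
        if total > best_total:
            best, best_total = perm, total

    pairs = [(i, j) for i, j in enumerate(best) if scores[i][j] >= min_iou]
    return [(j, i) for i, j in pairs] if swap else pairs
```

    Detections left unmatched by `associate` would typically be treated as candidates for new map objects; how such candidates are handled is outside the scope of this sketch.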

    In recent studies, more data association approaches have been developed based on maximum shared information. Liu et al. [14] propose random walk descriptors to represent the topological relationships between objects, and objects with the maximum number of shared descriptors are regarded as the same instance. Instead, Yang et al. [8] propose to directly count the number of matched map points on the detected objects as the association criterion, yielding much more efficient performance. Grinvald et al. [2] propose to measure the similarity between semantic labels, and Ok et al. [3] propose to leverage the correlation of hue-saturation histograms. The major drawback of these methods is that the designed features or descriptors are typically not general or robust enough and can easily cause incorrect associations.
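
    As a small illustration of the shared-information idea, the sketch below associates a detection with the map object that shares the most map points with it, in the spirit of the point-counting criterion attributed to [8]. It assumes map points are identified by integer ids and that each object keeps a set of such ids; the function name and the `min_shared` threshold are hypothetical.

```python
def associate_by_shared_points(detection_points, map_objects, min_shared=3):
    """Return the id of the map object sharing the most map points, or None.

    detection_points: set of map-point ids observed inside the detected box.
    map_objects: dict mapping object id -> set of map-point ids it owns.
    """
    best_id, best_count = None, 0
    for obj_id, obj_points in map_objects.items():
        shared = len(detection_points & obj_points)
        if shared > best_count:
            best_id, best_count = obj_id, shared
    # Too few shared points: treat the detection as unassociated.
    return best_id if best_count >= min_shared else None
```

    The criterion is cheap because it needs only set intersections, but it inherits the weakness noted above: when few map points fall on an object, or neighbouring objects share points, the count alone is not discriminative.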

    Weng et al. [15] are the first to propose nonparametric statistical testing for semantic data association, which can handle cases in which the statistics do not follow a Gaussian distribution. Later, Iqbal et al. [6] also verify the effectiveness of nonparametric data association. However, this method cannot effectively handle statistics that follow Gaussian distributions, and hence cannot fully exploit the different measurements in SLAM. Based on this observation, we combine the parametric and nonparametric methods to perform model ensembling, which exhibits superior association performance in complex scenarios with multiple categories of objects.
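
    The combination of a nonparametric and a parametric test can be sketched as below. This is a simplified illustration, not the paper's pipeline. It assumes scalar measurements, uses a Wilcoxon rank-sum (Mann-Whitney U) test with a normal approximation for the non-Gaussian statistic (e.g., point coordinates along one axis), and uses a Welch-style test on the means, also with a normal approximation in place of the t distribution, for the Gaussian statistic (e.g., centroid coordinates). The rule that both tests must fail to reject, the significance level `alpha`, and all function names are assumptions made for the example.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

_STD_NORMAL = NormalDist()

def _two_sided_p(z):
    """Two-sided p-value of a standard normal test statistic."""
    return 2.0 * (1.0 - _STD_NORMAL.cdf(abs(z)))

def rank_sum_p(x, y):
    """Nonparametric test: are x and y drawn from the same distribution?

    Normal approximation without tie correction; x and y must be non-empty.
    """
    pooled = sorted((v, k) for k, sample in enumerate((x, y)) for v in sample)
    rank_sum_x, i = 0.0, 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # tied values share the average rank
        rank_sum_x += midrank * sum(1 for _, k in pooled[i:j] if k == 0)
        i = j
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mu = n1 * n2 / 2.0
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return _two_sided_p((u - mu) / sigma)

def mean_diff_p(x, y):
    """Parametric test: do x and y have the same mean?

    Welch-style statistic; each sample needs at least two values.
    """
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    if se == 0.0:
        return 1.0 if mean(x) == mean(y) else 0.0
    return _two_sided_p((mean(x) - mean(y)) / se)

def same_object(points_a, points_b, centroids_a, centroids_b, alpha=0.05):
    """Associate only if neither test rejects the 'same object' hypothesis."""
    votes = [
        rank_sum_p(points_a, points_b) > alpha,       # non-Gaussian statistic
        mean_diff_p(centroids_a, centroids_b) > alpha,  # Gaussian statistic
    ]
    return all(votes)
```

    Requiring agreement from both tests is the most conservative way to combine them; a weighted or majority vote over more cues (for example, adding an IoU check) is an equally plausible design and would trade missed associations against false ones differently.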

    B. Object SLAM

    Benefiting from deep learning techniques [16], [17], object detection is robustly integrated into the SLAM framework for labeling objects of interest in the map. The exploitation of in-frame objects significantly enlarges the application scope of traditional SLAM. Some studies [15], [18], [19] treat objects as landmarks to estimate camera poses or for relocalization [13]. Others [20] leverage object size to constrain the scale of monocular SLAM, or remove dynamic objects to improve pose estimation accuracy [7], [21]. In recent years, the combination of object SLAM and grasping [22] has also attracted considerable interest and has facilitated research on autonomous mobile manipulation.

    Object models in semantic SLAM can be broadly divided into three categories: instance-level models, category-specific models, and general models. The instance-level models [9], [23] depend on a well-established database that records all the related objects. The prior information of objects provides important object-camera constraints for graph optimization. Since the models need to be known in advance, the application scenarios of such methods are limited. There are also some studies on category-specific models, which focus on describing category-level features. For example, Parkhiya et al. [10] and Joshi et al. [19] use a CNN to estimate the