Monocular Multibody Visual SLAM
Thesis submitted in partial fulfillment
of the requirements for the degree of
MS by Research
in
Computer Science and Engineering
by
Abhijit Kundu
200807030
Robotics Research Lab
International Institute of Information Technology
Hyderabad - 500 032, INDIA
April 2011
Copyright © Abhijit Kundu, 2011
All Rights Reserved
International Institute of Information Technology
Hyderabad, India
CERTIFICATE
It is certified that the work contained in this thesis, titled “Monocular Multibody Visual SLAM”
by Abhijit Kundu, has been carried out under our supervision and is not submitted elsewhere
for a degree.
Date Advisers: Dr. K. Madhava Krishna and Dr. C. V. Jawahar
To Robots and Humans trying to achieve Singularity
Acknowledgments
I am extremely grateful to my advisors, Dr. Madhava Krishna and Prof. C. V. Jawahar,
for their guidance, help, and encouragement. Specifically, I am thankful to them for the
thought-provoking and insightful discussions about this work during the weekly meetings over
the last two years. And in the first place, I am thankful to Dr. Krishna for helping me join
the MS program and the Robotics Lab here.
Acknowledgments are also due to all fellow students and colleagues at IIIT Hyderabad for
their ideas and comments on my research, technical discussions, and most importantly their
friendship. I feel lucky to be a member of a wonderful research community at IIIT.
Finally, I thank my parents and family for supporting me during my studies, not just here,
but throughout my whole life.
Abstract
Vision based SLAM [11, 21, 23, 33, 51] and SfM systems [17] have been the subject of
much research and are finding applications in many areas such as robotics, augmented reality,
and city mapping. But almost all these approaches assume a static environment, containing
only rigid, non-moving objects. Moving objects are treated the same way as outliers and
filtered out using robust statistics like RANSAC. Though this may be a feasible solution in
less dynamic environments, it soon fails as the environment becomes more and more dynamic.
Moreover, accounting for both the static and moving objects provides richer information about
the environment. A robust solution to the SLAM problem in dynamic environments will expand
the potential for robotic applications, especially in applications which operate in close proximity
to human beings and other robots. Robots will be able to work not only for people but also
with people.
This thesis presents a realtime, incremental multibody visual SLAM system that allows
choosing between full 3D reconstruction and simple tracking of the moving objects. Motion
reconstruction of dynamic points or objects from a monocular camera is considered very hard
due to well known problems of observability. We attempt to solve the problem with Bearing
only Tracking (BOT) and by integrating multiple cues to avoid observability issues. The BOT
is accomplished through a particle filter, and by integrating multiple cues from the reconstruction
pipeline. With the help of these cues, many real world scenarios which are considered
unobservable with a monocular camera are solved to reasonable accuracy. This enables the
building of a unified dynamic 3D map of scenes involving multiple moving objects. Tracking
and reconstruction are preceded by motion segmentation and detection, which makes use of
efficient geometric constraints to detect difficult degenerate motions, where objects move in
the epipolar plane. Results reported on multiple challenging real world image sequences verify
the efficacy of the proposed framework.
Own Publications
[1] Abhijit Kundu, C. V. Jawahar and K. M. Krishna. Realtime Multibody Visual SLAM with
a Smoothly Moving Monocular Camera. International Conference on Computer Vision
(ICCV). 2011. (Accepted)
[2] Abhijit Kundu, K. M. Krishna and C. V. Jawahar. Realtime Motion Segmentation based
Multibody Visual SLAM. Indian Conference on Computer Vision, Graphics and Image
Processing (ICVGIP). 2010. (Best Paper Award)
[3] Abhijit Kundu, C. V. Jawahar and K. M. Krishna. Realtime Moving Object Detection
from a Freely Moving Monocular Camera. IEEE International Conference on Robotics
and Biomimetics (ROBIO). 2010.
[4] Abhijit Kundu and K. M. Krishna. Moving Object Detection by Multi-View Geometric
Techniques from a Single Camera Mounted Robot. IEEE/RSJ International Conference
on Intelligent Robots and Systems (IROS). 2009.
Publications can be downloaded from http://abhijitkundu.info/
Contents
Chapter Page
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
Own Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Background and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Why SLAM? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Why Monocular Vision? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.3 Why Multibody Visual SLAM? . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Related work and Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Thesis Overview and layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.2 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.3 Thesis layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2 Feature Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1 Feature Detectors and Descriptors . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Feature Matching Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Dense Feature Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3 Moving Object Detection and Segmentation . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2 Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.4 Initialization of Motion Segmentation . . . . . . . . . . . . . . . . . . . . . . . . 17
3.5 Geometric Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.5.1 Epipolar Constraint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.5.2 Flow Vector Bound (FVB) Constraint . . . . . . . . . . . . . . . . . . . . 18
3.6 Independent Motion Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.7 Clustering Unmodeled Motions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.8 Computation of Fundamental Matrix from Odometry . . . . . . . . . . . . . . . . 21
3.8.1 Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.8.2 Robot-Camera Calibration . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.8.3 Preventing Odometry Noise . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.9 Results of Moving Object Detection . . . . . . . . . . . . . . . . . . . . . . . . 22
3.9.1 Robot mounted Camera Sequence . . . . . . . . . . . . . . . . . . . . . 23
3.9.2 Handheld Indoor Lab Sequence . . . . . . . . . . . . . . . . . . . . . . . 26
3.9.3 Detection of Degenerate Motions . . . . . . . . . . . . . . . . . . . . . . 27
3.9.4 Person detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4 Visual SLAM Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.1 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Visual SLAM Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3 Feedback from Motion Segmentation . . . . . . . . . . . . . . . . . . . . . . . . 31
4.4 Dealing Degenerate Configurations . . . . . . . . . . . . . . . . . . . . . . . . . 31
5 Moving Object Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.1 Particle Filter based BOT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.1.1 Ground Plane, Depth Bound & Size Bound . . . . . . . . . . . . . . . . 34
5.1.2 Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.1.3 Integrating Depth and Velocity Constraints . . . . . . . . . . . . . . . . 35
6 Unification: putting everything together . . . . . . . . . . . . . . . . . . . . . . . . 36
6.1 Relative Scale Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
6.2 Feedback from SfM to BOT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
6.3 Dynamic 3D Occupancy Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6.4 Multibody Reconstruction Results . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.4.1 Camvid Sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.4.2 New College Sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
6.4.3 Versailles Rond Sequence . . . . . . . . . . . . . . . . . . . . . . . . . . 41
6.4.4 Moving Box Sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6.5.1 Comparison of different cues to BOT . . . . . . . . . . . . . . . . . . . . 42
6.5.2 Smooth Camera Motions . . . . . . . . . . . . . . . . . . . . . . . . . . 42
7 Conclusion and Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
7.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Chapter 1
Introduction
For mobile robots to be able to work with and for people, and thus operate in our everyday
environments, they need to be able to acquire knowledge through perception. In other words,
they need to collect sensor measurements from which they extract meaningful information about
the scene. Vision is an extraordinarily powerful sense for this. The capability of computers
to “see” has been the vision of computer vision and robot vision researchers for many years.
We humans do this effortlessly, and it often seems very straightforward, as portrayed in much
popular science fiction (see Fig. 1.1). However, the lexical simplicity of this objective of “seeing”
hides a very complex reality that people are very often tricked by. Even renowned researchers in
the field have fallen into the trap: the anecdote that Marvin Minsky, an Artificial Intelligence
pioneer from MIT, assigned the computer vision problem as a summer project to a student
back in the sixties is an illustrative and well-known example. We still lack a complete
understanding of how our own human perception works, and computers still cannot
see. But we have made some great advances in the field too. One such example is the
field of Structure from Motion (SfM) [17], which takes as input the 2D motion from images
and seeks to infer, in a totally automated manner, the 3D structure of the viewed scene and the
camera locations where the images were captured. Both SfM from computer vision and the
Figure 1.1 Depiction of robot vision in popular fiction. The above scenes are from the popular movie series “Terminator”. Though it may seem quite natural for robots to interpret scenes and road traffic and to avoid obstacles, this has ended up being one of the hardest problems for Artificial Intelligence.
Simultaneous Localization and Mapping (SLAM) [55, 2, 33] in mobile robotics research do
the same job of estimating sensor motion and the structure of an unknown static environment.
One important motivation behind this is to estimate the 3D scene structure and camera
motion from an image sequence in realtime so as to help guide robots.
However, almost all existing work on vision based SLAM makes a big assumption about the
environment: that it should be static. This thesis is about my efforts towards extending visual
SLAM/SfM to dynamic environments containing multiple moving objects. The objective is
to push the boundaries of the current SLAM/SfM literature, which is based on a static world
assumption, to obtain the 3D structure and camera trajectory w.r.t. both static and moving
objects in an environment.
1.1 Background and Motivation
SLAM involves simultaneously estimating locations of newly perceived landmarks and the
location of the robot itself while incrementally building a map of an unknown environment.
Over the last decade, SLAM has been one of the most active research fields in robotics, and
excellent results have been reported by many researchers [55, 2, 54], predominantly using laser
range-finder sensors to build 2-D maps of planar environments. Though accurate, laser range-
finders are expensive and bulky, so many researchers turned to cameras, which provide low-cost,
full 3-D and much richer, intuitive “human-like” information about the environment. So the
last decade also saw significant development in vision based SLAM systems [11, 33, 36, 21].
1.1.1 Why SLAM?
We humans have also been successfully practising SLAM, mostly unconsciously. While
observing the environment through our eyes, our brain combines the observations with significant
assumptions and prior knowledge, and reports our location and the nature of the surroundings,
both quantitatively and qualitatively. For instance, the localisation question might be answered
as being at home, or running down a street at about 10 km/h. The mapping result might be an
abstract conception of a floor plan, or a set of topological relationships between places of interest.
For the task of robotic or vehicular navigation, the most useful localisation and mapping output
is geometric in nature, where pose and structure estimates are represented in specific coordinate
systems and parameterisations. In other words, we need to understand what is really happening
in three dimensions, and a complete reconstruction of the 3D geometry of the scene becomes
almost inevitable.
Accurate and reliable SLAM is thus crucial for the autonomy of a robot in unknown surroundings.
Even if the autonomous navigation component of the SLAM platform is passive, as in the case
Figure 1.2 A comparison of reactive vs. model-based approaches for a simple robot cleaning task. The model-based approach provides far more advantages, and thus illustrates the use of SLAM, which allows for building such a model. Illustration courtesy ETH Zurich Robotics Lab.
of a hand-held camera or a tele-operated robot, or when it is merely following a human or
another robot, SLAM gives a richer quantitative knowledge of its motion and the environment.
The accrued map is also useful as a persistent description of the explored environment and an
efficient tool for higher-level tasks like path planning or navigation. To illustrate this
point, let us take the example of a case where a robot needs to clean an indoor room (see Fig. 1.2).
One approach is to ask the robot to start cleaning and move straight till it reaches an obstacle,
where it may choose a different direction. This is termed a reactive agent. But this is far
from efficient, and does not answer some important questions: Where is the robot at any
given time? Can we guarantee that the whole room will be cleaned? Which direction should it
move next? What should it do if it needs to recharge or refill? The other approach is a model based
approach, where the robot makes use of a model/representation of the room, which can be used
for efficient planning of the robot trajectory and also to find its way back to the recharging dock
when finished. Such a model of the environment can be obtained from SLAM.
1.1.2 Why Monocular Vision?
This is a very crucial and common question, because a significant portion of the effort in
unmanned vehicle systems has gone into using LIDARs, IMUs, GPS and stereo cameras. There
are some obvious disadvantages to this expensive hardware: systems with too many moving
parts, two cameras meaning double the image processing, and a fragile structure that means
endless calibrations. A stereo rig is also unable to accurately measure large distances, which
makes it impossible to consider remote objects: at large distances, the views of both cameras
are exactly the same, and a stereo bench provides exactly the same information as one single
camera. Performing SLAM with a single video camera, while an attractive prospect, adds its
own particular difficulties to the already considerable general challenges of the problem. However,
the added study and intelligence needed to make this feasible increases the robustness of the
system, for disaster cases when, of all the sensors, only a single camera is left working. I am
not against the usage of additional hardware,
but we should make an effort to extract as much information from mono vision as possible.
This way we will always get the advantages, and none of the drawbacks. Fig. 1.3 illustrates
some of the reasons in support of the study of monocular robot vision. Similar observations
have also been made in Chapter 1 of [49].
Figure 1.3 Why Monocular? Left: A snapshot of a car racing game which provides the player with a monocular image of a virtual reality. This gives us an idea of the scope of monocular systems. Middle: An autonomous vehicle from MIT fitted with more than five expensive LIDARs and IMUs. Right: An illustration of the need for monocular capability for robust robots!
1.1.3 Why Multibody Visual SLAM?
Vision based SLAM [11, 21, 33, 36, 32, 51] and SfM systems [17, 15] have been the subject of
much investigation and research. But almost all these approaches assume a static environment,
containing only rigid, non-moving objects. Moving objects are treated the same way as outliers
and filtered out using robust statistics like RANSAC [16]. Though this may be a feasible
solution in less dynamic environments, it soon fails as the environment becomes more
and more dynamic. Moreover, accounting for both the static and moving objects provides richer
information about the environment. For an autonomous robot or vehicle in a dynamic
environment, we will have to detect and keep track of the other moving objects, try
to identify or at least obtain some description of them, and maintain reasonably good information
on their positions and velocities. A robust solution to the SLAM problem in dynamic environments
will expand the potential for robotic applications, especially in applications which operate in close
proximity to human beings and other robots. As put by [60], robots will be able to work not
only for people but also with people.
1.2 Related work and Contributions
The last decade saw many developments in the “multibody” extension [40, 44, 58] to multi-
view geometry. These methods are natural generalizations of classical structure from motion
theory [17, 15] to the challenging case of dynamic scenes involving multiple rigid-body motions.
Thus, given a set of feature trajectories belonging to different independently moving bodies,
multibody SfM estimates the number of moving objects in the scene, clusters the trajectories
on the basis of motion, and then estimates the model, i.e. the relative camera pose and 3D structure,
w.r.t. each body. However, all of them have focused only on the theoretical and mathematical aspects
of the problem and have experimented on very short sequences, with either manually extracted
or noise-free feature trajectories. The high computation cost, frequent non-convergence of
the solutions, and highly demanding assumptions have all prevented them from being applied to
real-world sequences. Only recently did Ozden et al. [37] discuss some of the practical issues that
come up in multibody SfM. In contrast, we propose a multibody visual SLAM system, which is
a realtime, incremental adaptation of multibody SfM. The proposed framework still
offers the flexibility of choosing the objects that need to be reconstructed. Objects not chosen
for reconstruction are simply tracked. This is helpful, since certain applications may just need
to know the presence of moving objects rather than their full 3D structure, or there may not be
enough computational resources for realtime reconstruction of all moving objects in the scene.
The proposed system is a tightly coupled integration of the various modules of feature tracking,
motion segmentation, visual SLAM, and moving object tracking, exploring various feedback
paths between these modules. Fig. 1.4 illustrates the system pipeline and the outputs of the
different modules.
Reconstructing the 3D trajectory of a moving point from a monocular camera is ill-posed: it is
impossible without making some assumptions about the way it moves. However, object motions
are not random, and can be parameterised by different motion models. Typical assumptions
have been that a point moves along a line, a conic, or a plane [1], or more recently as a
linear combination of basis trajectories [38]. Target tracking from bearings-only sensors
(which is also the case for a monocular camera) has also been studied extensively in the “Bearings-
only Tracking” (BOT) literature [5, 24], where statistical filters seem to be the method of
choice. This same monocular observability problem gives rise to the so called “relative scale
problem” [12, 37] in multibody SfM. In other words, since each independently moving body has
its 3D structure and camera motion estimated in its own scale, we obtain a one-parameter
family of possible relative trajectories per moving object w.r.t. the static world. This needs to
be resolved for a realistic, unified reconstruction of the static and moving parts together. Ozden
et al. [12] exploited the increased coupling between camera and object translations that tends
to appear at false scales, and the resulting non-accidentalness of the object trajectory. However,
their approach is mostly batch processing, wherein trajectory data over time is reconstructed
for all possible scales, and the trajectory which, say, is most planar is chosen by virtue
of it being unlikely to occur accidentally. Instead, we take a different approach by making use
of a particle filter based bearing only tracker to estimate the correct scale and the associated
uncertainty (see Sec. 6.1).
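The idea behind such a particle filter can be illustrated with a minimal sketch. The code below is not the thesis implementation; it only shows the core mechanics under simplifying assumptions of our own: particles are 3D position hypotheses spread along the observed bearing ray (the one-parameter scale family), reweighted by the angular error of their predicted bearings as the camera moves, with systematic resampling when the effective sample size drops. The function name and noise parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def bot_update(particles, weights, bearing_obs, cam_pos, sigma=0.05):
    """One bearings-only tracking update: reweight 3D particle positions by how
    well their predicted bearing (seen from the current camera position) matches
    the observed unit bearing vector, then resample if weights have degenerated."""
    # Predicted unit bearings from the camera to each particle hypothesis.
    rays = particles - cam_pos
    rays = rays / np.linalg.norm(rays, axis=1, keepdims=True)
    # Gaussian likelihood on the angular error (sigma in radians).
    err = np.arccos(np.clip(rays @ bearing_obs, -1.0, 1.0))
    weights = weights * np.exp(-0.5 * (err / sigma) ** 2)
    weights = weights / weights.sum()
    # Systematic resampling when the effective sample size collapses.
    if 1.0 / (weights ** 2).sum() < 0.5 * len(weights):
        idx = rng.choice(len(weights), size=len(weights), p=weights)
        particles = particles[idx]
        weights = np.full(len(weights), 1.0 / len(weights))
    return particles, weights
```

Initializing the particles at uniformly spaced depths along the first observed ray encodes the unresolved scale; each subsequent camera translation then tightens the depth (and hence scale) posterior, which is the behaviour the cues of Sec. 6.1 accelerate.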
In realtime visual SLAM systems, moving objects have not yet been dealt with properly. In [62],
a 3D model based tracker runs in parallel with MonoSLAM [11] for tracking a previously
modeled moving object. This prevents the visual SLAM framework from incorporating moving
features lying on that moving object. But the proposed approach does not perform moving
object detection; so moving features, apart from those lying on the tracked moving object, can
still corrupt the SLAM estimation. Sola [49] does an observability analysis of detecting and
tracking moving objects with monocular vision. To bypass the observability issues with mono-
vision, he proposes a BiCamSLAM [49] solution with stereo cameras. A similar stereo solution
has also been proposed recently by [27]. The work by Migliore et al. [31] maintains two separate
filters: a MonoSLAM filter [11] with the static features and a BOT for the moving features.
All these methods [27, 49, 62] have a common framework in which a single filtering based
SLAM [11] on the static parts is combined with moving object tracking (MOT), which is often
termed SLAMMOT [27]. Unlike SLAMMOT, we adopt a multibody SfM kind of approach
where multiple moving objects are also fully reconstructed simultaneously, but our framework
still allows simple tracking if full 3D structure estimation of a moving object is not needed.
As concluded in Sec. 4.2 of [38] and also in the BOT literature, dynamic reconstruction with
mono-vision works well only when object and camera motion are non-correlated. To ensure this,
existing methods resorted to spiral camera motions [27], multiple photographers [38] or uncorrelated
camera-object motion [12]. We do not place any restrictive assumptions on the camera
motion or environment. Instead, we extract more information from the reconstruction pipeline in
the form of cues, which are then used to constrain the uncertainty in moving object reconstruction.
The framework thus works for the difficult case of smoothly moving cameras, wherein object
and camera motion are highly correlated.
1.3 Thesis Overview and layout
1.3.1 Problem Statement
We propose a realtime, incremental multibody visual SLAM algorithm. The final system
integrates feature tracking, motion segmentation, visual SLAM and moving object tracking.
We introduce several feedback paths among these modules, which enable them to mutually
benefit each other. The input to the system is the image stream from a single moving monocular
camera, and we need to produce the following outputs in realtime:
Figure 1.4 The input to our system is a monocular image sequence. The various modules of feature tracking, motion segmentation, visual SLAM and moving object tracking are interleaved and run online. The final result is an integrated dynamic map of the scene including the 3D structure and 3D trajectory of the camera, static world and moving objects.
• Moving Object Detection.
• 3D reconstruction of static world points.
• 3D reconstruction of moving objects.
• 6DOF Camera trajectory in 3D.
• 6DOF Moving object trajectory in 3D.
• Integrated Dynamic 3D map of the environment.
1.3.2 System Overview
Fig. 1.4 illustrates the system pipeline and the outputs of the different modules. The feature
tracking module tracks existing feature points, while new features are instantiated. The purpose
of the motion segmentation module is to segment these feature tracks belonging to different
motion bodies, and to maintain this segmentation as new frames arrive. In the initialization
step, an algebraic multibody motion segmentation algorithm is used to segment the scene into
multiple rigidly moving objects. A decision is made as to which objects will undergo
the full 3D structure and camera motion estimation. The background object is always chosen
to undergo the full 3D reconstruction and camera motion estimation process. Other objects
may either undergo full SfM estimation or just be simply tracked, depending on their suitability
for SfM estimation or the application demand. On the objects chosen for reconstruction, the standard
monocular visual SLAM pipeline is used to obtain the 3D structure and camera pose relative to
that object. For these objects, we compute a probabilistic likelihood that a feature is moving
along with or independently of that object. These probabilities are recursively updated as the
Figure 1.5 The input to our system is a monocular image sequence. The output is a realtime multibody reconstruction of the scene. This is a snapshot of the video results attached with this thesis.
features are tracked. The probabilities also take care of the uncertainty in the pose estimation by the
visual SLAM module. Features with a low likelihood of fitting one model are either mismatched
features arising due to tracking error, or features belonging to some other reconstructed
object or to one of the unmodeled independently moving objects. For the unmodeled moving
objects, we use spatial proximity and motion coherence to cluster the residual feature tracks
into independently moving entities.
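As a rough illustration of such a recursive likelihood (not the exact formulation developed in Chap. 3), a per-feature probability of independent motion can be maintained with a Bayes update over a per-frame geometric residual, e.g. the feature's distance to its epipolar line under the body's motion model. The Gaussian static model, the flat moving model, and all parameter values below are illustrative assumptions of ours:

```python
import numpy as np

def update_motion_probability(p_prev, residual, sigma_static=1.0, scale_moving=10.0):
    """Recursive Bayes update of the probability that a feature moves
    independently of a given body. `residual` is the feature's geometric error
    (e.g. pixel distance to its epipolar line) under that body's motion model.
    Static features are modelled with a zero-mean Gaussian residual; independently
    moving features with a broad, near-uniform residual density."""
    # Likelihoods of the observed residual under each hypothesis.
    l_static = np.exp(-0.5 * (residual / sigma_static) ** 2) \
        / (np.sqrt(2 * np.pi) * sigma_static)
    l_moving = 1.0 / scale_moving   # flat density over plausible residuals
    num = l_moving * p_prev
    return num / (num + l_static * (1.0 - p_prev))
```

Starting from an uninformative prior of 0.5, a run of small residuals drives the probability towards 0 (the feature is consistent with the body) and a run of large residuals drives it towards 1, so a single noisy frame cannot flip the decision.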
The individual modules of feature tracking, motion segmentation and visual SLAM are
tightly coupled, and various feedback paths between them are explored, which benefit each
other. The motion model of a reconstructed object estimated by the visual SLAM module
helps in improving the feature tracking. Relative camera pose estimates from SLAM are used
by the motion segmentation module to compute probabilistic model-fitness. The uncertainty in
the camera pose estimate is also propagated into this computation, so as to yield robust model-
fitness scores. The computation of the 3D structure also helps in setting a tighter bound in
the geometric constraints, which results in more accurate independent motion detection. These
results from the motion segmentation are fed back to the visual SLAM module. The motion
segmentation prevents independent motion from corrupting the structure and motion estimation
by the visual SLAM module. This also ensures fewer outliers in the reconstruction
process of a particular object, so we need fewer RANSAC iterations [16], resulting
in improved speed in the visual SLAM module. We then describe motion cues coming from the SfM
estimate of a moving object and several geometric cues imposing constraints on the possible
depths and velocities, made possible by the reconstruction of the static world. Integration of multiple
cues immensely reduces the space of possible trajectories and provides an online solution for
the relative scale problem. This enables a unified representation of the scene containing the 3D
structure of the static world and moving objects, and the 3D trajectory of the camera and moving
objects, along with the associated uncertainty.
1.3.3 Thesis layout
This thesis can be visualized as the amalgamation of four different modules, namely
• Feature Tracking
• Motion detection and segmentation
• Visual SLAM
• Moving Object Tracking
In Chap. 2, we briefly present the feature tracking methodology tested and used for the
system. Then, we discuss the multibody motion segmentation and detection framework in
Chap. 3. We explain the geometric constraints used for detecting independent motion and also
present a robust probability framework for the same. This is followed by the discussion of our
visual SLAM framework in Chap. 4, which uses efficient Lie group theory. We present the particle
filter based moving object tracking in Chap. 5. We also present various cues from the visual
SLAM module and the process of integrating them into the tracking framework. Finally, in Chap. 6,
we put everything together to build a unified map of the dynamic scene. We present the relative
scale problem, and how the cues from the visual SLAM and moving object tracking modules are
used to overcome it. In this chapter, we also show the final results of the proposed
multibody reconstruction system on multiple real image datasets.
Chapter 2
Feature Tracking
The feature tracking module tracks existing feature points, while new features are instantiated.
It is an important sub-module that needs to be improved for multibody visual SLAM to
take place. Contrary to conventional SLAM, where the features belonging to moving objects
are not important, we need to pay extra attention to feature tracking for multibody SLAM:
we should be able to get feature tracks on the moving bodies also. This is challenging, as
different bodies move at different speeds. Also, 3D reconstruction is only possible when there
are sufficient feature tracks for a particular body. However, relaxing the feature matching
threshold also invites more mismatches. This increase in outliers can even break the robust
motion segmentation, or lead it to wrong convergence. The tracking module is interleaved with
motion segmentation and visual SLAM, which allows it to benefit from these modules.
A short overview of the feature tracking methodology adopted by us is as follows. In each
image, a number of salient features are detected while ensuring the features are sufficiently
spread all over the image. Contrary to conventional visual SLAM, new features are added
almost every frame. However, only a subset of these, detected on certain keyframes, are made
into 3D points. The extra set of tracks helps in detecting independent motion. In order
to preserve feature tracks belonging to independent motions, we do not perform restrictive
matching initially. Instead, the feature matching is performed in two stages. In the first stage,
features are matched over a large range so as to allow matches belonging to moving objects.
A preliminary segmentation and motion estimate is made using this coarse matching. Finally
when the camera motion estimate is available, we resort to guided matching, which yields a
larger number of matches. In this stage, we make full use of the knowledge of camera motion
while matching features.
2.1 Feature Detectors and Descriptors
Feature tracking is traditionally achieved by a combination of a feature detector and a descriptor.
Feature detectors identify salient points, also called keypoints, in an image, while descriptors
compute a numerical description of these, usually as a signature vector describing the local
neighborhood around the keypoint. With each image we independently detect features and
match them across images with the help of the descriptors computed over these features. The
search space can be restricted to features inside a window around the original feature location
in the second image. Another (less popular nowadays) approach for obtaining feature tracks is
to track each feature within a small window around its original location in the image, as in
KLT [56, 46] or block matching based approaches. Feature detectors normally look for distinct
salient locations which can be localized precisely and are repeatable along the sequence. There
are several alternatives like Harris corners, Good Features to Track [46], SIFT [29], SURF [4]
and FAST corners [41]. FAST [41] learns a decision tree which is turned into C code to yield
an extremely fast corner detector with good repeatability. The downside is that features
cluster at edges. One important post-processing step after feature detection is to ensure a
uniform spread of the features across the whole image, as required for robust relative motion
computation. This can be enforced locally in FAST corners using non-maximal suppression [41],
and globally using a simple quadtree. With a quadtree representation of the image, each cell
is required to have a minimum number of features, which achieves the required spread of
features across the image.
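As an illustrative sketch of this spreading step (simplified to a fixed grid standing in for the quadtree; all names are hypothetical):

```python
# Hypothetical sketch: enforce a uniform spread of features by bucketing them
# into a fixed grid (a stand-in for the quadtree described above) and keeping
# only the strongest corners in each cell.

def spread_features(features, img_w, img_h, grid=4, max_per_cell=2):
    """features: list of (x, y, score); returns a spatially spread subset."""
    cells = {}
    for x, y, score in features:
        cx = min(int(x * grid / img_w), grid - 1)
        cy = min(int(y * grid / img_h), grid - 1)
        cells.setdefault((cx, cy), []).append((x, y, score))
    kept = []
    for pts in cells.values():
        pts.sort(key=lambda p: p[2], reverse=True)  # strongest first
        kept.extend(pts[:max_per_cell])
    return kept
```

A real quadtree would instead subdivide cells adaptively until each leaf holds the required minimum number of features.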
Descriptors like SIFT [29], SURF [4] and BRIEF [8] produce a signature which is quite
invariant to viewing direction, scale (distance), and minor illumination changes. Another choice,
specifically useful for visual SLAM systems, is to use the intensity information of a simple patch
of 8×8 pixels around the keypoint, warping it to account for the viewpoint change between
the patch's first observation and the current one, using the predicted estimate of the camera
orientation and position. This is not invariant to scale, the effect of which can be mitigated by
computing patches at different pyramidal levels of the image. The same is also effective for KLT
based tracking. This whole process is explained in more detail in Klein et al. [21]. The choice of
the detector + descriptor scheme is quite crucial for proper feature tracking. Often the best
choice varies for each dataset, and depends on a number of factors like frame rate (baseline),
the number of features required, image quality and the computational resources available. Some
good choices in our opinion are SIFT+SIFT, FAST+BRIEF, FAST + “warped image patch” and
FAST/SURF+SURF. We therefore alternate between the detector + descriptor combinations
mentioned above depending on the current dataset. However, in most of our experiments,
we have found FAST corners along with the “warped image patch” to be the best combination.
2.2 Feature Matching Constraints
Using descriptors or the warped image patches, the matching procedure boils down to a
nearest-neighbor search, so a correct correspondence will be very close in descriptor space.
For warped image patches, we compute a zero-mean SSD score, which offers some resilience to
lighting changes. For standard descriptors like SIFT and SURF, a Euclidean distance metric
works quite well. It is also common to use other metrics like the Hamming distance, as in [8].
Interleaving the feature tracking module with the visual SLAM module provides a couple of
other constraints which can be used to significantly improve the feature matching. They are
discussed next:
a) Adaptive Search Window: Between a pair of images, features are matched within a
fixed distance (window) from their location in one image. The size and shape of this window is
decided adaptively, based on the past motion of that particular body. For 3D points whose
depth has been computed by the visual SLAM module, the 1D epipolar search is reduced to a
region around the projection of the 3D point onto the image under the predicted camera pose.
b) Warp matrix for patch: When using the FAST + “warped image patch” scheme, we
can make use of the camera pose estimate from the visual SLAM module to apply an affine
warp to the image patches, maintaining view invariance between the patch's first and current
observations. If the depth of a patch is unknown, only a rotational warp is applied. For the
image patches of 3D points which have been triangulated, a full affine warp is performed.
This process is exactly the same as the patch search procedure in Klein et al. [21].
c) Occlusion Constraint: Motion segmentation gives rough occlusion information, i.e., it
indicates whether some foreground moving object is occluding another body. This information
helps in data association, particularly for features belonging to a background body which are
predicted to lie inside the convex hull created from the feature points of a foreground moving
object. These occluded features are not associated, and are kept until they emerge from
occlusion.
d) Backward Match and Unicity Constraint: When a match is found, we try to
match that feature backward in the original image. Only matches in which each point is the
other's strongest match are kept. Enforcing the unicity constraint amounts to keeping only the
single strongest of several matches for a single feature in the other image.
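The backward-match and unicity checks can be sketched as follows (descriptors reduced to plain vectors and zero-mean SSD as the distance; names hypothetical):

```python
# Hypothetical sketch of mutual-best matching with a zero-mean SSD score.
# Only pairs where each point is the other's strongest match survive, which
# enforces the backward-match and unicity constraints together.

def zssd(a, b):
    """Zero-mean SSD between two equal-length intensity vectors."""
    ma = sum(a) / len(a)
    mb = sum(b) / len(b)
    return sum(((x - ma) - (y - mb)) ** 2 for x, y in zip(a, b))

def mutual_best_matches(desc1, desc2):
    """Return index pairs (i, j) where i's best match is j and vice versa."""
    best12 = [min(range(len(desc2)), key=lambda j: zssd(d, desc2[j]))
              for d in desc1]
    best21 = [min(range(len(desc1)), key=lambda i: zssd(desc1[i], d))
              for d in desc2]
    return [(i, j) for i, j in enumerate(best12) if best21[j] == i]
```

The mean subtraction in `zssd` is what gives the score its resilience to uniform lighting changes.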
2.3 Dense Feature Matching
Apart from the above sparse local feature based strategies, we can also obtain feature
correspondences between images through dense, global energy based methods. This is
becoming more popular [34, 57] and feasible these days because of a marked improvement
in state-of-the-art optical flow methods [7, 53, 63], both in terms of speed and accuracy. Due
to the filling-in effect, optical flow provides a dense flow field, and thus a huge number of
correspondences, which can increase the robustness of the motion estimation process for scenes
with many texture-less surfaces, such as indoors with big white walls. Newcombe et al. [34]
use a GPU-based implementation of optical flow [63] for dense realtime 3D reconstruction.
Recently, [57] evaluated the use of dense optical flow against standard sparse feature tracking
for the computation of the fundamental matrix.
For our system of multibody motion segmentation and reconstruction, the extra dense
correspondences available can be used for obtaining a dense segmentation of the moving objects
in the scene. A subset of these dense correspondences, selected on the basis of a distinctiveness
score, can be used by the motion estimation module. This will also help the reconstruction of
small moving objects, which suffer from a dearth of proper features.
Chapter 3
Moving Object Detection and Segmentation
In this chapter, we present an incremental motion segmentation framework that can segment
feature points belonging to different motions and maintain the segmentation with time. The
solution to the moving object detection and segmentation problem will act as a bridge between
the static SLAM or SfM and its counterpart for dynamic environments. But, motion detection
from a freely moving monocular camera is an ill-posed problem and a difficult task. The
moving camera causes every pixel to appear moving. The apparent pixel motion of points
is a combined effect of the camera motion, independent object motion, scene structure and
camera perspective effects. Different views resulting from the camera motion are connected by
a number of multiview geometric constraints, which can be used for the motion detection task.
Points inconsistent with these constraints can be labeled as moving or as outliers.
Apart from the multibody motion segmentation framework, we also present an algorithm
for independently moving object detection. This is a special case of the multibody segmentation
framework, wherein only the static background is used for structure and motion estimation;
anything moving independently is detected as a moving object. We also present the case
when a monocular camera is mounted on a robot and its odometry, rather than the visual
SLAM routine, is used for estimating egomotion.
3.1 Related Works
The problem of motion detection and segmentation from a moving camera has been a very
active research area in the computer vision community. The multiview geometric constraints
used for motion detection can be loosely divided into four categories. The first category of
methods relies on estimating a global parametric motion model of the background. These
methods [20, 39, 61] compensate for camera motion using a 2D homography or an affine
motion model; pixels consistent with the estimated model are assumed to be
background and outliers to the model are defined as moving regions. However, these models
are approximations which hold only for certain restricted cases of camera motion and scene
structure.
The problems with 2D homography methods led to plane-parallax based constraints [18,
43, 66]. The “planar-parallax” constraints represent the scene structure by a residual
displacement field, termed parallax, with respect to a 3D reference plane in the scene. The
plane-parallax constraint was designed to detect residual motion as an after-step of the 2D
homography methods. These methods are designed to detect motion regions when dense
correspondences between small-baseline camera motions are available. Moreover, all the
planar-parallax methods are ineffective when the scene cannot be approximated by a plane.
Though the planar-parallax decomposition can be used for egomotion estimation [19] and
structure recovery [42], the traditional multi-view geometry constraints, like the epipolar
constraint in two views or trilinear constraints in three views and their extensions to N views,
have proved to be much more effective in scene understanding, as in SfM or visual SLAM.
These constraints are well understood and are now textbook material [15, 30, 17].
3.2 Requirements
A robust moving object detection and segmentation algorithm is fundamental for efficient
SLAM in a dynamic environment. A successful algorithm for motion detection and segmentation
to aid visual SLAM should ideally satisfy the following requirements:
a) Incremental solution: We need an incremental solution, where the moving objects are
detected and segmented as early as possible, while the segmentation hypothesis gets updated
with new frames. This is fundamental to visual SLAM, which demands an incremental solution,
as future frames are not available. We use 2-view motion segmentation, which is then
incrementally extended to multiple views as new frames arrive. In contrast to batch processing
over a fixed number of frames, the proposed approach allows detection of objects moving
at different relative speeds. The incremental nature of the solution also allows the individual
modules of motion segmentation, feature tracking and visual SLAM to be interleaved, which in
turn allows them to benefit from one another.
b) No restrictive assumptions: Many moving object detection methods make restrictive
assumptions on scene structure, camera models or camera motion. For example, the methods
of [18, 28, 66] assume a planar scene, whereas [39, 20, 61] consider an affine camera model.
However, these assumptions are often invalid in real environments. The proposed approach
makes no assumption about the scene structure and considers a full perspective camera model.
c) Seamless integration with existing VSLAM/SfM solutions: The existing methods
for moving object detection avoid computation of scene structure or camera egomotion.
For example, the parallax rigidity constraint of [18] or the modeling of the static background
as in [20] all perform computations which cannot be reused for scene structure or camera
egomotion estimation. The proposed motion segmentation approach makes use of epipolar
geometry, which forms the backbone of standard visual SLAM methods, and thus avoids extra
computation and can be easily integrated with existing SfM solutions.
d) Ability to handle both 2D points and 3D points whose depth is known: There
are two kinds of feature points in the system: 2D points which are yet to be triangulated, and
3D points which have been triangulated by the visual SLAM module. The knowledge of depth
adds additional constraints to be used by the motion segmentation module.
e) Ability to handle degenerate motions: Detecting independently moving objects
becomes difficult when the camera motion and the motion of the moving object are along the
same direction. The features belonging to the moving object then move along the epipolar line,
and thus the epipolar constraint is not able to detect them. This set of motions, called
degenerate motions [66], is very common in the real world, e.g., a camera following another
car moving along the road, or a robot-mounted camera following a moving person. Standard
visual SLAM systems detect outliers with the help of the reprojection error, which is roughly
equivalent to a measure of epipolar distance, so they are not able to detect degenerate motions.
3.3 Overview
The input to the motion segmentation framework consists of the feature tracks from the
feature tracking module, the camera's relative motion with reference to each reconstructed
body from the visual SLAM module, and the previous segmentation. The motion segmentation
module needs to verify the existing segmentation, and also to associate new features with one
of the moving objects. As new frames arrive, the number of independently moving objects
changes: objects enter or leave the scene, part of an existing object splits off to move
independently, or, in the reverse case, two independent motions merge. The motion segmentation
framework therefore needs to detect changes in the number of moving objects and update
accordingly.

The task of the motion segmentation module is one of model selection, so as to assign these
feature tracks to one of the reconstructed bodies or to some unmodeled independent motion.
Efficient geometric constraints are used to form a probabilistic fitness score for each
reconstructed object. With each new frame, existing features are tested for model fitness, and
unexplained features are assigned to one of the independently moving objects. But before all
this, the motion segmentation must be initialized, which is described next.
3.4 Initialization of Motion Segmentation
The initialization routine for motion segmentation and visual SLAM is somewhat different
from the rest of the algorithm. We make use of the algebraic two-view multibody motion
segmentation algorithm of RAS [40] to segment the input set of feature trajectories into
multiple moving objects. The reason behind the choice of [40] among other algorithms is its
direct, non-iterative nature and fast computation. This segmentation provides the system with
the choice of motion bodies for reconstruction. For the segment chosen for reconstruction, an
initial 3D structure and camera motion is computed via epipolar geometry estimation, as part
of the static-scene visual SLAM initialization routine.
3.5 Geometric Constraints
Between any two frames, the camera motion with respect to each reconstructed body is
obtained from the visual SLAM module. The geometric constraints are then estimated to detect
independent motion with respect to that reconstructed body. Thus, with respect to the static
background, all moving objects should be detected as independent motions.
For a camera moving relative to a scene, the fundamental matrix is given by F = [Kt]× KRK^{-1},
where K is the intrinsic matrix of the camera and R, t are the rotation and translation of the
camera between the two views.
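As a numerical sketch of this formula (pure-Python 3×3 helpers; all names hypothetical), one can verify that a static point projected into both views satisfies the epipolar constraint x₂ᵀF x₁ ≈ 0:

```python
# Sketch: build F = [Kt]x K R K^{-1} for a simple pinhole K and check the
# epipolar constraint on a static point. Helper names are hypothetical.

def mat3_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def mat3_vec(A, v):
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]

def skew(t):  # [t]x, the cross-product matrix
    return [[0, -t[2], t[1]], [t[2], 0, -t[0]], [-t[1], t[0], 0]]

def pinhole(f, cx, cy):  # K and its closed-form inverse
    K = [[f, 0, cx], [0, f, cy], [0, 0, 1]]
    K_inv = [[1 / f, 0, -cx / f], [0, 1 / f, -cy / f], [0, 0, 1]]
    return K, K_inv

def fundamental(K, K_inv, R, t):
    return mat3_mul(skew(mat3_vec(K, t)), mat3_mul(K, mat3_mul(R, K_inv)))
```

For example, with R = I, t = (1, 0, 0) and a static point X = (0.5, 0.2, 4), the projections x₁ = KX and x₂ = K(X + t) give x₂ᵀF x₁ ≈ 0.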
3.5.1 Epipolar Constraint
The epipolar constraint is the most commonly used constraint connecting two views, and is
best explained through the fundamental matrix [17]. The fundamental matrix is a relationship
between any two images of the same scene that constrains where the projections of scene
points can occur in both images. It is a 3×3 matrix of rank 2 that encapsulates the camera's
intrinsic parameters and the relative pose of the two cameras. The reprojection error, or its
first-order approximation called the Sampson error, both based on the epipolar constraint, are
used throughout the structure and motion estimation by the visual SLAM module. Basically,
they measure how far a feature lies from the epipolar line induced by the corresponding feature
in the other view. Though these are the gold standard cost functions for 3D reconstruction,
they are not good enough for independent motion detection. If a 3D point moves along the
epipolar plane formed by the two views, its projection in the image moves along the epipolar
line. Thus, in spite of moving independently, it still satisfies the epipolar constraint. This is
depicted in Fig. 3.1.
Figure 3.1 Left: The world point P moves non-degenerately to P′, and hence x′, the image of
P′, does not lie on the epipolar line corresponding to x. Right: The point P moves degenerately
in the epipolar plane to P′. Hence, despite moving, its image point lies on the epipolar line
corresponding to the image of P.
Let p_n and p_{n+1} be the images of some 3D point X in a pair of images I_n, I_{n+1}
obtained at time instants t_n and t_{n+1}. Let F_{n+1,n} be the fundamental matrix relating
the two images I_n, I_{n+1}, with I_n as the reference view. The epipolar constraint is then
represented by p_{n+1}^T F_{n+1,n} p_n = 0 [17]. The epipolar line in I_{n+1} corresponding
to p_n is l_{n+1} = F_{n+1,n} p_n. If the 3D point is static, then p_{n+1} should ideally lie
on l_{n+1}. But if the point is not static, the perpendicular distance d_epi from p_{n+1} to
the epipolar line l_{n+1} is a measure of how much the point deviates from the epipolar line.
If the coefficients of the line vector l_{n+1} are normalized, then d_epi = |l_{n+1} · p_{n+1}|.
However, when a 3D point moves along the epipolar plane formed by the two camera centers
and the point P itself, the image of P still lies on the epipolar line. So the epipolar constraint
is not sufficient for degenerate motion. Fig. 3.1 shows the epipolar geometry for non-degenerate
and degenerate motions.
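The distance d_epi can be sketched directly (hypothetical helper; the matrix and points are given as plain nested lists):

```python
# Sketch of d_epi: distance of the matched point p2 from the epipolar line
# l = F p1, after normalising the line coefficients (a, b).
import math

def epipolar_distance(F, p1, p2):
    """F: 3x3 fundamental matrix; p1, p2: homogeneous points (x, y, 1)."""
    l = [sum(F[i][k] * p1[k] for k in range(3)) for i in range(3)]  # l = F p1
    return abs(sum(l[i] * p2[i] for i in range(3))) / math.hypot(l[0], l[1])
```

A static, non-degenerate correspondence yields d_epi near zero, while a point moving off the epipolar plane yields a large d_epi.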
This kind of degenerate motion is quite common in real-world scenarios, e.g. when the camera
and an object move in the same direction, as with a camera mounted in a car moving along a
road, or a camera-mounted robot following behind a moving person. To detect degenerate
motion, we make use of the knowledge of camera motion and 3D structure to estimate a bound
on the position of the feature along the epipolar line. We describe this as the Flow Vector
Bound (FVB) constraint.
3.5.2 Flow Vector Bound (FVB) Constraint
For a general camera motion involving both rotation R and translation t, the effect of
rotation can be compensated by applying a projective transformation to the first image. This is
achieved by multiplying the feature points in view 1 with the infinite homography H = KRK^{-1} [17].
The resulting feature flow vector, connecting the feature position in view 2 to the rotation-
compensated feature position in view 1, should lie along the epipolar line. Now assume that
our camera translates by t, and let p_n, p_{n+1} be the images of a static point X. Here p_n is
normalized as p_n = (u, v, 1)^T. Attaching the world frame to the camera center of the first
view, the camera matrices for the two views are K[I|0] and K[I|t]. Also, if z is the depth of the
scene point X, then the inhomogeneous coordinates of X are zK^{-1}p_n. Now the image of X
in the second view is p_{n+1} = K[I|t]X. Solving, we get [17]

p_{n+1} = p_n + Kt/z    (3.1)
Equation 3.1 describes the movement of the feature point in the image. Starting at the point
p_n in I_n, it moves along the line defined by p_n and the epipole e_{n+1} = Kt. The extent of
the movement depends on the translation t and the inverse depth 1/z. From eq. 3.1, if we know
the depth z of a scene point, we can predict the position of its image along the epipolar line. In
the absence of any depth information, we set a plausible bound on the depth of a scene point
as viewed from the camera. Let z_max and z_min be the upper and lower bounds on the
possible depth of a scene point. We then find the image displacements along the epipolar line,
d_min and d_max, corresponding to z_max and z_min respectively. If the flow vector of a
feature does not lie between d_min and d_max, it is more likely to be an image of an
independent motion.
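As a hedged sketch of this bound (assuming the translation has no forward component after rotation compensation, so the displacement magnitude of eq. 3.1 is simply |Kt|/z; names hypothetical):

```python
# Hypothetical sketch: map depth bounds [z_min, z_max] to the image-displacement
# bounds [d_min, d_max] of the FVB constraint, assuming a purely sideways
# translation t (no forward component) after rotation compensation.
import math

def fvb_bounds(K, t, z_min, z_max):
    Kt = [sum(K[i][k] * t[k] for k in range(3)) for i in range(3)]
    mag = math.hypot(Kt[0], Kt[1])   # displacement magnitude from eq. 3.1 is |Kt|/z
    return mag / z_max, mag / z_min  # (d_min, d_max)

def violates_fvb(flow_mag, d_min, d_max):
    """A flow vector outside the bound hints at independent motion."""
    return not (d_min <= flow_mag <= d_max)
```

Note the inversion: the far bound z_max gives the small displacement d_min, and the near bound z_min gives the large displacement d_max.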
The structure estimation from the visual SLAM module helps in reducing the possible bound
on depth. Instead of setting z_max to infinity, the known depth of the background enables
setting a tighter bound, and thus better detection of degenerate motion. The depth bound is
adjusted on the basis of the depth distribution along the particular frustum.
The probability of satisfying the flow vector bound constraint, P(FVB), can be computed as

P(FVB) = 1 / (1 + ((FV − d_mean) / d_range)^{2β})    (3.2)

Here d_mean = (d_min + d_max)/2 and d_range = (d_max − d_min)/2, where d_min and d_max
are the bounds on the image displacement. The distribution function is similar to a Butterworth
bandpass filter. P(FVB) has a high value if the feature lies inside the bound given by the FVB
constraint, and the probability falls rapidly as the feature moves away from the bound. The
larger the value of β, the more rapidly it falls. In our implementation, we use β = 10.
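Eq. 3.2 can be sketched directly (hypothetical function name):

```python
# Sketch of eq. 3.2: a Butterworth-style score that stays near 1 while the flow
# magnitude lies inside [d_min, d_max] and falls off rapidly outside.

def p_fvb(flow_mag, d_min, d_max, beta=10):
    d_mean = 0.5 * (d_min + d_max)
    d_range = 0.5 * (d_max - d_min)
    return 1.0 / (1.0 + ((flow_mag - d_mean) / d_range) ** (2 * beta))
```

At the bound itself the score is exactly 0.5, and one further step of d_range beyond it the score is already below 10⁻⁴ for β = 10.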
3.6 Independent Motion Probability
In this section we describe a recursive formulation based on a Bayes filter to derive the
probability of the projected image point of a world point being classified as stationary or
dynamic. The relative pose estimation noise and the image pixel noise are bundled into a
Gaussian probability distribution over the epipolar lines, as derived in [17], denoted
EL_i = N(μ_li, Σ_li), where EL_i refers to the set of epipolar lines corresponding to image
point i, and N(μ_li, Σ_li) refers to the standard Gaussian probability distribution over this set.

Let p_n^i be the i-th point in image I_n. The probability that p_n^i is classified as stationary
is denoted as P(p_n^i | I_n, I_{n-1}) = P_{n,s}(p^i), or P_{n,s}^i in short, with the suffix s
signifying static. Then, with the Markov approximation, the recursive probability update of a
point being stationary given a set of images can be derived as

P(p_n^i | I_{n+1}, I_n, I_{n-1}) = η_s^i P_{n+1,s}^i P_{n,s}^i    (3.3)

Here η_s^i is a normalization constant that ensures the probabilities sum to one.
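A minimal sketch of the update in eq. 3.3, normalised against the competing dynamic hypothesis (names hypothetical):

```python
# Sketch of eq. 3.3: the prior belief that a feature is static is multiplied by
# the per-frame static likelihood and renormalised against the dynamic branch.

def bayes_update(prior_static, lik_static, lik_dynamic):
    s = prior_static * lik_static
    d = (1.0 - prior_static) * lik_dynamic
    return s / (s + d)  # the constant eta makes the two hypotheses sum to one
```

Repeated frames with a low static likelihood quickly drive the belief down, which is the behaviour exploited in Sec. 3.8.3.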
The term P_{n,s}^i can be modeled to incorporate the distribution of the epipolar lines EL_i.
Given an image point p_{n-1}^i in I_{n-1} and its corresponding point p_n^i in I_n, the
epipolar line that passes through p_n^i is determined as l_n^i = e_n × p_n^i. The probability
distribution of the feature point being stationary or moving due to the epipolar constraint is
defined as

P_{EP,s}^i = (1/√(2π|Σ_l|)) exp(−(1/2)(l_n^i − μ_n^i)^T Σ_l^{-1} (l_n^i − μ_n^i))    (3.4)
However, this does not take into account the misclassification arising due to the degenerate
motion explained in the previous sections. To overcome this, the eventual probability is fused
as a combination of the epipolar and flow vector bound constraints:

P_{n,s}^i = α · P_{EP,s}^i + (1 − α) · P_{FVB,s}^i    (3.5)

where α balances the weight of each constraint. A χ² test is performed to detect whether the
epipolar line l_n^i due to the image point satisfies the epipolar constraint. When the epipolar
constraint is not satisfied, α takes a value close to 1, rendering the FVB probability
inconsequential. As the epipolar line l_n^i begins indicating a strong likelihood of satisfying the
epipolar constraint, the role of the FVB constraint is given more importance, which helps
detect the degenerate cases.
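A hedged sketch of the fusion in eq. 3.5; the gate value 3.84 is the 95% χ² threshold for one degree of freedom, while the two α weights below are purely illustrative (the thesis does not specify their values):

```python
# Hypothetical sketch of eq. 3.5: when the epipolar residual fails a chi-square
# gate, alpha -> 1 and the epipolar term dominates; otherwise the FVB term is
# given more weight. The 0.9 / 0.3 weights are illustrative, not from the thesis.

def fused_static_prob(p_ep, p_fvb, epi_residual_sq, gate=3.84):
    alpha = 0.9 if epi_residual_sq > gate else 0.3
    return alpha * p_ep + (1.0 - alpha) * p_fvb
```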
An analogous set of equations characterizes the probability of an image point being dynamic;
these are not delineated here for brevity. In our implementation, the envelope of epipolar
lines [17] is generated by a set of F matrices distributed around the mean R, t transformation
between two frames, as estimated by the visual SLAM module. Hence a set of epipolar lines
corresponding to those matrices is generated and characterized by the sample set
EL_ss^i = (l_1^i, l_2^i, ..., l_q^i) and the associated probability set
P_EL = (w_{l_1^i}, w_{l_2^i}, ..., w_{l_q^i}), where each w_{l_j^i} is the probability of that line
belonging to the sample set EL_ss^i, computed through the usual Gaussian procedures. Then
the probability that an image point p_n^i is static is given by:

P_{n,s}^i = Σ_{j=1}^{q} [α_j · P_{EP,s}^{l_j^i}(p_n^i) + (1 − α_j) · P_{FVB,s}^{l_j^i}(p_n^i)] · w_{l_j^i}    (3.6)

where P_{EP,s}^{l_j^i} and P_{FVB,s}^{l_j^i} are the probabilities of the point being stationary
due to the respective constraints, computed with respect to the epipolar line l_j^i.
3.7 Clustering Unmodeled Motions
Features with high probabilities of being dynamic are either outliers or belong to potential
moving objects. Since these objects are often small and highly dynamic, they are very hard to
reconstruct. Instead, we adopt a simple move-in-unison model for them. Spatial proximity and
motion coherence are used to cluster these feature tracks into independently moving entities.
For motion coherence, we use the heuristic that the variance of the distances between features
belonging to the same object should change slowly in comparison.
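The move-in-unison grouping can be sketched as a connected-components pass over dynamic features that are spatially close and have coherent flow (thresholds and names illustrative):

```python
# Hypothetical sketch of the move-in-unison clustering: dynamic features are
# linked when they are spatially close and their flow vectors agree; connected
# components (via union-find) become candidate moving objects.
import math

def cluster_dynamic(points, flows, dist_thr=30.0, flow_thr=2.0):
    n = len(points)
    parent = list(range(n))

    def find(i):  # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            close = math.dist(points[i], points[j]) < dist_thr
            coherent = math.dist(flows[i], flows[j]) < flow_thr
            if close and coherent:
                parent[find(i)] = find(j)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

The flow-coherence check is what keeps two spatially close but independently moving objects in separate clusters.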
3.8 Computation of Fundamental Matrix from Odometry
If our camera is mounted on a robot, we can make use of the readily available robot odometry
to get the relative rotation and translation of the camera between a pair of captured images.
It is also common to fuse this with the egomotion information obtained from the visual SLAM
module for better accuracy. In our experiments with indoor robots (Pioneer P3DX), we found
robot odometry alone to be good enough for our task. Also, since we only make use of the
relative pose information between a pair of views, the incrementally growing odometry error
does not creep into the system. The following two sections discuss the main issues that come
up when the camera motion is estimated from odometry.
3.8.1 Synchronization
To correctly estimate the camera motion between a pair of frames, it is important to have
the correct odometry information of the robot at the instant when a frame is grabbed by the
camera. However, the images and odometry information are obtained from independent channels
and are not synchronized with each other. For FireWire cameras, an accurate timestamp for
each captured image can be easily obtained. Odometry information from the robot is stored
against time; by interpolating between these records, we can find where the robot was at any
particular point in time. Thus synchronization is achieved by interpolating the robot odometry
to the timestamps of the images obtained from the camera.
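The interpolation step can be sketched as follows (odometry samples as (t, x, y, θ) tuples; names hypothetical):

```python
# Sketch of timestamp synchronisation: odometry samples (t, x, y, theta) are
# linearly interpolated to the image timestamp, with the heading wrapped so
# the shorter angular path is taken.
import math

def interpolate_odometry(samples, t_img):
    """samples: time-sorted list of (t, x, y, theta); returns pose at t_img."""
    for (t0, x0, y0, a0), (t1, x1, y1, a1) in zip(samples, samples[1:]):
        if t0 <= t_img <= t1:
            u = (t_img - t0) / (t1 - t0)
            da = math.atan2(math.sin(a1 - a0), math.cos(a1 - a0))  # shortest arc
            return (x0 + u * (x1 - x0), y0 + u * (y1 - y0), a0 + u * da)
    raise ValueError("image timestamp outside odometry record")
```

Linear interpolation is adequate when the odometry rate is much higher than the frame rate; otherwise a motion-model-based prediction would be preferable.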
3.8.2 Robot-Camera Calibration
The robot motion is transformed to the camera frame to obtain the camera motion between
two views. The transformation from the robot frame to the camera frame was obtained through
a calibration process similar to Procedure A described in [59]. A calibration object such as
a chessboard is used, and a coordinate frame is fixed to it. The transformation of this frame to
the world frame is known and described as T^W_O, where O refers to the object frame and W
to the world frame. Also known are the transformation of the frame fixed to the robot center
with respect to the world frame, T^W_R, and the transformation from the camera frame to the
object frame, T^O_C, obtained through the usual extrinsic calibration routines. Then the
transformation of the camera frame with respect to the robot frame is obtained as
T^R_C = T^R_W T^W_O T^O_C. If the transformation of the calibration object from the world
frame is not easily measurable, the mobility of the robot can be used for the calibration; the
calibration in that case will be similar to the hand-eye calibration [52, 59].
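The chain T^R_C = (T^W_R)^{-1} T^W_O T^O_C can be sketched with 4×4 homogeneous transforms (names hypothetical; the rigid inverse exploits R^{-1} = R^T for rotations):

```python
# Sketch of the calibration chain: compose (T^W_R)^{-1}, T^W_O and T^O_C as
# 4x4 homogeneous transforms to obtain the camera pose in the robot frame.

def mat4_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def invert_rigid(T):
    """Inverse of a rigid transform [R | t]: [R^T | -R^T t]."""
    R = [row[:3] for row in T[:3]]
    t = [row[3] for row in T[:3]]
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]  # transpose
    mt = [-sum(Rt[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [Rt[0] + [mt[0]], Rt[1] + [mt[1]], Rt[2] + [mt[2]], [0, 0, 0, 1]]

def robot_camera_calibration(T_WR, T_WO, T_OC):
    return mat4_mul(invert_rigid(T_WR), mat4_mul(T_WO, T_OC))
```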
3.8.3 Preventing Odometry Noise
The top-left and top-right images of figure set 3.2 show a feature of a static point tracked
between the two images. The feature is highlighted by a red dot. The bottom figure of Fig. 3.2
depicts a set of epipolar lines, in green, generated for this tracked feature as a consequence of
modeling the noise in the camera egomotion estimate, as described in Sec. 3.6. The mean
epipolar line is shown in red. Since the feature lies away from the mean line, it is prone to
being misclassified as dynamic in the absence of a probabilistic framework. However, as it lies
on one of the green lines close to the mean line, its probability of being classified as stationary
is greater than that of being classified as dynamic. This probability increases in subsequent
images through the recursive Bayes filter update if the feature comes closer to the mean
epipolar line while lying on one of the set of lines. Note that an artificial error was induced in
the robot motion for the sake of better illustration. Also note that the two frames are separated
by a relatively large baseline; in general, stationary points do not deviate as much as shown in
the bottom figure of Fig. 3.2.
3.9 Results of Moving Object Detection
This section shows results of the motion segmentation presented in this chapter. The system
has been tested on a number of real image datasets, with varying numbers and types of moving
entities. However, we postpone most of the multibody motion segmentation results to Sec. 6.4.
In this chapter we concentrate on moving object detection from a camera mounted on a robot.
Figure 3.2 Left: A stationary feature shown in red. Middle: The same feature tracked in a subsequent image. Right: Though the feature is away from the mean epipolar line due to odometry noise, it still lies on one of the lines in the set.
Supplementary Video 1: A video showing detection of multiple people and other moving
objects while the robot moves and maneuvers around obstacles.
3.9.1 Robot-mounted Camera Sequence
We show experimental results on various test scenarios on an ActivMedia Pioneer-P3DX
mobile robot. A single IEEE 1394 FireWire camera (Videre MDCS2) mounted on the robot
was the only sensor used for the experiment. Images of resolution 320×240, captured at 30 Hz,
were processed on a standard onboard laptop.
Fig. 3.3 depicts a typical degenerate motion being detected by the system. The left and
right figures of the top row show the P3DX moving behind another robot, called MAX in our
lab. The salient features are shown in red. The left figure of the middle row shows the flow
vectors in yellow; the red dot at the tip of each yellow line is akin to an arrowhead indicating
the direction of the flow. The right figure of the middle row shows the epipolar lines in gray. It
also shows that the flow vectors on MAX move towards the epipole, while the flow vectors of
stationary features move away from it. The left figure of the bottom row shows the features
classified as moving, marked with green dots. All the features classified as moving lie on MAX,
as expected. The bottom-right figure highlights the moving regions with a green shade, formed
by a convex hull over the cluster of moving features.
Fig. 3.4 depicts motion detection when the robot is simultaneously rotating and translating.
The images in the top row were grabbed at two instants separated by 30 frames, as a person
moves in front of the rotating and translating camera. The left figure in the middle row shows
the flow vectors, while the right figure in the middle row shows the epipolar lines in gray and
the perpendicular distances of features from their expected (mean) epipolar lines in cyan.
Longer cyan lines indicate a greater perpendicular distance from the epipolar line. The left
figure in the bottom row depicts the features classified as moving in
Figure 3.3 Top Left: An image with stationary objects and a moving robot, MAX, ahead of the P3-DX. The KLT features are shown in red. Top Right: A subsequent image where MAX has moved further away. Middle Left: The flow vectors shown in yellow. Middle Right: The flow vectors of stationary features move away from the epipole, while MAX's flow vectors move closer to it. Bottom Left: Image with only the dynamic features in green. Bottom Right: Convex hull in green overlaid over the motion regions.
Figure 3.4 Top Left: An image with stationary objects and a moving person as the P3-DX rotates while translating. The KLT features are shown in red. Top Right: A subsequent image after further rotation and translation. Middle Left: The flow vectors shown in yellow. Middle Right: Flow vectors in yellow, epipolar lines in gray and perpendicular distances in cyan. Bottom Left: Features classified as dynamic, shown in green. Bottom Right: Convex hull in green overlaid over motion regions.
green, as they all lie on the moving person. The right figure of the bottom row shows the
convex hull in green formed from the clustered moving features, overlaid on the person.
3.9.2 Handheld Indoor Lab Sequence
This is an indoor sequence taken with an inexpensive hand-held camera. As the camera
moves around, moving persons enter and leave the scene. Fig. 3.5 shows the results for this
sequence. The bottom right picture in Fig. 3.5 shows how two spatially close independent
motions are clustered correctly by the algorithm. This sequence also involves a lot of degenerate
motion, as the camera and the persons move in the same direction. The 3D structure estimate
of the background helps in setting a tighter bound in the FVB constraint: the depth bound is
adjusted on the basis of the depth distribution of the reconstructed background along the
particular frustum, as explained in Sec. 3.5.2.
Figure 3.5 Results from the Indoor Lab Sequence
3.9.3 Detection of Degenerate Motions
Fig. 3.6 shows an example of degenerate motion detection: the flow vectors on the moving
person move almost along the epipolar lines, yet they are still detected thanks to the FVB
constraint. These results verify the system's performance under arbitrary camera trajectories,
degenerate motion and a changing number of moving entities.
Figure 3.6 Epipolar lines in gray; flow vectors after rotation compensation are shown in orange. Cyan lines show the distance to the epipolar line. Features detected as independently moving are shown as red dots. Note the near-degenerate independent motion in the middle and right images; the use of the FVB constraint enables efficient detection of such degenerate motion.
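The two per-feature tests used throughout these experiments, the epipolar constraint and the flow-vector-bound (FVB), can be sketched as follows. This is a minimal illustration, not the system's implementation: the fundamental matrix F, the thresholds and the flow bounds are placeholders, whereas the actual system derives the bounds from odometry and the reconstructed background depth.

```python
import numpy as np

def epipolar_distance(F, x1, x2):
    """Perpendicular distance of x2 (homogeneous pixel in image 2)
    from the epipolar line l = F @ x1 induced by x1 in image 1."""
    l = F @ x1
    return abs(l @ x2) / np.hypot(l[0], l[1])

def is_moving(F, x1, x2, epi_thresh, flow_lo, flow_hi):
    """Classify a tracked feature as independently moving if it either
    violates the epipolar constraint, or its flow magnitude falls
    outside the bound expected for any stationary point in the assumed
    depth range (the FVB test).  All thresholds are illustrative."""
    d = epipolar_distance(F, x1, x2)
    flow = np.linalg.norm(x2[:2] - x1[:2])
    return d > epi_thresh or not (flow_lo <= flow <= flow_hi)
```

The epipolar test alone misses degenerate motion along the epipolar line, which is exactly what the FVB term catches.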
3.9.4 Person detection
Some applications demand that people be explicitly distinguished from other moving objects.
We use "part-based" representations [67, 64] for person detection. The advantage of the
part-based approach is that it relies on body parts and is therefore much more robust to partial
occlusions than the standard approach of considering the whole person. We model our
implementation on [67]. Haar-feature based cascade classifiers were used to detect different
human body parts, namely the upper body, lower body, full body, and head and shoulders. These
detectors often lead to many false alarms and missed detections. The bottom-left image of
Fig. 3.7 depicts the false detections produced by these individual detectors. A probabilistic
combination [67] of the individual detectors gives a more robust person detector. But running
four Haar-feature based detectors over the whole image takes about 400 ms, which is too high
for a realtime implementation. We use knowledge of the motion regions detected by our method
to reduce the search space of the part detectors. This greatly reduces the computation: the time
taken is mostly less than 40 ms, and the detections also have fewer false positives.
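The search-space reduction itself can be sketched as follows. The part detector is injected as a callable (e.g. a Haar-cascade `detectMultiScale` call) so the sketch stays library-agnostic; the data layout and function names are illustrative, and the probabilistic fusion of the four part detectors from [67] is omitted.

```python
def detect_in_motion_regions(gray, motion_boxes, detect_parts):
    """Restrict a costly body-part detector to the bounding boxes of
    regions already segmented as moving.  `gray` is a 2D image (list of
    rows), `motion_boxes` is a list of (x, y, w, h) rectangles around
    the convex hulls of moving-feature clusters, and `detect_parts(roi)`
    returns ROI-relative (x, y, w, h) detections."""
    hits = []
    for (x, y, w, h) in motion_boxes:
        roi = [row[x:x + w] for row in gray[y:y + h]]
        for (dx, dy, dw, dh) in detect_parts(roi):
            # translate ROI-relative detections back to image coordinates
            hits.append((x + dx, y + dy, dw, dh))
    return hits
```

Because the detector now scans only a few small windows instead of the full frame, the cost drops roughly in proportion to the covered area.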
Figure 3.7 TOP LEFT: A scene involving a moving toy car and a person from the indoor sequence. TOP RIGHT: Detected moving regions are overlaid in green. BOTTOM LEFT: Haar classifier based body part detectors. BOTTOM RIGHT: Person detected by part-based person detection over image regions detected as moving.
Chapter 4
Visual SLAM Framework
Both SfM in computer vision and SLAM in mobile robotics address the same problem of
estimating sensor motion and the structure of an unknown static environment. Performing SLAM
with a single video camera, while an attractive prospect, adds its own particular difficulties
to the already considerable general challenges of the problem. In this chapter we put forward
the visual SLAM module used in our system. For each body/object chosen for reconstruction,
the visual SLAM module computes the structure of that body as well as the camera trajectory
w.r.t. that body. SfM has been studied for the last three decades, and most of the mathematical
theory is now textbook material [17, 15]. However, practical SfM and visual SLAM systems
have emerged only in the last decade, and a lot of research is yet to happen.
4.1 Related Works
Visual SLAM methods can be roughly categorized into two different approaches. The filtering
approaches of [11, 9, 49] recursively update a state vector consisting of probability distributions
over features and camera pose parameters. They employ filters like the EKF or the particle filter
to sequentially fuse measurements from all images. The second set of approaches [36, 32, 21, 47, 22]
are real-time, incremental versions of standard batch SfM. To achieve real-time performance,
bundle adjustment style optimization is performed only over a small number of past frames
selected through a sliding window [36, 32, 47], or over spatially distributed keyframes [21, 22].
The filter based approaches conventionally build a very sparse map (about 10-30 features per
frame) of high quality features, and they do not make use of any robust statistics to reject
outliers. The keyframe/bundle adjustment methods, in contrast, extract as much correspondence
information as possible and typically use robust statistics like RANSAC to eliminate
outliers. A detailed comparison between the two approaches can be found in [50].
4.2 Visual SLAM Formulation
Visual SLAM or SfM estimates the camera pose, denoted g^t_CW, and map points X_W ∈ R³
with respect to a certain world frame W, at a time instant t. The structure coordinates X_W are
assumed to be constant, i.e. static in this world frame, evident from the absence of the time
index t in their notation. In the multibody VSLAM scenario, the world frame W can be either the
static world frame S or a rigid moving object O which has been chosen for reconstruction. The
4×4 matrix g_CW contains a rotation and a translation and transforms a map point from the world
coordinate frame to the camera-centred frame C by the equation X_C = g_CW X_W. It belongs to
the Lie group of special Euclidean transformations, SE(3). The tangent space of an element of
SE(3) is its corresponding Lie algebra se(3), so any rigid transformation is minimally
parameterised as a 6-vector in the tangent space at the identity element. We denote this minimal
6-vector as ξ := (vᵀ ωᵀ)ᵀ ∈ R⁶, where the first three elements v represent the translation,
while the last three, ω, are an axis-angle representation of the rotation. The vector ξ ∈ R⁶
gives the twist coordinates of the twist matrix ξ̂ ∈ se(3). Thus a particular twist is a linear
combination of the generators of the SE(3) group, i.e.
    ξ̂ = Σ_{i=1}^{6} ξ_i G_i = [ ω̂  v ; 0  0 ],   ω̂ ∈ so(3), v ∈ R³        (4.1)
Here the ξ_i are the individual elements of ξ and the G_i are the 4×4 generator matrices which
form the basis for the tangent space of SE(3), and ω̂ is the skew-symmetric matrix obtained from
the 3-vector ω. The exponential map exp: se(3) → SE(3) maps a twist matrix to its corresponding
transformation matrix in SE(3) and can be computed efficiently in closed form. Changes in the
camera pose g_CW are obtained by pre-multiplying with a 4×4 transformation matrix in SE(3).
Thus the camera pose evolves with time as:

    g^{t+1}_CW = Δg^t g^t_CW = exp(ξ̂) g^t_CW        (4.2)
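As a concrete sketch, the closed-form exponential map and the incremental pose update of Eq. 4.2 might look as follows. This is a minimal implementation using the standard Rodrigues formula, with the (v, ω) twist ordering defined above; it is illustrative, not the system's actual code.

```python
import numpy as np

def hat(w):
    """Skew-symmetric matrix w^ of a 3-vector w."""
    return np.array([[0, -w[2], w[1]],
                     [w[2], 0, -w[0]],
                     [-w[1], w[0], 0]])

def exp_se3(xi):
    """Closed-form exponential map se(3) -> SE(3) for a twist
    xi = (v, w), v the translational and w the rotational part."""
    v, w = xi[:3], xi[3:]
    theta = np.linalg.norm(w)
    W = hat(w)
    if theta < 1e-10:                        # near-zero rotation
        R, V = np.eye(3), np.eye(3)
    else:
        A = np.sin(theta) / theta
        B = (1 - np.cos(theta)) / theta**2
        C = (1 - A) / theta**2
        R = np.eye(3) + A * W + B * W @ W    # Rodrigues' rotation formula
        V = np.eye(3) + B * W + C * W @ W    # left Jacobian of SO(3)
    g = np.eye(4)
    g[:3, :3], g[:3, 3] = R, V @ v
    return g

# incremental pose update g_{t+1} = exp(xi^) g_t  (Eq. 4.2)
g = np.eye(4)
g = exp_se3(np.array([0.1, 0.0, 0.0, 0.0, 0.0, np.pi / 2])) @ g
```

Because the update is composed in the tangent space at the identity and mapped back by exp, the pose stays on the SE(3) manifold with a minimal, singularity-free parameterisation, as the text above describes.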
The world points X_W are first transformed to the camera frame and then projected onto the
image plane using a calibrated camera projection model CamProj(·). This defines our
measurement function z as:

    z = (u, v)ᵀ = CamProj(g_CW X_W)        (4.3)
In each visual SLAM, the state vector x consists of a set of camera poses and reconstructed 3D
world points. The optimization iteratively improves the state vector x so as to minimize a sum
of squared errors between predictions and the observed data z. The incremental updates in the
optimization are calculated, as in Eq. 4.2, in the tangent space se(3) around the identity and
mapped back onto the manifold. This enables a minimal representation during optimization and
avoids singularities. Also, the Jacobians of the above equations needed in the
optimization process can be readily obtained in closed form. Due to these advantages, the Lie
theory based representation of rigid body motion is becoming popular among recent VSLAM
solutions [23, 51]. We use this Lie group formulation again for tracking the moving objects, as
described in Chapter 5.
The monocular visual SLAM framework is that of a standard bundle adjustment visual
SLAM [21, 32, 51]. A 5-point algorithm with RANSAC is used to estimate the initial epipolar
geometry, and subsequent poses are determined by camera resection. Some of the frames are
selected as keyframes, which are used to triangulate 3D points. The set of 3D points and the
corresponding keyframes are used by the bundle adjustment process to iteratively minimize
reprojection error. Bundle adjustment is initially performed over the most recent keyframes
before attempting a global optimization. Our implementation closely follows that of [21, 32].
While one thread performs tasks like camera pose estimation and keyframe decision and
addition, another back-end thread optimizes the estimate by bundle adjustment. But there are a
couple of important differences from existing SLAM methods, namely the interplay with motion
segmentation, the bearing-only object and feature tracking module, and the reconstruction of
small moving objects. These are discussed next.
4.3 Feedback from Motion Segmentation
Motion segmentation prevents independent motions from entering the VSLAM computation,
which could otherwise result in an incorrect initial SfM estimate and lead the bundle
adjustment to converge to a local minimum. The feedback results in fewer outliers in the SfM
process of a particular object. Thus the SfM estimate is better conditioned and fewer RANSAC
iterations are needed. Apart from improving the camera motion estimate, the knowledge of the
independent foreground objects coming from motion segmentation helps in the data association
of features that are currently occluded by such an object. For the foreground independent
motions, we form a convex hull around the tracked points clustered as an independently moving
entity. Existing 3D points lying inside this region are marked as not visible and are not
searched for a match. This prevents 3D features from unnecessary deletion and re-initialization
just because they were occluded by an independent motion for some time.
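This occlusion feedback can be sketched as follows, using a standard monotone-chain convex hull and a point-in-polygon test. The function names and data layout are illustrative; the real system operates on tracked feature coordinates and the projections of existing map points.

```python
def _cross(o, a, b):
    """2D cross product of OA x OB; > 0 means a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone-chain convex hull, returned counter-clockwise."""
    pts = sorted(set(map(tuple, points)))
    if len(pts) <= 2:
        return pts
    def half(seq):
        h = []
        for p in seq:
            while len(h) >= 2 and _cross(h[-2], h[-1], p) <= 0:
                h.pop()
            h.append(p)
        return h[:-1]
    return half(pts) + half(pts[::-1])

def mark_occluded(projections, moving_cluster):
    """Flag existing map points whose image projection falls inside the
    convex hull of an independently-moving feature cluster; such points
    are skipped during matching rather than deleted."""
    hull = convex_hull(moving_cluster)
    n = len(hull)
    def inside(p):
        return all(_cross(hull[i], hull[(i + 1) % n], p) >= 0
                   for i in range(n))
    return [inside(p) for p in projections]
```

Points flagged here keep their map entries and simply resume matching once the moving object passes, which is the behaviour the paragraph above describes.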
4.4 Dealing with Degenerate Configurations

In dynamic scenes, moving objects are often small compared to the field of view, and often
appear planar or exhibit very little perspective effect. Both relative pose estimation and
camera resection then face ambiguity, resulting in significant instability. During relative pose
estimation from two views, coplanar world points can cause at most a two-fold ambiguity, so
we use the 5-point algorithm over 3 views to resolve this planar degeneracy, exactly as
described in [35]. Though, theoretically, calibrated camera resection from a coplanar set of
points has a unique solution, unlike its uncalibrated counterpart, it still suffers from
ambiguity and instability, as shown in [45]. So for seemingly small and planar objects we
modified the EPnP code as in Sec. 3.4 of [26] to initialize the resection process, which is
then refined by bundle adjustment.
Chapter 5
Moving Object Tracking
A monocular camera is a projective sensor that only provides bearing information about the
scene. Moving object tracking with mono-vision is therefore a bearings-only tracking (BOT)
problem, which aims to estimate the state of a moving target comprising its 3D position and
velocity. A separate BOT filter is employed for each independently moving object. At any time
instant t, the camera only observes the bearing of a tracked feature on the moving object. We
take the moving object state to be g^t_OS ∈ SE(3), representing the 3D rigid body
transformation of the moving object O in the static world frame S. Through visual SLAM on the
static body, we already know the camera pose g^t_CS ∈ SE(3). Due to inherent non-linearity and
observability issues, the particle filter has been the preferred approach [5] for BOT. In this
chapter we develop a formulation of particle filter based BOT that integrates multiple cues
from the static world reconstruction.
We start with the simple BOT framework in the absence of any cues. Reconstruction of the
static world provides various cues which help in constraining the moving object's depth and
velocity. Sec. 5.1.3 describes how those constraints are integrated as the tracker iterates
through time.
5.1 Particle Filter based BOT
The uncertainty in the pose of the object is represented by the poses of a set of particles
g_iS and their associated weights. Each particle's state, denoted g^t_iS ∈ SE(3), represents
its pose w.r.t. S at time instant t. We continue with the Lie group preliminaries discussed in
Chap. 4. We assume an instantaneous constant velocity (CV) motion model, which is considered
the best bet and the most generic model for an unknown motion. The mean velocity between two
instants is represented by the mean twist matrix ξ̃^t_i = (1/Δt) ln(g^t_i (g^{t−1}_i)^{−1}),
where ξ̃ ∈ se(3) is the mean twist matrix associated with the mean six dimensional velocity
vector ξ ∈ R⁶. The motion
model of the particle then generates samples according to the pdf (probability density
function) p(g^{t+1}_iS | g^t_iS, ξ^t_i). Each component of the mean velocity vector has a
Gaussian error with standard deviation σ_j, j ∈ {1, …, 6}. To transform this Gaussian
distribution in R⁶ to the SE(3) space, the following procedure is used. We define a vector
α ∈ R⁶, each component α_j of which is sampled from the Gaussian N(0, σ_j²); α̂ is then the
twist matrix associated with α. The product exp(α̂) exp(ξ̃^t_i) generates samples in SE(3)
corresponding to Gaussian errors centred at the mean velocity. The dynamic model of the
particle thus generates samples that approximate the pdf given before as

    g^{t+1}_iS = exp(α̂) exp(ξ̃^t_i Δt) g^t_iS        (5.1)
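The sampling step of Eq. 5.1 can be sketched with a generic matrix exponential (here `scipy.linalg.expm`; a closed-form se(3) exponential would be used in practice). The noise standard deviations are illustrative placeholders.

```python
import numpy as np
from scipy.linalg import expm

def hat6(xi):
    """4x4 twist matrix xi^ of a 6-vector xi = (v, w)."""
    v, w = xi[:3], xi[3:]
    T = np.zeros((4, 4))
    T[:3, :3] = [[0, -w[2], w[1]], [w[2], 0, -w[0]], [-w[1], w[0], 0]]
    T[:3, 3] = v
    return T

def propagate_particle(g, xi_mean, sigma, dt, rng):
    """One constant-velocity motion-model step (Eq. 5.1): perturb the
    mean twist by per-component Gaussian noise alpha ~ N(0, sigma_j^2)
    and compose both factors through the matrix exponential."""
    alpha = rng.normal(0.0, sigma)           # Gaussian error in R^6
    return expm(hat6(alpha)) @ expm(hat6(xi_mean) * dt) @ g
```

Drawing `alpha` in the tangent space and mapping it through exp keeps every sampled pose exactly on SE(3), which is the point of the construction in the text.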
The measurement model predicts the location in the image of a particle with SE(3) pose
g^{t+1}_i as

    z^{t+1}_i = (u^{t+1}_i, v^{t+1}_i)ᵀ = CamProj(Trans(g^{t+1}_CS g^{t+1}_i))        (5.2)

Here the Trans(·) operator extracts the translation vector associated with the SE(3) pose of
the particle and CamProj(·) is the camera projection of Eq. 4.3. The weight w_i of the particle
is updated as w^{t+1}_i = (1/(√(2π) η)) exp(−(z − z_i)ᵀ(z − z_i)/(2η²)), where z is the actual
image coordinate of the feature being tracked. The particles then undergo resampling in the
usual particle filter way: particles with a higher weight have a higher probability of being
resampled.
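A sketch of this measurement update and resampling step follows; the pinhole intrinsics, the noise scale eta, and the multinomial resampling scheme are illustrative choices, not necessarily those of the implemented system.

```python
import numpy as np

def weight_update(particles_xyz, K, g_CS, z, eta):
    """Project each particle's 3D position into the image with a pinhole
    model and weight it by a Gaussian on the pixel reprojection error.
    K is the 3x3 intrinsic matrix, g_CS the 4x4 camera pose,
    z the observed pixel, eta the measurement noise scale."""
    w = np.empty(len(particles_xyz))
    for i, X in enumerate(particles_xyz):
        Xc = g_CS[:3, :3] @ X + g_CS[:3, 3]   # world -> camera frame
        u = K @ (Xc / Xc[2])                  # pinhole projection
        e = u[:2] - z
        w[i] = np.exp(-e @ e / (2 * eta**2))
    return w / w.sum()                        # normalized weights

def resample(particles, weights, rng):
    """Multinomial resampling: higher-weight particles are more likely
    to be duplicated into the next generation."""
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return [particles[i] for i in idx]
```

Particles whose projected position lands far from the tracked feature receive near-zero weight and are culled at the resampling step, concentrating the particle set along the true bearing.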
5.1.1 Ground Plane, Depth Bound & Size Bound
The structure estimate of the static world from the visual SLAM module helps in reducing the
possible bound on depth. Instead of setting the maximum depth to infinity, the known depth of
the background allows us to limit the depth of a foreground moving object. The depth bound
(DB) is adjusted on the basis of the depth distribution of static world map points along the
particular frustum of the ray. This bound gets updated as the camera moves around in the
static world. The 3D point cloud of the static world is also used to estimate the ground plane
(GP). Using the fact that most real world objects move over the ground plane, we can constrain
the velocity vector so that the object's height above the ground plane stays constant. Both
the above cues ignore the fact that we are able to track multiple features of the object. At
wrong depths, these points may be reconstructed to lie below the ground plane or unrealistically
high above it. This criterion of size and unrealistic reconstruction is used to obtain an
additional depth constraint. All these cues constrain the possible depth or velocity space.
The integration of these depth and velocity constraints into the BOT filter is discussed in
Sec. 5.1.3.
5.1.2 Initialization
Initialization is an important step for the performance of a particle filter in BOT. For a
moving object which enters the scene for the first time, particles are initialized all along
the ray starting from the camera and passing through the image point which is the projection
of a point on the dynamic object being considered. Uniform sampling is then used to initialize
the particles at various depths inside the bound [d_min, d_max] computed from the depth bound
cue described previously in Sec. 5.1.1. The velocity components are initialized in a similar
manner: at each depth, a number of particles with various velocities are uniformly sampled so
that the speeds lie inside a predetermined range [s_min, s_max] along all possible directions.
When a previously static object starts moving independently, we can do better than uniform
sampling: we initialize the depth with a normal distribution N(d, σ²), where d is the depth
estimate obtained from the point's reconstruction as part of the original body.
5.1.3 Integrating Depth and Velocity Constraints
Depth and velocity constraints play a very important role in improving tracker performance,
even in scenarios which are otherwise unobservable for a bearings-only tracker. They reduce
the space of state vectors to a constrained set denoted ψ. This can be implemented in the
motion model by sampling from a truncated density function p_s, defined as:

    p_s = { p(g^{t+1}_iS | g^t_iS, ξ^t_i)   if g^{t+1}_iS ∈ ψ
          { 0                                otherwise        (5.3)

Here the non-truncated pdf of the motion model, p(g^{t+1}_iS | g^t_iS, ξ^t_i), is evaluated
from Eq. 5.1. To draw samples from this truncated distribution, we use rejection sampling over
the distribution until the condition g^{t+1}_iS ∈ ψ is satisfied. Rejection sampling can be
inefficient, so in our implementation we restrict the number of trials; if the sample still
does not lie inside ψ, we flag the particle for a lower weight in the measurement update step.
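The capped rejection-sampling scheme for Eq. 5.3 can be sketched generically; the proposal and constraint test are injected callables, and the trial cap is an illustrative value.

```python
def sample_constrained(propose, in_psi, max_trials=20):
    """Draw from the truncated motion model (Eq. 5.3) by rejection
    sampling: re-propose until the new pose lies in the constraint set
    psi, but cap the number of trials.  If the cap is reached, keep the
    last proposal and flag it so the measurement step can down-weight
    it.  `propose()` samples the unconstrained motion model (Eq. 5.1);
    `in_psi(g)` tests membership in psi."""
    g = None
    for _ in range(max_trials):
        g = propose()
        if in_psi(g):
            return g, False      # valid constrained sample
    return g, True               # cap hit: flag for lower weight
```

Returning a flagged sample rather than looping forever keeps the per-frame cost bounded, at the price of occasionally carrying a particle that violates the constraints until resampling removes it.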
Chapter 6
Unification: putting everything together
In this chapter we discuss how the different modules of feature tracking, motion segmentation,
visual SLAM and moving object tracking are put together to build a unified multibody
reconstruction of the dynamic scene. We aim to build a unified 3D map of the dynamic world
which changes in time, and thus provides information about the moving objects in addition to
the static world. We primarily discuss the most important problem, namely the "relative scale
problem", that hinders such a unified multibody reconstruction from monocular vision. We also
present the final results of our multibody reconstruction system in this chapter.
6.1 Relative Scale Problem
By performing visual SLAM on the moving object, we obtain the camera pose g^t_CO ∈ SE(3) and
object points X_O ∈ R³ with respect to the object frame O. We also obtain the camera pose g_CS
in the static world frame S. Thus the configuration of the moving object O w.r.t. the static
world S can be obtained as g_OS = g_CS^{-1} g_CO. Expanding this equation in the homogeneous
representation we obtain:

    [ R_OS  t_OS ]   [ R_CSᵀ  −R_CSᵀ t_CS ] [ R_CO  t_CO ]
    [  0      1  ] = [  0          1      ] [  0      1  ]        (6.1)

Equating the rotation and translation parts of Eq. 6.1, we obtain R_OS = R_CSᵀ R_CO and
t_OS = R_CSᵀ t_CO − R_CSᵀ t_CS. We can obtain R_OS exactly, but from monocular SfM we can only
obtain t_CO and t_CS up to some unknown scales [17]. We can fix the scale for t_CS, i.e. for
the static background, as 1, and denote the scale for t_CO by the unknown relative scale
parameter s. Then the trajectory of the moving object is a 1-parameter family of possible
trajectories given by

    t_OS = s R_CSᵀ t_CO − R_CSᵀ t_CS        (6.2)
All of these trajectories satisfy the image observations, i.e. the projections of the world
points on the moving object are the same for all of them. This is a direct consequence of the
depth unobservability of a monocular camera. Thus, even after reconstructing a moving car, we
are not able to say whether it is a toy car moving in front of the camera or a standard car
moving on the road. So we need to estimate this relative scale, and only when the estimated
scale is close to the true scale will the reconstruction be meaningful. As in bearings-only
tracking of a moving point from a monocular camera, it is impossible to estimate the true
scale without any assumptions about the way the object moves. Ozden et al. [12] exploited the
increased coupling between camera and object translations that tends to appear at false scales,
and the resulting non-accidentalness of the object trajectory. However, their approach is
essentially batch processing: trajectory data over time is reconstructed for all possible
scales, and the trajectory which is, say, most planar is chosen by virtue of being unlikely to
occur accidentally. Moreover, the method only works when we are able to reconstruct the moving
object.
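The one-parameter family of Eqs. 6.1-6.2 can be evaluated directly from the two SfM outputs; a minimal sketch (the pose layout is the 4×4 homogeneous convention used throughout):

```python
import numpy as np

def object_pose_family(g_CS, g_CO, s):
    """Object pose in the static frame for a candidate relative scale s
    (Eqs. 6.1-6.2): the rotation R_OS is exact, while the translation
    depends linearly on the unknown scale s of the object-frame SfM."""
    R_CS, t_CS = g_CS[:3, :3], g_CS[:3, 3]
    R_CO, t_CO = g_CO[:3, :3], g_CO[:3, 3]
    g_OS = np.eye(4)
    g_OS[:3, :3] = R_CS.T @ R_CO                       # exact rotation
    g_OS[:3, 3] = s * R_CS.T @ t_CO - R_CS.T @ t_CS    # scale-dependent
    return g_OS
```

Sweeping s traces out the whole family of trajectories that project identically into the images, which is exactly the ambiguity the relative-scale estimation must resolve.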
Unlike Ozden et al. [12], we take a different approach, employing the particle filter based
BOT on a point of the moving object to solve the relative scale problem. The state of the
moving object (i.e. position and velocity) and the associated uncertainty are continuously
estimated by the tracker and are completely represented by its set of particles. The mean of
the particles is thus the best estimate of the moving point from the filtering point of view,
given the assumptions (state transition model) made in the design of the filter. When the BOT
is able to estimate the depth of a moving point with reasonable certainty, we can use this
depth to fix the relative scale and obtain a realistic multibody reconstruction. Apart from
the online nature of this solution, the BOT can also estimate the state of an object for which
reconstruction is not possible. Denote the posterior depth estimate of a point on the moving
object obtained by BOT as d_BOT, and the depth of the same point computed by the visual SLAM
on that object as d_SFM. The map points X_O and camera poses g_CO are then scaled by
s = d_BOT / d_SFM before being added to the integrated map.
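The scale-fixing step itself is a simple rescaling of the object-frame reconstruction; a sketch (function names illustrative):

```python
import numpy as np

def fix_relative_scale(points_O, poses_CO, d_bot, d_sfm):
    """Once the BOT filter's depth estimate d_bot of a tracked point is
    confident, rescale the object's SfM reconstruction so its depths
    agree: s = d_bot / d_sfm is applied to the map points and to the
    translation of every camera-in-object pose, before merging into
    the unified map."""
    s = d_bot / d_sfm
    scaled_points = [s * X for X in points_O]
    scaled_poses = []
    for g in poses_CO:
        g2 = g.copy()
        g2[:3, 3] *= s          # scale only the translation; R is exact
        scaled_poses.append(g2)
    return scaled_points, scaled_poses
```

Scaling points and pose translations by the same factor leaves all reprojections unchanged, so the rescaled object remains consistent with the images while now sitting at a metrically plausible depth.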
6.2 Feedback from SfM to BOT
For the objects chosen for reconstruction, a successful reconstruction of the moving object
from the visual SLAM module can in turn improve the bearings-only tracking (BOT). As described
in Sec. 6.1, there exists a 1-parameter family of possible solutions for the trajectory of a
moving point. Let d_SFM denote the depth of the tracked moving point from the camera in the
object frame, and d_iS the depth of the i-th particle from the camera pose in the static world
frame. Using Eq. 6.2,

    t_iS = s_i R_CSᵀ t_CO − R_CSᵀ t_CS        (6.3)
where s_i = d_iS / d_SFM. Thus for a particle at a particular depth, SfM on the moving object
gives a unique estimate of the particle translation. This information can be used during the
measurement update, and also to set the motion model for the next state transition: when SfM
estimates are available they act as a secondary observation, with observation function given by
Eq. 6.3. The measurement update computes a distance measure between the particle position
estimated from Eq. 6.3 and the position predicted by the motion model. Thus particles whose
velocity differs from that estimated by SfM, but which still lie on the projected ray, can be
assigned lower weights or rejected. For the particles which survive the resampling after this
measurement update, the motion models are set in accordance with the estimate of Eq. 6.3. Let
the twist matrix corresponding to this transformation estimate given by SfM for a particle i
be denoted ξ̂^t_{i,SFM}. The particle i is then sampled from the motion model pdf
p(g^{t+1}_iS | g^t_iS, ξ^t_{i,SFM}), which essentially generates a particle with mean

    g^{t+1}_iS = exp(ξ̂^t_{i,SFM} Δt) g^t_iS        (6.4)
Between two views, the SfM estimate obtained from the visual SLAM module thus reduces the set
of possible trajectories from all trajectories lying along the two projection rays to the
one-parameter family given by Eq. 6.2.
6.3 Dynamic 3D Occupancy Map
The output of our multibody visual SLAM system can be used to generate stochastic 3D
occupancy maps, useful for applications like robot navigation and path planning. Solving the
relative scale problem enables us to create a dynamic map of the scene containing the
structure and trajectory of both static and moving objects. Using the current state and
uncertainty estimate of a moving object as given by the BOT module, we can find its current
position with a certain probability and also predict the most likely space to be occupied by
the object in the next instant. To realize this module, we have made use of OctoMap [65], a
probabilistic 3D volumetric mapping library. While creating this volumetric map, we assume
that the camera poses are perfect, so the underlying math is the same as discussed in Chapter 9
of [55]. With each sensor observation, we update both the occupancy and non-occupancy
probability of a voxel. Whenever a new 3D feature has been triangulated and added to the 3D
point cloud, a range ray measurement with range equal to the depth of this 3D feature is added
to the probabilistic occupancy map. All voxels up to the end of this ray are considered
unoccupied. Probabilistic mapping consumes a lot of computation, but it is fundamental to any
robust solution that builds on the map.
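A sketch of one such ray update in the spirit of OctoMap's log-odds scheme follows; the voxel keying, step size and update constants are illustrative, not OctoMap's actual API or values.

```python
import numpy as np

def update_ray(log_odds, origin, endpoint, voxel_size,
               l_occ=0.85, l_free=-0.4):
    """Log-odds occupancy update for one range ray: step along the ray
    from the camera to the triangulated feature, decreasing the
    log-odds of every traversed voxel (free space) and increasing it
    for the endpoint voxel (occupied hit).  `log_odds` is a dict keyed
    by integer voxel indices."""
    direction = endpoint - origin
    n_steps = int(np.linalg.norm(direction) / voxel_size)
    for i in range(n_steps):
        p = origin + direction * (i / max(n_steps, 1))
        key = tuple((p // voxel_size).astype(int))
        log_odds[key] = log_odds.get(key, 0.0) + l_free   # traversed: free
    end_key = tuple((endpoint // voxel_size).astype(int))
    log_odds[end_key] = log_odds.get(end_key, 0.0) + l_occ  # hit: occupied
    return log_odds
```

Working in log-odds makes each observation an additive update, so repeated rays through the same voxel accumulate evidence for or against occupancy.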
6.4 Multibody Reconstruction Results
The system has been tested on a number of publicly available real image datasets with varying
numbers and types of moving entities. Details of the image sequences used in the experiments
are listed in Table 6.1. The system is implemented as threaded processes in C++. The open
source libraries TooN, OpenCV and SBA (for bundle adjustment) are used throughout the system.
The runtime of the algorithm depends on many factors, like the number of bodies being
reconstructed, the total number of independent motions being tracked by the BOT, the image
resolution and the bundle adjustment rules. The system runs in realtime at an average of 14 Hz
on a standard laptop (Intel Core i7), compared to 1 minute per frame for [37], with up to two
moving objects being simultaneously tracked and reconstructed.
Dataset Image Resolution Trajectory Length Avg. Runtime
Moving Box [62] 320x240 718 images 20Hz
Versailles Rond [10] 760x578 700 images (400m) 7Hz
New College [48] 512x384 1500 images 13Hz
CamVid [6] 480x360 (Resized) 1600 images (0.7km) 11Hz
Table 6.1 Details of the datasets
Note: the legends (Fig. 6.1) used here and in the accompanying video are slightly different
from the figures shown in the main paper. Motion segmentation results are shown by shading in
the corresponding color of the convex hull formed from the feature points segmented as
independently moving. Reconstructed 3D static world points are colored by their height above
the estimated ground plane. Trajectories and structure of moving objects are shown in a
distinct color (red/blue). Particles of the BOT filter are shown in green. The images are best
viewed on screen.
6.4.1 CamVid Sequence
We tested our system on some dynamic parts of the CamVid dataset [6]. This is a road sequence
captured by a camera mounted on a moving car. The results are highlighted in Fig. 6.2, which
shows the camera trajectory and the 3D structure of the static background. The reconstruction
and 3D trajectory of a moving car in the scene, as produced by the system, are also shown.
Note the high degree of correlation between the camera and car trajectories, which makes this
sequence challenging for both motion segmentation and relative scale estimation.
Figure 6.1 Legends used in the figures
Figure 6.2 Results on the CamVid dataset. The top image shows the output of motion segmentation. The bottom left image shows the reconstruction of the static world and a moving car at a certain instant. Particles of the BOT are shown in green and the camera trajectory is colored red. The bottom right image also shows the estimated 3D trajectory of the moving car.
6.4.2 New College Sequence
We tested our system on some dynamic parts of the New College dataset [48]. Only the left
image of each stereo pair has been used. In this sequence, the camera moves along a roughly
circular campus path, and three moving persons pass through the scene. The results are
highlighted in Fig. 6.3, which shows the map and camera trajectory with respect to the static
world and the final depth estimates from BOT. It is to be noted that this sequence, along with
most of the sequences in these experiments, is generally unobservable; it is only after
integrating the different cues that we obtain a decent estimate of the moving object locations.
Figure 6.3 Results on the New College dataset sub-sequence. The top image shows the output of motion segmentation. The bottom image shows the reconstructed map of the static world and the final estimate of the positions of the three detected moving persons. Particles of the BOT are shown in green.
6.4.3 Versailles Rond Sequence
This is an urban outdoor sequence [10] taken from a fast moving car, with multiple moving
objects entering and leaving the scene. Only the left image of each stereo pair has been used.
Fig. 6.4 shows the results of the integrated map produced by the algorithm. The middle image
shows an instance of the online occupancy map, consisting of the 3D reconstruction of two
moving cars, the corresponding BOT trackers and the most likely occupancy of the moving
objects in the next instant. The bottom of Fig. 6.4 shows the reconstructed trajectories of
the two moving cars, in red and blue.
6.4.4 Moving Box Sequence
This is the same sequence as used in [62]. A previously static box is moved in front of the
camera, which itself moves arbitrarily. Unlike [62], our method does not use any 3D model, and
thus works for any previously unseen object. As shown in Fig. 6.5, our algorithm reliably
detects the moving object purely on the basis of motion constraints. The foreground moving
box, however, is nearly white and thus provides very few features for reconstruction. This
sequence also highlights the detection of previously static moving objects: upon detection, 3D
map points lying on the moving box are deleted and their 3D coordinates are used to initialize
the BOT as described in Sec. 5.1.2.
6.5 Discussion
We have shown results for multibody visual SLAM under unobservable motion, degenerate motion,
arbitrary camera trajectories and a changing number of moving entities. This is made possible,
even in unobservable cases, by integrating multiple cues from the reconstruction pipeline. The
algorithm is also online (causal) in nature and scales to arbitrarily long sequences.
6.5.1 Comparison of different cues to BOT
Fig. 6.6 shows the improvement in bearing-only tracking for different cues. The left graph
shows the depth variance obtained for a moving car in the CamVid sequence; since it is only
tracked through the BOT, no SfM cue is available. The right graph compares the performance
for the 3rd moving car in the Versailles Rond sequence. As seen in Fig. 6.6, the feedback from
SfM has the greatest effect in reducing the uncertainty among all cues. For a particular particle
of the BOT filter, the ground-plane (GP) cue constrains the possible velocities to lie parallel
to the plane, whereas the SfM cue restricts it to a unique velocity vector for each particle
depth. The depth and size bounds perform well even for highly correlated motions.
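The geometric content of the GP and SfM cues can be sketched as follows. Function names, frames and the constant-depth assumption in the SfM cue are illustrative, not the thesis implementation: the GP cue projects a particle's velocity hypothesis onto the ground plane, while, for a hypothesized particle depth, two consecutive bearings together with the known camera translation pin down a single object velocity.

```python
import numpy as np

def apply_ground_plane_cue(velocities, plane_normal):
    """GP cue: project each particle's velocity hypothesis onto the ground
    plane, since feasible object velocities lie parallel to the plane.
    velocities : (N, 3) array of particle velocity hypotheses."""
    n = plane_normal / np.linalg.norm(plane_normal)
    return velocities - np.outer(velocities @ n, n)

def sfm_cue_velocity(bearing_prev, bearing_cur, depth, cam_translation, dt):
    """SfM cue (sketch): for a hypothesized particle depth, two consecutive
    unit bearing rays (world frame, from the respective camera centers) plus
    the known camera translation determine a unique object velocity.
    Assumes the depth is constant over the short interval dt."""
    p_prev = depth * bearing_prev                   # object position at t-1
    p_cur = cam_translation + depth * bearing_cur   # position at t
    return (p_cur - p_prev) / dt
```

This illustrates why the SfM feedback shrinks the depth variance fastest in Fig. 6.6: it removes all velocity freedom per depth hypothesis, whereas the GP cue only removes one degree of freedom.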
6.5.2 Smooth Camera Motions
Moving object tracking from a smoothly moving camera is very challenging. The motion
becomes unobservable for a naive BOT and results in very high correlation, rendering the
Figure 6.4 Results on the Versailles Rond sequence. Top: sample segmentation results from the sequence. Middle: an instance of the online occupancy map; the shaded region shows the space most likely to be occupied in the next 16 frames (around 1 s). Bottom: the reconstruction and trajectories of two moving cars.
Figure 6.5 Results for the Moving Box sequence
Figure 6.6 Comparison of different cues to the BOT, namely Depth Bound (DB), Ground Plane (GP) and SfM feedback.
methods of [38, 12] unsuitable. The left of Fig. 6.7 shows the trajectory of the 5th moving car
of the Versailles Rond sequence at three different scales. Contrary to [12], the trajectories at
wrong scales do not show any accidentalness or violation of the heading constraint, which
demonstrates the ineffectiveness of that approach for relative scale estimation from smoothly
moving cameras. Typical road scenes also involve frequent degenerate motions, which make
moving objects hard even to detect. The right image of Fig. 6.7 shows an example of degenerate
motion detection: the flow vectors on the moving person move almost along the epipolar lines,
but the features are still detected owing to the FVB constraint (Sec. 3), which is further
improved by incorporating feedback from the static world reconstruction.
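The two-view tests behind this detection can be sketched as follows. The function, thresholds, and the way the flow vector bound (FVB) is passed in are illustrative assumptions, not the thesis implementation: a feature is flagged as moving either when it clearly violates the epipolar constraint, or, in the degenerate case where it moves along the epipolar line, when it falls outside the segment of the line predicted from the known camera motion and the admissible depth range.

```python
import numpy as np

def is_moving_feature(x_prev, x_cur, F, fvb_pt_min, fvb_pt_max,
                      epi_thresh=1.0, fvb_margin=0.05):
    """Sketch of moving-feature detection in two views.

    x_prev, x_cur : homogeneous image points (3,) in the two views
    F             : fundamental matrix mapping view-1 points to lines in view 2
    fvb_pt_min/max: image points where the minimum and maximum admissible
                    depth project along the epipolar line (the FVB segment)
    """
    # Distance of x_cur from the epipolar line of x_prev.
    l = F @ x_prev
    d_epi = abs(l @ x_cur) / np.hypot(l[0], l[1])
    if d_epi > epi_thresh:
        return True  # ordinary epipolar-constraint violation
    # Degenerate case: flow lies along the line, so test the position along
    # it against the FVB segment (t in [0, 1] means inside the bound;
    # fvb_margin is a small normalized slack for measurement noise).
    a, b = fvb_pt_min[:2], fvb_pt_max[:2]
    t = (x_cur[:2] - a) @ (b - a) / ((b - a) @ (b - a))
    return t < -fvb_margin or t > 1 + fvb_margin
```

The feedback from the static-world reconstruction tightens the depth bounds, which shortens the FVB segment and makes the degenerate test more discriminative.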
Figure 6.7 LEFT: Moving object trajectory for three different scales of 0.04, 0.11 and 0.18, with 0.11 (red) being the correct scale. RIGHT: Degenerate motion detection. Epipolar lines are shown in grey; flow vectors after rotation compensation in orange. Cyan lines show the distance to the epipolar line. Detected moving features are shown as red dots.
Chapter 7
Conclusion and Future Directions
7.1 Conclusion
In this thesis, we worked towards a practical vision based Simultaneous Localization and
Mapping (SLAM) system for highly dynamic environments. Knowledge of the moving objects
in the scene is of paramount importance for any autonomous mobile robot working in real-world
scenarios. However, as discussed in the related works, there has been very little progress in that
regard. We presented a multibody visual SLAM system that adapts multibody SfM theory
along the same lines as visual SLAM adapts standard offline batch SfM. We were able to
obtain a fast incremental multibody reconstruction across long real-world sequences. Many
real-world dynamic scenes involving a smoothly moving monocular camera, and scenes with
high correlation between the camera and moving object motion (e.g. road scenes with moving
cars), which are considered unobservable even in the latest state-of-the-art systems, can now
be reconstructed with reasonable accuracy. We introduced a novel two-view geometric
constraint, capable of detecting moving objects followed by a moving camera in the same
direction, a so-called degenerate configuration where the commonly used epipolar constraint
fails. This is made possible by exploiting the knowledge of the camera motion to estimate a
bound on the image feature position along the epipolar line. A probabilistic framework
propagates the uncertainties in the system and recursively updates the probability of a feature
being stationary or dynamic. The different modules of motion segmentation, visual SLAM and
moving object tracking were integrated, and we showed how each module helps the others. We
presented a particle filter based BOT algorithm which integrates multiple cues from the
reconstruction pipeline. The integrated system can simultaneously perform real-time multibody
visual SLAM, track multiple moving objects and maintain a unified representation of them,
using only a single monocular camera. The work presented here can find immediate use in
various robotics applications involving dynamic scenes.
7.2 Future Work
I feel that the work on “multibody Visual SLAM” is still in its nascent stages, and a lot of
work needs to be done in this area by the robot vision community. I hope to move in that
direction during my PhD studies. Some of the immediate improvements that can be applied to
the system are as follows.
In this thesis, we have limited ourselves to multiview geometric constraints. These cues are
complementary to semantic information such as appearance and object detection/categorization
cues, which can be exploited for richer and better information about the scene. For example, if
object detection were performed on top of this system, more complex or object-specific motion
models (e.g. non-holonomic constraints for cars) could be used. Object categories can also
provide a priori known object sizes, which can reduce other uncertainties.
Ess et al. [14, 13] describe a mobile vision system based on a stereo camera which makes use
of appearance-based object detection in a tracking-by-detection framework to track multiple
pedestrians in a highly dynamic and challenging environment. More recently, [3] combined
semantic information with traditional SfM to obtain a better description of the environment.
One important extension is a more elegant and compact representation of the whole multibody
Visual SLAM problem in a graphical model framework. A proper model will allow us to
integrate the different modules and cues in a more elegant and efficient manner, and can also
benefit from recent advances in inference algorithms from the machine learning community.
There is also a need for improvement in the reconstruction of small moving objects like cars
or people. Standard feature tracking often does not provide enough feature tracks for proper
reconstruction. Adaptive model based tracking like [25] can lead to better results.
Bibliography
[1] S. Avidan and A. Shashua. Trajectory triangulation: 3D reconstruction of moving points
from a monocular image sequence. PAMI, 22(4):348–357, 2002.
[2] T. Bailey and H. Durrant-Whyte. Simultaneous localization and mapping (SLAM): Part
II. IEEE Robotics & Automation Magazine, 13(3):108–117, 2006.
[3] S. Bao and S. Savarese. Semantic structure from motion. CVPR, 2011.
[4] H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust features. In ECCV,
pages 404–417, 2006.
[5] T. Brehard and J. Le Cadre. Hierarchical particle filter for bearings-only tracking. IEEE
TAES, 43(4):1567–1585, 2008.
[6] G. Brostow, J. Fauqueur, and R. Cipolla. Semantic object classes in video: A high-
definition ground truth database. PRL, 30(2):88–97, 2009.
[7] T. Brox, C. Bregler, and J. Malik. Large displacement optical flow. In CVPR, 2009.
[8] M. Calonder, V. Lepetit, C. Strecha, and P. Fua. BRIEF: Binary Robust Independent
Elementary Features. In ECCV. Springer, 2010.
[9] J. Civera, A. Davison, and J. Montiel. Inverse depth parametrization for monocular SLAM.
IEEE Transactions on Robotics, 24(5):932–945, 2008.
[10] A. Comport, E. Malis, and P. Rives. Real-time Quadrifocal Visual Odometry. IJRR,
29(2-3):245, 2010.
[11] A. Davison, I. Reid, N. Molton, and O. Stasse. MonoSLAM: Real-time single camera
SLAM. PAMI, 29(6):1052–1067, 2007.
[12] K. Egemen Ozden, K. Cornelis, L. Van Eycken, and L. Van Gool. Reconstructing 3D
trajectories of independently moving objects using generic constraints. CVIU, 96(3):453–
471, 2004.
[13] A. Ess, B. Leibe, K. Schindler, and L. V. Gool. Robust multi-person tracking from a
mobile platform. PAMI, 31(10):1831–1846, 2009.
[14] A. Ess, B. Leibe, K. Schindler, and L. Van Gool. Moving obstacle detection in highly
dynamic scenes. In ICRA, 2009.
[15] O. Faugeras, Q. Luong, and T. Papadopoulo. The geometry of multiple images. MIT press,
2001.
[16] M. Fischler and R. Bolles. Random sample consensus: A paradigm for model fitting with
applications to image analysis and automated cartography. Communications of the ACM,
24(6):381–395, 1981.
[17] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge
University Press, 2004.
[18] M. Irani and P. Anandan. A unified approach to moving object detection in 2D and 3D
scenes. PAMI, 20(6):577–589, 1998.
[19] M. Irani, B. Rousso, and S. Peleg. Recovery of ego-motion using region alignment. PAMI,
19(3):268–272, 1997.
[20] B. Jung and G. Sukhatme. Real-time motion tracking from a mobile robot. International
Journal of Social Robotics, 2(1):63–78, 2010.
[21] G. Klein and D. Murray. Parallel tracking and mapping for small AR workspaces. In
ISMAR, 2007.
[22] K. Konolige and M. Agrawal. Frameslam: From bundle adjustment to real-time visual
mapping. IEEE Transactions on Robotics, 24(5):1066–1077, 2008.
[23] J. Kwon and K. Lee. Monocular SLAM with Locally Planar Landmarks via Geometric
Rao-Blackwellized Particle Filtering on Lie Groups. In CVPR, 2010.
[24] J.-P. Le Cadre and O. Tremois. Bearings-only tracking for maneuvering sources. IEEE
TAES, 34(1):179 –193, 1998.
[25] M. Leotta and J. Mundy. Vehicle surveillance with a generic, adaptive, 3d vehicle model.
PAMI, 33(7):1457 –1469, 2011.
[26] V. Lepetit, F. Moreno-Noguer, and P. Fua. Epnp: An accurate o (n) solution to the pnp
problem. IJCV, 81(2):155–166, 2009.
[27] K. Lin and C. Wang. Stereo-based Simultaneous Localization, Mapping and Moving Object
Tracking. In IROS, 2010.
[28] M. Lourakis, A. Argyros, and S. Orphanoudakis. Independent 3D Motion Detection Using
Residual Parallax Normal Flow. In ICCV, 1998.
[29] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal
of Computer Vision (IJCV), 60(2):91–110, 2004.
[30] Y. Ma, S. Soatto, and J. Kosecka. An invitation to 3-d vision: from images to geometric
models. Springer Verlag, 2004.
[31] D. Migliore, R. Rigamonti, D. Marzorati, M. Matteucci, and D. G. Sorrenti. Avoiding
moving outliers in visual SLAM by tracking moving objects. In ICRA’09 Workshop on
Safe navigation in open and dynamic environments, 2009.
[32] E. Mouragnon, M. Lhuillier, M. Dhome, F. Dekeyser, and P. Sayd. Real time localization
and 3d reconstruction. In CVPR, 2006.
[33] J. Neira, A. Davison, and J. Leonard. Guest editorial, special issue in visual slam. IEEE
T-RO, 24(5):929–931, 2008.
[34] R. Newcombe and A. Davison. Live dense reconstruction with a single moving camera. In
CVPR, 2010.
[35] D. Nister. An efficient solution to the five-point relative pose problem. PAMI, 26(6):756–
770, 2004.
[36] D. Nister, O. Naroditsky, and J. Bergen. Visual odometry. In CVPR, 2004.
[37] K. E. Ozden, K. Schindler, and L. V. Gool. Multibody structure-from-motion in practice.
PAMI, 32:1134–1141, 2010.
[38] H. S. Park, I. Matthews, and Y. Sheikh. 3d reconstruction of a moving point from a series
of 2d projections. In ECCV, 2010.
[39] S. Pundlik and S. Birchfield. Motion segmentation at any speed. In Proceedings of British
Machine Vision Conference (BMVC), 2006.
[40] S. Rao, A. Yang, S. Sastry, and Y. Ma. Robust Algebraic Segmentation of Mixed Rigid-
Body and Planar Motions from Two Views. IJCV, 2010.
[41] E. Rosten, R. Porter, and T. Drummond. Faster and better: A machine learning approach
to corner detection. PAMI, 32:105–119, 2010.
[42] H. Sawhney. 3D geometry from planar parallax. In Computer Vision and Pattern Recog-
nition, 1994.
[43] H. Sawhney, Y. Guo, and R. Kumar. Independent motion detection in 3D scenes. PAMI,
22(10):1191–1199, 2000.
[44] K. Schindler and D. Suter. Two-view multibody structure-and-motion with outliers
through model selection. PAMI, 28(6):983–995, 2006.
[45] G. Schweighofer and A. Pinz. Robust pose estimation from a planar target. PAMI, pages
2024–2030, 2006.
[46] J. Shi and C. Tomasi. Good features to track. In CVPR, pages 593–600, 1993.
[47] G. Sibley, L. Matthies, and G. Sukhatme. A Sliding Window Filter for Incremental SLAM.
Unifying Perspectives in Computational and Robot Vision, pages 103–112, 2008.
[48] M. Smith, I. Baldwin, W. Churchill, R. Paul, and P. Newman. The new college vision and
laser data set. IJRR, 28(5):595, 2009.
[49] J. Sola. Towards visual localization, mapping and moving objects tracking by a mobile
robot: a geometric and probabilistic approach. PhD thesis, LAAS, 2007.
[50] H. Strasdat, J. Montiel, and A. Davison. Real-Time Monocular SLAM: Why Filter? In
ICRA, 2010.
[51] H. Strasdat, J. Montiel, and A. Davison. Scale Drift-Aware Large Scale Monocular SLAM.
In RSS, 2010.
[52] K. Strobl and G. Hirzinger. Optimal hand-eye calibration. In IROS, 2006.
[53] D. Sun, S. Roth, and M. J. Black. Secrets of optical flow estimation and their principles.
In CVPR, 2010.
[54] S. Thrun. Robotic mapping: A survey. In Exploring Artificial Intelligence in the New
Millenium. Morgan Kaufmann, 2002.
[55] S. Thrun, W. Burgard, and D. Fox. Probabilistic Robotics. MIT Press, 2005.
[56] C. Tomasi and T. Kanade. Detection and tracking of point features. Technical Report
CMU-CS-91-132, Carnegie Mellon University, 1991.
[57] L. Valgaerts, A. Bruhn, M. Mainberger, and J. Weickert. Dense versus sparse approaches
for estimating the fundamental matrix. International Journal of Computer Vision (IJCV),
pages 1–23.
[58] R. Vidal, Y. Ma, S. Soatto, and S. Sastry. Two-view multibody structure from motion.
IJCV, 68(1):7–25, 2006.
[59] C. Wang. Extrinsic calibration of a vision sensor mounted on a robot. IEEE Trans. Robotics
and Automation, 8(2):161–175, 1992.
[60] C. Wang, C. Thorpe, S. Thrun, M. Hebert, and H. Durrant-Whyte. Simultaneous local-
ization, mapping and moving object tracking. IJRR, 26(9):889–916, 2007.
[61] J. Wang and E. Adelson. Layered representation for motion analysis. In CVPR, 1993.
[62] S. Wangsiripitak and D. Murray. Avoiding moving outliers in visual SLAM by tracking
moving objects. In ICRA, 2009.
[63] M. Werlberger, W. Trobin, T. Pock, A. Wedel, D. Cremers, and H. Bischof. Anisotropic
Huber-L1 optical flow. In Proceedings of the British Machine Vision Conference (BMVC),
September 2009.
[64] B. Wu and R. Nevatia. Detection of multiple, partially occluded humans in a single image
by bayesian combination of edgelet part detectors. In ICCV, 2005.
[65] K. M. Wurm, A. Hornung, M. Bennewitz, C. Stachniss, and W. Burgard. OctoMap: A
probabilistic, flexible, and compact 3D map representation for robotic systems. In ICRA
2010 Workshop on Best Practice in 3D Perception and Modeling for Mobile Manipulation,
2010.
[66] C. Yuan, G. Medioni, J. Kang, and I. Cohen. Detecting motion regions in the presence
of a strong parallax from a moving camera by multiview geometric constraints. PAMI,
29(9):1627–1641, 2007.
[67] Z. Zivkovic and B. Krose. Part based people detection using 2D range data and images.
In IROS, 2007.