Monocular Multibody Visual SLAM
Thesis submitted in partial fulfillment
of the requirements for the degree of
MS by Research
in
Computer Science and Engineering
by
Abhijit Kundu
200807030
Robotics Research Lab
International Institute of Information Technology
Hyderabad - 500 032, INDIA
April 2011
Copyright © Abhijit Kundu, 2011
All Rights Reserved
International Institute of Information Technology
Hyderabad, India
CERTIFICATE
It is certified that the work contained in this thesis, titled “Monocular Multibody Visual SLAM”
by Abhijit Kundu, has been carried out under our supervision and is not submitted elsewhere
for a degree.
Date Advisers: Dr. K. Madhava Krishna and Dr. C. V. Jawahar
To Robots and Humans trying to achieve Singularity
Acknowledgments
I am extremely grateful to my advisors, Dr. Madhava Krishna and Prof. C. V. Jawahar,
for their guidance, help, and encouragement. Specifically, I am thankful to them for the
thought-provoking and insightful discussions about this work during the weekly meetings over
the last two years. And in the first place, I am thankful to Dr. Krishna for helping me join
the MS program and the Robotics Lab here.
Acknowledgments are also due to all fellow students and colleagues at IIIT Hyderabad for
their ideas and comments on my research, technical discussions, and most importantly their
friendship. I feel lucky to be a member of a wonderful research community at IIIT.
Finally, I thank my parents and family for supporting me during my studies, not just here,
but throughout my whole life.
Abstract
Vision based SLAM [11, 21, 23, 33, 51] and SfM systems [17] have been the subject of
much research and are finding applications in many areas such as robotics, augmented reality,
and city mapping. But almost all these approaches assume a static environment, containing
only rigid, non-moving objects. Moving objects are treated the same way as outliers and
filtered out using robust statistics like RANSAC. Though this may be a feasible solution in
less dynamic environments, it soon fails as the environment becomes more and more dynamic.
Moreover, accounting for both the static and moving objects provides richer information about
the environment. A robust solution to the SLAM problem in dynamic environments will expand
the potential for robotic applications, especially in applications which operate in close proximity
to human beings and other robots. Robots will be able to work not only for people but also
with people.
This thesis presents a realtime, incremental multibody visual SLAM system that allows
choosing between full 3D reconstruction and simple tracking of the moving objects. Motion
reconstruction of dynamic points or objects from a monocular camera is considered very hard
due to well known problems of observability. We attempt to solve the problem with Bearing
only Tracking (BOT) and by integrating multiple cues to avoid observability issues. The BOT
is accomplished through a particle filter, and by integrating multiple cues from the reconstruction
pipeline. With the help of these cues, many real world scenarios which are considered
unobservable with a monocular camera are solved to reasonable accuracy. This enables the
building of a unified dynamic 3D map of scenes involving multiple moving objects. Tracking
and reconstruction are preceded by motion segmentation and detection, which makes use of
efficient geometric constraints to detect difficult degenerate motions, where objects move in
the epipolar plane. Results reported on multiple challenging real world image sequences verify
the efficacy of the proposed framework.
Own Publications
[1] Abhijit Kundu, C. V. Jawahar and K. M. Krishna. Realtime Multibody Visual SLAM with
a Smoothly Moving Monocular Camera. International Conference on Computer Vision
(ICCV). 2011. (Accepted)
[2] Abhijit Kundu, K. M. Krishna and C. V. Jawahar. Realtime Motion Segmentation based
Multibody Visual SLAM. Indian Conference on Computer Vision, Graphics and Image
Processing (ICVGIP). 2010. (Best Paper Award)
[3] Abhijit Kundu, C. V. Jawahar and K. M. Krishna. Realtime Moving Object Detection
from a Freely Moving Monocular Camera. IEEE International Conference on Robotics
and Biomimetics (ROBIO). 2010.
[4] Abhijit Kundu and K. M. Krishna. Moving Object Detection by Multi-View Geometric
Techniques from a Single Camera Mounted Robot. IEEE/RSJ International Conference
on Intelligent Robots and Systems (IROS). 2009.
Publications can be downloaded from http://abhijitkundu.info/
Contents
Chapter Page
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
Own Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Background and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Why SLAM? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Why Monocular Vision? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.3 Why Multibody Visual SLAM? . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Related work and Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Thesis Overview and layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.2 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.3 Thesis layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2 Feature Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1 Feature Detectors and Descriptors . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Feature Matching Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Dense Feature Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3 Moving Object Detection and Segmentation . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2 Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.4 Initialization of Motion Segmentation . . . . . . . . . . . . . . . . . . . . . . . . 17
3.5 Geometric Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.5.1 Epipolar Constraint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.5.2 Flow Vector Bound (FVB) Constraint . . . . . . . . . . . . . . . . . . . . 18
3.6 Independent Motion Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.7 Clustering Unmodeled Motions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.8 Computation of Fundamental Matrix from Odometry . . . . . . . . . . . . . . . . 21
3.8.1 Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.8.2 Robot-Camera Calibration . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.8.3 Preventing Odometry Noise . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.9 Results of Moving Object Detection . . . . . . . . . . . . . . . . . . . . . . . . 22
3.9.1 Robot mounted Camera Sequence . . . . . . . . . . . . . . . . . . . . . 23
3.9.2 Handheld Indoor Lab Sequence . . . . . . . . . . . . . . . . . . . . . . . 26
3.9.3 Detection of Degenerate Motions . . . . . . . . . . . . . . . . . . . . . . 27
3.9.4 Person detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4 Visual SLAM Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.1 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Visual SLAM Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3 Feedback from Motion Segmentation . . . . . . . . . . . . . . . . . . . . . . . . 31
4.4 Dealing Degenerate Configurations . . . . . . . . . . . . . . . . . . . . . . . . . 31
5 Moving Object Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.1 Particle Filter based BOT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.1.1 Ground Plane, Depth Bound & Size Bound . . . . . . . . . . . . . . . . 34
5.1.2 Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.1.3 Integrating Depth and Velocity Constraints . . . . . . . . . . . . . . . . 35
6 Unification: putting everything together . . . . . . . . . . . . . . . . . . . . . . . . 36
6.1 Relative Scale Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
6.2 Feedback from SfM to BOT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
6.3 Dynamic 3D Occupancy Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6.4 Multibody Reconstruction Results . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.4.1 Camvid Sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.4.2 New College Sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
6.4.3 Versailles Rond Sequence . . . . . . . . . . . . . . . . . . . . . . . . . . 41
6.4.4 Moving Box Sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6.5.1 Comparison of different cues to BOT . . . . . . . . . . . . . . . . . . . . 42
6.5.2 Smooth Camera Motions . . . . . . . . . . . . . . . . . . . . . . . . . . 42
7 Conclusion and Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
7.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Chapter 1
Introduction
For mobile robots to be able to work with and for people, and thus operate in our everyday
environments, they need to be able to acquire knowledge through perception. In other words,
they need to collect sensor measurements from which they extract meaningful information about
the scene. Vision is an extraordinarily powerful sense for this. The capability of computers
to “see” has been the vision of computer vision and robot vision researchers for many years.
We humans do this effortlessly, and it often seems very straightforward, as portrayed in much
popular science fiction (see Fig. 1.1). However, the lexical simplicity of this objective of “seeing”
hides a very complex reality that people are very often tricked by. Even renowned researchers in
the field have fallen into the trap: the anecdote that Marvin Minsky, an Artificial Intelligence
pioneer from MIT, assigned the computer vision problem as a summer project to a student
back in the sixties is an illustrative and well-known example. We still lack a complete
understanding of how our own human perception works, and computers still cannot
see. But we have made some great advances in the field too. One such example is the
field of Structure from Motion (SfM) [17], which takes as input the 2D motion from images
and seeks to infer, in a totally automated manner, the 3D structure of the viewed scene and the
camera locations where the images were captured. Both SfM from computer vision and the
Figure 1.1 Depiction of robot vision in popular fiction. The above scenes are from the popular movie series “Terminator”. Though it may seem quite natural for robots to interpret scenes and road traffic and to avoid obstacles, this has ended up being one of the hardest problems for Artificial Intelligence.
Simultaneous Localization and Mapping (SLAM) [55, 2, 33] in mobile robotics research do
the same job of estimating sensor motion and the structure of an unknown static environment.
One important motivation behind this is to estimate the 3D scene structure and camera
motion from an image sequence in realtime so as to help guide robots.
However, almost all existing work on vision based SLAM makes a big assumption about the
environment: that it should be static. This thesis is about my efforts towards extending visual
SLAM/SfM to dynamic environments containing multiple moving objects. The objective is
to push the boundaries of the current SLAM/SfM literature, which is based on a static world
assumption, to obtain the 3D structure and camera trajectory w.r.t. both static and moving
objects in an environment.
1.1 Background and Motivation
SLAM involves simultaneously estimating locations of newly perceived landmarks and the
location of the robot itself while incrementally building a map of an unknown environment.
Over the last decade, SLAM has been one of the most active research fields in robotics, and
excellent results have been reported by many researchers [55, 2, 54], predominantly using laser
range-finder sensors to build 2-D maps of planar environments. Though accurate, laser range-
finders are expensive and bulky, so many researchers turned to cameras, which provide low-cost,
full 3-D and much richer, intuitive “human-like” information about the environment. So the
last decade also saw significant development in vision based SLAM systems [11, 33, 36, 21].
1.1.1 Why SLAM?
We humans have also been successfully practising SLAM, mostly unconsciously. While
observing the environment through our eyes, our brain combines the observations with significant
assumptions and prior knowledge, and reports our location and the nature of the surroundings,
both quantitatively and qualitatively. For instance, the localisation question might be answered
as being at home, or running down a street at about 10 km/h. The mapping result might be an
abstract conception of a floor plan, or a set of topological relationships between places of interest.
For the task of robotic or vehicular navigation, the most useful localisation and mapping output
is geometric in nature, where pose and structure estimates are represented in specific coordinate
systems and parameterisations. In other words, we need to understand what is really happening
in three dimensions, and a complete reconstruction of the 3D geometry of the scene becomes
almost inevitable.
Accurate and reliable SLAM is thus crucial for the autonomy of a robot in unknown surroundings.
Even if the autonomous navigation component of the SLAM platform is passive, as in the case
Figure 1.2 A comparison of reactive vs. model-based approaches for a simple robot cleaning task. The model-based approach provides far more advantages, and thus illustrates the use of SLAM, which allows for building such a model. Illustration courtesy ETH Zurich Robotics Lab.
of a hand-held camera or a tele-operated robot, or when it is merely following a human or
another robot, SLAM gives a richer quantitative knowledge of its motion and the environment.
The accrued map is also useful as a persistent description of the explored environment and an
efficient tool for higher-level tasks like path planning or navigation. To illustrate this
point, let us take the example of a case where a robot needs to clean an indoor room (see Fig. 1.2).
One approach is to ask the robot to start cleaning and move straight till it reaches an obstacle,
where it may choose a different direction. This is termed a reactive agent. But this is far
from efficient, and does not answer some important questions: Where is the robot at any
given time? Can we guarantee that the whole room will be cleaned? Which direction should it
move next? What should it do if it needs to recharge or refill? The other approach is a model based
approach, where the robot makes use of a model/representation of the room, which can be used
for efficient planning of the robot trajectory and also to find its way back to the recharging dock
when finished. Such a model of the environment can be obtained from SLAM.
1.1.2 Why Monocular Vision?
This is a very crucial and common question, because a significant portion of the effort in
unmanned vehicle systems has gone into using LIDARs, IMUs, GPS and stereo cameras. There
are some obvious disadvantages to this expensive hardware: systems with too many moving
parts, two cameras meaning double the image processing, and a fragile structure that means
endless calibrations. A stereo rig is also unable to accurately measure large distances, which
makes it impossible to consider remote objects: at large distances, the views of both cameras
are exactly the same, and a stereo bench provides exactly the same information as one single
camera. Performing SLAM with a single video camera, while an attractive prospect, adds its
own particular difficulties to the already considerable general challenges of the problem. However,
the added study and intelligence needed to make this feasible increases the robustness of the
system, for disaster cases when, of all the sensors, only a single camera is left working. I am
not against the usage of additional hardware,
but we should make an effort to extract as much information from mono vision as possible.
This way we will always get the advantages, and none of the drawbacks. Fig. 1.3 illustrates
some of the reasons in support of the study of monocular robot vision. Similar observations
have also been made in Chapter 1 of [49].
Figure 1.3 Why Monocular? Left: A snapshot of a car racing game which provides the player with a monocular image of a virtual reality. This gives us an idea of the scope of monocular systems. Middle: An autonomous vehicle from MIT fitted with more than five expensive LIDARs and IMUs. Right: An illustration of the need for monocular capability for robust robots!
1.1.3 Why Multibody Visual SLAM?
Vision based SLAM [11, 21, 33, 36, 32, 51] and SfM systems [17, 15] have been the subject of
much investigation and research. But almost all these approaches assume a static environment,
containing only rigid, non-moving objects. Moving objects are treated the same way as outliers
and filtered out using robust statistics like RANSAC [16]. Though this may be a feasible
solution in less dynamic environments, it soon fails as the environment becomes more
and more dynamic. Moreover, accounting for both the static and moving objects provides richer
information about the environment. For an autonomous robot or vehicle in a dynamic
environment, we will have to detect and keep track of the other moving objects, try
to identify or at least obtain some description of them, and maintain reasonably good information
on their positions and velocities. A robust solution to the SLAM problem in dynamic environments
will expand the potential for robotic applications, especially in applications which operate in close
proximity to human beings and other robots. As put by [60], robots will be able to work not
only for people but also with people.
1.2 Related work and Contributions
The last decade saw many developments in the “multibody” extension [40, 44, 58] to multi-
view geometry. These methods are natural generalizations of classical structure from motion
theory [17, 15] to the challenging case of dynamic scenes involving multiple rigid-body motions.
Thus, given a set of feature trajectories belonging to different independently moving bodies,
multibody SfM estimates the number of moving objects in the scene, clusters the trajectories
on the basis of motion, and then estimates the model, i.e. the relative camera pose and 3D structure,
w.r.t. each body. However, all of them have focused only on the theoretical and mathematical aspects
of the problem and have experimented on very short sequences, with either manually extracted
or noise-free feature trajectories. The high computation cost, frequent non-convergence of
the solutions, and highly demanding assumptions have all prevented them from being applied to
real-world sequences. Only recently did Ozden et al. [37] discuss some of the practical issues that
come up in multibody SfM. In contrast, we propose a multibody visual SLAM system, which is
a realtime, incremental adaptation of multibody SfM. The proposed framework still
offers the flexibility of choosing the objects that need to be reconstructed. Objects not chosen
for reconstruction are simply tracked. This is helpful, since certain applications may just need
to know the presence of moving objects rather than their full 3D structure, or there may not be
enough computational resources for realtime reconstruction of all moving objects in the scene.
The proposed system is a tightly coupled integration of the various modules of feature tracking,
motion segmentation, visual SLAM, and moving object tracking, exploring various feedback
paths between these modules. Fig. 1.4 illustrates the system pipeline and the outputs of the
different modules.
Reconstructing the 3D trajectory of a moving point from a monocular camera is ill-posed: it is
impossible without making some assumptions about the way it moves. However, object motions
are not random, and can be parameterised by different motion models. Typical assumptions
have been that a point moves along a line, a conic, or a plane [1], or more recently as a
linear combination of basis trajectories [38]. Target tracking from bearings-only sensors
(which is also the case for a monocular camera) has also been studied extensively in the “Bearings-
only Tracking” (BOT) literature [5, 24], where statistical filters seem to be the method of
choice. This same monocular observability problem gives rise to the so called “relative scale
problem” [12, 37] in multibody SfM. In other words, since each independently moving body has
its 3D structure and camera motion estimated in its own scale, we obtain a one-parameter
family of possible relative trajectories per moving object w.r.t. the static world. This needs to
be resolved for a realistic, unified reconstruction of the static and moving parts together. Ozden
et al. [12] exploited the increased coupling between camera and object translations that tends
to appear at false scales, and the resulting non-accidentalness of the object trajectory. However,
their approach is mostly batch processing, wherein trajectory data over time is reconstructed
for all possible scales, and the trajectory which, say, is most planar is chosen by virtue
of it being unlikely to occur accidentally. Instead, we take a different approach by making use
of a particle filter based bearing only tracker to estimate the correct scale and the associated
uncertainty (see Sec. 6.1).
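The idea behind such a particle filter can be illustrated with a minimal sketch. The code below is not the thesis implementation; it only shows the core mechanics under simplifying assumptions of our own: particles are 3D position hypotheses spread along the observed bearing ray (the one-parameter scale family), reweighted by the angular error of their predicted bearings as the camera moves, with systematic resampling when the effective sample size drops. The function name and noise parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def bot_update(particles, weights, bearing_obs, cam_pos, sigma=0.05):
    """One bearings-only tracking update: reweight 3D particle positions by how
    well their predicted bearing (seen from the current camera position) matches
    the observed unit bearing vector, then resample if weights have degenerated."""
    # Predicted unit bearings from the camera to each particle hypothesis.
    rays = particles - cam_pos
    rays = rays / np.linalg.norm(rays, axis=1, keepdims=True)
    # Gaussian likelihood on the angular error (sigma in radians).
    err = np.arccos(np.clip(rays @ bearing_obs, -1.0, 1.0))
    weights = weights * np.exp(-0.5 * (err / sigma) ** 2)
    weights = weights / weights.sum()
    # Systematic resampling when the effective sample size collapses.
    if 1.0 / (weights ** 2).sum() < 0.5 * len(weights):
        idx = rng.choice(len(weights), size=len(weights), p=weights)
        particles = particles[idx]
        weights = np.full(len(weights), 1.0 / len(weights))
    return particles, weights
```

Initializing the particles at uniformly spaced depths along the first observed ray encodes the unresolved scale; each subsequent camera translation then tightens the depth (and hence scale) posterior, which is the behaviour the cues of Sec. 6.1 accelerate.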
In realtime visual SLAM systems, moving objects have not yet been dealt with properly. In [62],
a 3D model based tracker runs in parallel with MonoSLAM [11] for tracking a previously
modeled moving object. This prevents the visual SLAM framework from incorporating moving
features lying on that moving object. But the proposed approach does not perform moving
object detection; so moving features, apart from those lying on the tracked moving object, can
still corrupt the SLAM estimation. Sola [49] does an observability analysis of detecting and
tracking moving objects with monocular vision. To bypass the observability issues with mono-
vision, he proposes a BiCamSLAM [49] solution with stereo cameras. A similar stereo solution
has also been proposed recently by [27]. The work by Migliore et al. [31] maintains two separate
filters: a MonoSLAM filter [11] with the static features and a BOT for the moving features.
All these methods [27, 49, 62] have a common framework in which a single filtering based
SLAM [11] on the static parts is combined with moving object tracking (MOT), which is often
termed SLAMMOT [27]. Unlike SLAMMOT, we adopt a multibody SfM kind of approach
where multiple moving objects are also fully reconstructed simultaneously, but our framework
still allows simple tracking if full 3D structure estimation of a moving object is not needed.
As concluded in Sec. 4.2 of [38] and also in the BOT literature, dynamic reconstruction with
mono-vision works well only when object and camera motion are non-correlated. To ensure this,
existing methods resorted to spiral camera motions [27], multiple photographers [38] or uncorrelated
camera-object motion [12]. We do not place any restrictive assumptions on the camera
motion or environment. Instead, we extract more information from the reconstruction pipeline in
the form of cues, which are then used to constrain the uncertainty in moving object reconstruction.
The framework thus works for the difficult case of smoothly moving cameras, wherein object
and camera motion are highly correlated.
1.3 Thesis Overview and layout
1.3.1 Problem Statement
We propose a realtime, incremental multibody visual SLAM algorithm. The final system
integrates feature tracking, motion segmentation, visual SLAM and moving object tracking.
We introduce several feedback paths among these modules, which enable them to mutually
benefit each other. The input to the system is the image stream from a single moving monocular
camera, and we need to produce the following outputs in realtime:
Figure 1.4 The input to our system is a monocular image sequence. The various modules of feature tracking, motion segmentation, visual SLAM and moving object tracking are interleaved and run online. The final result is an integrated dynamic map of the scene including the 3D structure and 3D trajectory of the camera, static world and moving objects.
• Moving Object Detection.
• 3D reconstruction of static world points.
• 3D reconstruction of moving objects.
• 6DOF Camera trajectory in 3D.
• 6DOF Moving object trajectory in 3D.
• Integrated Dynamic 3D map of the environment.
1.3.2 System Overview
Fig. 1.4 illustrates the system pipeline and the outputs of the different modules. The feature
tracking module tracks existing feature points, while new features are instantiated. The purpose
of the motion segmentation module is to segment these feature tracks belonging to different
motion bodies, and to maintain this segmentation as new frames arrive. In the initialization
step, an algebraic multibody motion segmentation algorithm is used to segment the scene into
multiple rigidly moving objects. A decision is made as to which objects will undergo
the full 3D structure and camera motion estimation. The background object is always chosen
to undergo the full 3D reconstruction and camera motion estimation process. Other objects
may either undergo full SfM estimation or just be simply tracked, depending on their suitability
for SfM estimation or the application demand. On the objects chosen for reconstruction, the standard
monocular visual SLAM pipeline is used to obtain the 3D structure and camera pose relative to
that object. For these objects, we compute a probabilistic likelihood that a feature is moving
along with or independently of that object. These probabilities are recursively updated as the
Figure 1.5 The input to our system is a monocular image sequence. The output is a realtime multibody reconstruction of the scene. This is a snapshot of the video results attached with this thesis.
features are tracked. The probabilities also take care of the uncertainty in the pose estimation by the
visual SLAM module. Features with a low likelihood of fitting one model are either mismatched
features arising due to tracking error, or features belonging to some other reconstructed
object or to one of the unmodeled independently moving objects. For the unmodeled moving
objects, we use spatial proximity and motion coherence to cluster the residual feature tracks
into independently moving entities.
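As a rough illustration of such a recursive likelihood (not the exact formulation developed in Chap. 3), a per-feature probability of independent motion can be maintained with a Bayes update over a per-frame geometric residual, e.g. the feature's distance to its epipolar line under the body's motion model. The Gaussian static model, the flat moving model, and all parameter values below are illustrative assumptions of ours:

```python
import numpy as np

def update_motion_probability(p_prev, residual, sigma_static=1.0, scale_moving=10.0):
    """Recursive Bayes update of the probability that a feature moves
    independently of a given body. `residual` is the feature's geometric error
    (e.g. pixel distance to its epipolar line) under that body's motion model.
    Static features are modelled with a zero-mean Gaussian residual; independently
    moving features with a broad, near-uniform residual density."""
    # Likelihoods of the observed residual under each hypothesis.
    l_static = np.exp(-0.5 * (residual / sigma_static) ** 2) \
        / (np.sqrt(2 * np.pi) * sigma_static)
    l_moving = 1.0 / scale_moving   # flat density over plausible residuals
    num = l_moving * p_prev
    return num / (num + l_static * (1.0 - p_prev))
```

Starting from an uninformative prior of 0.5, a run of small residuals drives the probability towards 0 (the feature is consistent with the body) and a run of large residuals drives it towards 1, so a single noisy frame cannot flip the decision.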
The individual modules of feature tracking, motion segmentation and visual SLAM are
tightly coupled, and various feedback paths between them are explored, which benefit each
other. The motion model of a reconstructed object estimated by the visual SLAM module
helps in improving the feature tracking. Relative camera pose estimates from SLAM are used
by the motion segmentation module to compute probabilistic model-fitness. The uncertainty in
the camera pose estimate is also propagated into this computation, so as to yield robust model-
fitness scores. The computation of the 3D structure also helps in setting a tighter bound in
the geometric constraints, which results in more accurate independent motion detection. These
results from the motion segmentation are fed back to the visual SLAM module. The motion
segmentation prevents independent motion from corrupting the structure and motion estimation
by the visual SLAM module. This also ensures fewer outliers in the reconstruction
process of a particular object, so we need fewer RANSAC iterations [16], resulting
in improved speed in the visual SLAM module. We then describe motion cues coming from the SfM
estimate of a moving object and several geometric cues imposing constraints on the possible
depths and velocities, made possible by the reconstruction of the static world. Integration of multiple
cues immensely reduces the space of possible trajectories and provides an online solution for
the relative scale problem. This enables a unified representation of the scene containing the 3D
structure of the static world and moving objects, and the 3D trajectory of the camera and moving
objects, along with the associated uncertainty.
1.3.3 Thesis layout
This thesis can be visualized as the amalgamation of four different modules, namely
• Feature Tracking
• Motion detection and segmentation
• Visual SLAM
• Moving Object Tracking
In Chap. 2, we briefly present the feature tracking methodology tested and used for the
system. Then, we discuss the multibody motion segmentation and detection framework in
Chap. 3. We explain the geometric constraints used for detecting independent motion and also
present a robust probability framework for the same. This is followed by the discussion of our
visual SLAM framework in Chap. 4, which uses efficient Lie group theory. We present the particle
filter based moving object tracking in Chap. 5. We also present various cues from the visual
SLAM module and the process of integrating them into the tracking framework. Finally, in Chap. 6,
we put everything together to build a unified map of the dynamic scene. We present the relative
scale problem, and how the cues from the visual SLAM and moving object tracking modules are
used to overcome it. In this chapter, we also show the final results of the proposed
multibody reconstruction system on multiple real image datasets.
Chapter 2
Feature Tracking
The feature tracking module tracks existing feature points, while new features are instantiated.
It is an important sub-module that needs to be improved for multibody visual SLAM to
take place. Contrary to conventional SLAM, where the features belonging to moving objects
are not important, we need to pay extra attention to feature tracking for multibody SLAM:
we should be able to get feature tracks on the moving bodies also. This is challenging, as
different bodies move at different speeds. Also, 3D reconstruction is only possible when there
are sufficient feature tracks for a particular body. However, relaxing the feature matching
threshold also invites more mismatches. This increase in outliers can even break the robust
motion segmentation, or lead it to wrong convergence. The tracking module is interleaved with
motion segmentation and visual SLAM, which allows it to benefit from these modules.
A short overview of the feature tracking methodology adopted by us is as follows. In each
image, a number of salient features are detected while ensuring the features are sufficiently
spread all over the image. Contrary to conventional visual SLAM, new features are added
almost every frame. However, only a subset of these, detected on certain keyframes, are made
into 3D points. The extra set of tracks helps in detecting independent motion. In order
to preserve feature tracks belonging to independent motions, we do not perform restrictive
matching initially. Instead, the feature matching is performed in two stages. In the first stage,
features are matched over a large range so as to allow matches belonging to moving objects.
A preliminary segmentation and motion estimate is made using this coarse matching. Finally
when the camera motion estimate is available, we resort to guided matching, which yields a
larger number of matches. In this stage, we make full use of the knowledge of camera motion
while matching features.
2.1 Feature Detectors and Descriptors
Feature tracking is traditionally achieved by a combination of a feature detector and a descriptor.
Feature detectors identify salient points, also called keypoints, in an image, while descriptors
compute a numerical description of these, usually as a signature vector describing the local
neighborhood around the keypoint. With each image we independently detect features and
match them across images with the help of the descriptors computed over these features. The
search space can be restricted to features inside a window around the original feature location
in the second image. Another (less popular nowadays) approach for obtaining feature tracks is
to track each feature within a small window around its original location in the image, as in
KLT [56, 46] or block matching based approaches. Feature detectors normally look for distinct
salient locations which can be localized precisely and are repeatable along the sequence. There
are several alternatives like Harris corners, Good Features to Track [46], SIFT [29], SURF [4]
and FAST corners [41]. FAST [41] learns a decision tree which is turned into C code to yield
an extremely fast corner detector with good repeatability. The downside is that features
cluster at edges. One important post-processing step after feature detection is to ensure a
uniform spread of the features across the whole image, as required for robust relative motion
computation. This can be enforced locally in FAST corners using non-maximal suppression [41],
and globally using a simple quadtree. With a quadtree representation of the image, each cell
is required to have a minimum number of features, which achieves the required spread of
features across the image.
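As an illustrative sketch of this spreading step (simplified to a fixed grid standing in for the quadtree; all names are hypothetical):

```python
# Hypothetical sketch: enforce a uniform spread of features by bucketing them
# into a fixed grid (a stand-in for the quadtree described above) and keeping
# only the strongest corners in each cell.

def spread_features(features, img_w, img_h, grid=4, max_per_cell=2):
    """features: list of (x, y, score); returns a spatially spread subset."""
    cells = {}
    for x, y, score in features:
        cx = min(int(x * grid / img_w), grid - 1)
        cy = min(int(y * grid / img_h), grid - 1)
        cells.setdefault((cx, cy), []).append((x, y, score))
    kept = []
    for pts in cells.values():
        pts.sort(key=lambda p: p[2], reverse=True)  # strongest first
        kept.extend(pts[:max_per_cell])
    return kept
```

A real quadtree would instead subdivide cells adaptively until each leaf holds the required minimum number of features.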
Descriptors like SIFT [29], SURF [4] and BRIEF [8] produce a signature which is quite
invariant to viewing direction, scale (distance), and minor illumination changes. Another choice,
specifically useful for visual SLAM systems, is to use the intensity information of a simple patch
of 8×8 pixels around the keypoint, warping it to account for the viewpoint change between
the patch's first observation and the current one, using the predicted estimate of the camera
orientation and position. This is not invariant to scale, the effect of which can be mitigated by
computing patches at different pyramidal levels of the image. The same is also effective for KLT
based tracking. This whole process is explained in more detail in Klein et al. [21]. The choice of
the detector + descriptor scheme is quite crucial for proper feature tracking. Often the best
choice varies for each dataset, and depends on a number of factors like frame rate (baseline),
the number of features required, image quality and the computational resources available. Some
good choices in our opinion are SIFT+SIFT, FAST+BRIEF, FAST + “warped image patch” and
FAST/SURF+SURF. We therefore alternate between the detector + descriptor combinations
mentioned above depending on the current dataset. However, in most of our experiments,
we have found FAST corners along with the “warped image patch” to be the best combination.
2.2 Feature Matching Constraints
Using descriptors or the warped image patches, the matching procedure boils down to a
nearest-neighbor search, so a correct correspondence will be very close in descriptor space.
For warped image patches, we compute a zero-mean SSD score, which offers some resilience to
lighting changes. For standard descriptors like SIFT and SURF, a Euclidean distance metric
works quite well. It is also common to use other metrics like the Hamming distance, as in [8].
Interleaving the feature tracking module with the visual SLAM module provides a couple of
other constraints which can be used to significantly improve the feature matching. They are
discussed next:
a) Adaptive Search Window: Between a pair of images, features are matched within a
fixed distance (window) from their location in one image. The size and shape of this window is
decided adaptively, based on the past motion of that particular body. For 3D points whose
depth has been computed by the visual SLAM module, the 1D epipolar search is reduced to a
region around the projection of the 3D point onto the image under the predicted camera pose.
b) Warp matrix for patch: When using the FAST + “warped image patch” scheme, we
can make use of the camera pose estimate from the visual SLAM module to apply an affine
warp to the image patches, maintaining view invariance between the patch's first and current
observations. If the depth of a patch is unknown, only a rotational warp is applied. For the
image patches of 3D points which have been triangulated, a full affine warp is performed.
This process is exactly the same as the patch search procedure in Klein et al. [21].
c) Occlusion Constraint: Motion segmentation gives rough occlusion information, i.e., it
indicates whether some foreground moving object is occluding another body. This information
helps in data association, particularly for features belonging to a background body which are
predicted to lie inside the convex hull created from the feature points of a foreground moving
object. These occluded features are not associated, and are kept until they emerge from
occlusion.
d) Backward Match and Unicity Constraint: When a match is found, we try to
match that feature backward in the original image. Only matches in which each point is the
other's strongest match are kept. Enforcing the unicity constraint amounts to keeping only the
single strongest of several matches for a single feature in the other image.
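The backward-match and unicity checks can be sketched as follows (descriptors reduced to plain vectors and zero-mean SSD as the distance; names hypothetical):

```python
# Hypothetical sketch of mutual-best matching with a zero-mean SSD score.
# Only pairs where each point is the other's strongest match survive, which
# enforces the backward-match and unicity constraints together.

def zssd(a, b):
    """Zero-mean SSD between two equal-length intensity vectors."""
    ma = sum(a) / len(a)
    mb = sum(b) / len(b)
    return sum(((x - ma) - (y - mb)) ** 2 for x, y in zip(a, b))

def mutual_best_matches(desc1, desc2):
    """Return index pairs (i, j) where i's best match is j and vice versa."""
    best12 = [min(range(len(desc2)), key=lambda j: zssd(d, desc2[j]))
              for d in desc1]
    best21 = [min(range(len(desc1)), key=lambda i: zssd(desc1[i], d))
              for d in desc2]
    return [(i, j) for i, j in enumerate(best12) if best21[j] == i]
```

The mean subtraction in `zssd` is what gives the score its resilience to uniform lighting changes.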
2.3 Dense Feature Matching
Apart from the above sparse local feature based strategies, we can also obtain feature
correspondences between images through dense, global energy based methods. This is
becoming more popular [34, 57] and feasible these days because of a marked improvement
in state-of-the-art optical flow methods [7, 53, 63], both in terms of speed and accuracy. Due
to the filling-in effect, optical flow provides a dense flow field, and thus a huge number of
correspondences, which can increase the robustness of the motion estimation process for scenes
with many texture-less surfaces, such as indoors with big white walls. Newcombe et al. [34]
use a GPU-based implementation of optical flow [63] for dense realtime 3D reconstruction.
Recently, [57] evaluated the use of dense optical flow against standard sparse feature tracking
for the computation of the fundamental matrix.
For our system of multibody motion segmentation and reconstruction, the extra dense
correspondences available can be used for obtaining a dense segmentation of the moving objects
in the scene. A subset of these dense correspondences, selected on the basis of a distinctiveness
score, can be used by the motion estimation module. This will also help the reconstruction of
small moving objects, which suffer from a dearth of proper features.
Chapter 3
Moving Object Detection and Segmentation
In this chapter, we present an incremental motion segmentation framework that can segment
feature points belonging to different motions and maintain the segmentation with time. The
solution to the moving object detection and segmentation problem will act as a bridge between
the static SLAM or SfM and its counterpart for dynamic environments. But, motion detection
from a freely moving monocular camera is an ill-posed problem and a difficult task. The
moving camera causes every pixel to appear moving. The apparent pixel motion of points
is a combined effect of the camera motion, independent object motion, scene structure and
camera perspective effects. Different views resulting from the camera motion are connected by
a number of multiview geometric constraints, which can be used for the motion detection task.
Points inconsistent with these constraints can be labeled as moving or as outliers.
Apart from the multibody motion segmentation framework, we also present an algorithm
for independently moving object detection. This is a special case of the multibody segmentation
framework, wherein only the static background is used for structure and motion estimation;
anything moving independently is detected as a moving object. We also present the case
when a monocular camera is mounted on a robot and its odometry, rather than the visual
SLAM routine, is used for estimating egomotion.
3.1 Related Works
The problem of motion detection and segmentation from a moving camera has been a very
active research area in the computer vision community. The multiview geometric constraints
used for motion detection can be loosely divided into four categories. The first category of
methods relies on estimating a global parametric motion model of the background. These
methods [20, 39, 61] compensate for camera motion using a 2D homography or an affine
motion model; pixels consistent with the estimated model are assumed to be
background and outliers to the model are defined as moving regions. However, these models
are approximations which hold only for certain restricted cases of camera motion and scene
structure.
The problems with 2D homography methods led to plane-parallax based constraints [18,
43, 66]. The “planar-parallax” constraints represent the scene structure by a residual
displacement field, termed parallax, with respect to a 3D reference plane in the scene. The
plane-parallax constraint was designed to detect residual motion as an after-step of the 2D
homography methods. These methods are designed to detect motion regions when dense
correspondences between small-baseline camera motions are available. Moreover, all the
planar-parallax methods are ineffective when the scene cannot be approximated by a plane.
Though the planar-parallax decomposition can be used for egomotion estimation [19] and
structure recovery [42], the traditional multi-view geometry constraints, like the epipolar
constraint in two views or trilinear constraints in three views and their extensions to N views,
have proved to be much more effective in scene understanding, as in SfM or visual SLAM.
These constraints are well understood and are now textbook material [15, 30, 17].
3.2 Requirements
A robust moving object detection and segmentation algorithm is fundamental for efficient
SLAM in a dynamic environment. A successful algorithm for motion detection and segmentation
to aid visual SLAM should ideally satisfy the following requirements:
a) Incremental solution: We need an incremental solution, where the moving objects are
detected and segmented as early as possible, while the segmentation hypothesis gets updated
with new frames. This is fundamental to visual SLAM, which demands an incremental solution,
as future frames are not available. We use 2-view motion segmentation, which is then
incrementally extended to multiple views as new frames arrive. In contrast to batch processing
over a fixed number of frames, the proposed approach allows detection of objects moving
at different relative speeds. The incremental nature of the solution also allows the individual
modules of motion segmentation, feature tracking and visual SLAM to be interleaved, which in
turn allows them to benefit from one another.
b) No restrictive assumptions: Many moving object detection methods make restrictive
assumptions on scene structure, camera models or camera motion. For example, the methods
of [18, 28, 66] assume a planar scene, whereas [39, 20, 61] consider an affine camera model.
However, these assumptions are often invalid in real environments. The proposed approach
makes no assumption about the scene structure and considers a full perspective camera model.
c) Seamless integration with existing VSLAM/SfM solutions: The existing methods
for moving object detection avoid computation of scene structure or camera egomotion.
For example, the parallax rigidity constraint of [18] or the modeling of the static background
as in [20] all perform computations which cannot be reused for scene structure or camera
egomotion estimation. The proposed motion segmentation approach makes use of epipolar
geometry, which forms the backbone of standard visual SLAM methods, and thus avoids extra
computation and can be easily integrated with existing SfM solutions.
d) Ability to handle both 2D points and 3D points whose depth is known: There
are two kinds of feature points in the system: 2D points which are yet to be triangulated, and
3D points which have been triangulated by the visual SLAM module. The knowledge of depth
adds additional constraints to be used by the motion segmentation module.
e) Ability to handle degenerate motions: Detecting independently moving objects
becomes difficult when the camera motion and the motion of the moving object are along the
same direction. The features belonging to the moving object then move along the epipolar line,
and thus the epipolar constraint is not able to detect them. This set of motions, called
degenerate motions [66], is very common in the real world, e.g., a camera following another
car moving along the road, or a robot-mounted camera following a moving person. Standard
visual SLAM systems detect outliers with the help of the reprojection error, which is roughly
equivalent to a measure of epipolar distance, so they are not able to detect degenerate motions.
3.3 Overview
The input to the motion segmentation framework consists of the feature tracks from the
feature tracking module, the camera's relative motion with reference to each reconstructed
body from the visual SLAM module, and the previous segmentation. The motion segmentation
module needs to verify the existing segmentation, and also to associate new features with one
of the moving objects. As new frames arrive, the number of independently moving objects
changes: objects enter or leave the scene, part of an existing object splits off to move
independently, or, in the reverse case, two independent motions merge. The motion segmentation
framework therefore needs to detect changes in the number of moving objects and update
accordingly.

The task of the motion segmentation module is one of model selection, so as to assign these
feature tracks to one of the reconstructed bodies or to some unmodeled independent motion.
Efficient geometric constraints are used to form a probabilistic fitness score for each
reconstructed object. With each new frame, existing features are tested for model fitness, and
unexplained features are assigned to one of the independently moving objects. But before all
this, the motion segmentation must be initialized, which is described next.
3.4 Initialization of Motion Segmentation
The initialization routine for motion segmentation and visual SLAM is somewhat different
from the rest of the algorithm. We make use of the algebraic two-view multibody motion
segmentation algorithm of RAS [40] to segment the input set of feature trajectories into
multiple moving objects. The reason behind the choice of [40] among other algorithms is its
direct, non-iterative nature and fast computation. This segmentation provides the system with
the choice of motion bodies for reconstruction. For the segment chosen for reconstruction, an
initial 3D structure and camera motion is computed via epipolar geometry estimation, as part
of the static-scene visual SLAM initialization routine.
3.5 Geometric Constraints
Between any two frames, the camera motion with respect to each reconstructed body is
obtained from the visual SLAM module. The geometric constraints are then estimated to detect
independent motion with respect to that reconstructed body. Thus, with respect to the static
background, all moving objects should be detected as independent motions.
For a camera moving relative to a scene, the fundamental matrix is given by F = [Kt]× KRK^{-1},
where K is the intrinsic matrix of the camera and R, t are the rotation and translation of the
camera between the two views.
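As a numerical sketch of this formula (pure-Python 3×3 helpers; all names hypothetical), one can verify that a static point projected into both views satisfies the epipolar constraint x₂ᵀF x₁ ≈ 0:

```python
# Sketch: build F = [Kt]x K R K^{-1} for a simple pinhole K and check the
# epipolar constraint on a static point. Helper names are hypothetical.

def mat3_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def mat3_vec(A, v):
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]

def skew(t):  # [t]x, the cross-product matrix
    return [[0, -t[2], t[1]], [t[2], 0, -t[0]], [-t[1], t[0], 0]]

def pinhole(f, cx, cy):  # K and its closed-form inverse
    K = [[f, 0, cx], [0, f, cy], [0, 0, 1]]
    K_inv = [[1 / f, 0, -cx / f], [0, 1 / f, -cy / f], [0, 0, 1]]
    return K, K_inv

def fundamental(K, K_inv, R, t):
    return mat3_mul(skew(mat3_vec(K, t)), mat3_mul(K, mat3_mul(R, K_inv)))
```

For example, with R = I, t = (1, 0, 0) and a static point X = (0.5, 0.2, 4), the projections x₁ = KX and x₂ = K(X + t) give x₂ᵀF x₁ ≈ 0.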
3.5.1 Epipolar Constraint
The epipolar constraint is the most commonly used constraint connecting two views, and is
best explained through the fundamental matrix [17]. The fundamental matrix is a relationship
between any two images of the same scene that constrains where the projections of scene
points can occur in both images. It is a 3×3 matrix of rank 2 that encapsulates the camera's
intrinsic parameters and the relative pose of the two cameras. The reprojection error, or its
first-order approximation called the Sampson error, both based on the epipolar constraint, are
used throughout the structure and motion estimation by the visual SLAM module. Basically,
they measure how far a feature lies from the epipolar line induced by the corresponding feature
in the other view. Though these are the gold standard cost functions for 3D reconstruction,
they are not good enough for independent motion detection. If a 3D point moves along the
epipolar plane formed by the two views, its projection in the image moves along the epipolar
line. Thus, in spite of moving independently, it still satisfies the epipolar constraint. This is
depicted in Fig. 3.1.
Figure 3.1 Left: The world point P moves non-degenerately to P′, and hence x′, the image of
P′, does not lie on the epipolar line corresponding to x. Right: The point P moves degenerately
in the epipolar plane to P′. Hence, despite moving, its image point lies on the epipolar line
corresponding to the image of P.
Let p_n and p_{n+1} be the images of some 3D point X in a pair of images I_n, I_{n+1}
obtained at time instants t_n and t_{n+1}. Let F_{n+1,n} be the fundamental matrix relating
the two images I_n, I_{n+1}, with I_n as the reference view. The epipolar constraint is then
represented by p_{n+1}^T F_{n+1,n} p_n = 0 [17]. The epipolar line in I_{n+1} corresponding
to p_n is l_{n+1} = F_{n+1,n} p_n. If the 3D point is static, then p_{n+1} should ideally lie
on l_{n+1}. But if the point is not static, the perpendicular distance d_epi from p_{n+1} to
the epipolar line l_{n+1} is a measure of how much the point deviates from the epipolar line.
If the coefficients of the line vector l_{n+1} are normalized, then d_epi = |l_{n+1} · p_{n+1}|.
However, when a 3D point moves along the epipolar plane formed by the two camera centers
and the point P itself, the image of P still lies on the epipolar line. So the epipolar constraint
is not sufficient for degenerate motion. Fig. 3.1 shows the epipolar geometry for non-degenerate
and degenerate motions.
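The distance d_epi can be sketched directly (hypothetical helper; the matrix and points are given as plain nested lists):

```python
# Sketch of d_epi: distance of the matched point p2 from the epipolar line
# l = F p1, after normalising the line coefficients (a, b).
import math

def epipolar_distance(F, p1, p2):
    """F: 3x3 fundamental matrix; p1, p2: homogeneous points (x, y, 1)."""
    l = [sum(F[i][k] * p1[k] for k in range(3)) for i in range(3)]  # l = F p1
    return abs(sum(l[i] * p2[i] for i in range(3))) / math.hypot(l[0], l[1])
```

A static, non-degenerate correspondence yields d_epi near zero, while a point moving off the epipolar plane yields a large d_epi.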
This kind of degenerate motion is quite common in real-world scenarios, e.g. when the camera
and an object move in the same direction, as with a camera mounted in a car moving along a
road, or a camera-mounted robot following behind a moving person. To detect degenerate
motion, we make use of the knowledge of camera motion and 3D structure to estimate a bound
on the position of the feature along the epipolar line. We describe this as the Flow Vector
Bound (FVB) constraint.
3.5.2 Flow Vector Bound (FVB) Constraint
For a general camera motion involving both rotation R and translation t, the effect of
rotation can be compensated by applying a projective transformation to the first image. This is
achieved by multiplying the feature points in view 1 with the infinite homography H = KRK^{-1} [17].
The resulting feature flow vector, connecting the feature position in view 2 to the rotation-
compensated feature position in view 1, should lie along the epipolar line. Now assume that
our camera translates by t, and let p_n, p_{n+1} be the images of a static point X. Here p_n is
normalized as p_n = (u, v, 1)^T. Attaching the world frame to the camera center of the first
view, the camera matrices for the two views are K[I|0] and K[I|t]. Also, if z is the depth of the
scene point X, then the inhomogeneous coordinates of X are zK^{-1}p_n. Now the image of X
in the second view is p_{n+1} = K[I|t]X. Solving, we get [17]

p_{n+1} = p_n + Kt/z    (3.1)
Equation 3.1 describes the movement of the feature point in the image. Starting at the point
p_n in I_n, it moves along the line defined by p_n and the epipole e_{n+1} = Kt. The extent of
the movement depends on the translation t and the inverse depth 1/z. From eq. 3.1, if we know
the depth z of a scene point, we can predict the position of its image along the epipolar line. In
the absence of any depth information, we set a plausible bound on the depth of a scene point
as viewed from the camera. Let z_max and z_min be the upper and lower bounds on the
possible depth of a scene point. We then find the image displacements along the epipolar line,
d_min and d_max, corresponding to z_max and z_min respectively. If the flow vector of a
feature does not lie between d_min and d_max, it is more likely to be an image of an
independent motion.
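As a hedged sketch of this bound (assuming the translation has no forward component after rotation compensation, so the displacement magnitude of eq. 3.1 is simply |Kt|/z; names hypothetical):

```python
# Hypothetical sketch: map depth bounds [z_min, z_max] to the image-displacement
# bounds [d_min, d_max] of the FVB constraint, assuming a purely sideways
# translation t (no forward component) after rotation compensation.
import math

def fvb_bounds(K, t, z_min, z_max):
    Kt = [sum(K[i][k] * t[k] for k in range(3)) for i in range(3)]
    mag = math.hypot(Kt[0], Kt[1])   # displacement magnitude from eq. 3.1 is |Kt|/z
    return mag / z_max, mag / z_min  # (d_min, d_max)

def violates_fvb(flow_mag, d_min, d_max):
    """A flow vector outside the bound hints at independent motion."""
    return not (d_min <= flow_mag <= d_max)
```

Note the inversion: the far bound z_max gives the small displacement d_min, and the near bound z_min gives the large displacement d_max.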
The structure estimation from the visual SLAM module helps in reducing the possible bound
on depth. Instead of setting z_max to infinity, the known depth of the background enables
setting a tighter bound, and thus better detection of degenerate motion. The depth bound is
adjusted on the basis of the depth distribution along the particular frustum.
The probability of satisfying the flow vector bound constraint, P(FVB), can be computed as

P(FVB) = 1 / (1 + ((FV − d_mean) / d_range)^{2β})    (3.2)

Here d_mean = (d_min + d_max)/2 and d_range = (d_max − d_min)/2, where d_min and d_max
are the bounds on the image displacement. The distribution function is similar to a Butterworth
bandpass filter. P(FVB) has a high value if the feature lies inside the bound given by the FVB
constraint, and the probability falls rapidly as the feature moves away from the bound. The
larger the value of β, the more rapidly it falls. In our implementation, we use β = 10.
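Eq. 3.2 can be sketched directly (hypothetical function name):

```python
# Sketch of eq. 3.2: a Butterworth-style score that stays near 1 while the flow
# magnitude lies inside [d_min, d_max] and falls off rapidly outside.

def p_fvb(flow_mag, d_min, d_max, beta=10):
    d_mean = 0.5 * (d_min + d_max)
    d_range = 0.5 * (d_max - d_min)
    return 1.0 / (1.0 + ((flow_mag - d_mean) / d_range) ** (2 * beta))
```

At the bound itself the score is exactly 0.5, and one further step of d_range beyond it the score is already below 10⁻⁴ for β = 10.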
3.6 Independent Motion Probability
In this section we describe a recursive formulation based on a Bayes filter to derive the
probability of the projected image point of a world point being classified as stationary or
dynamic. The relative pose estimation noise and the image pixel noise are bundled into a
Gaussian probability distribution over the epipolar lines, as derived in [17], denoted
EL_i = N(μ_li, Σ_li), where EL_i refers to the set of epipolar lines corresponding to image
point i, and N(μ_li, Σ_li) refers to the standard Gaussian probability distribution over this set.

Let p_n^i be the i-th point in image I_n. The probability that p_n^i is classified as stationary
is denoted as P(p_n^i | I_n, I_{n-1}) = P_{n,s}(p^i), or P_{n,s}^i in short, with the suffix s
signifying static. Then, with the Markov approximation, the recursive probability update of a
point being stationary given a set of images can be derived as

P(p_n^i | I_{n+1}, I_n, I_{n-1}) = η_s^i P_{n+1,s}^i P_{n,s}^i    (3.3)

Here η_s^i is a normalization constant that ensures the probabilities sum to one.
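A minimal sketch of the update in eq. 3.3, normalised against the competing dynamic hypothesis (names hypothetical):

```python
# Sketch of eq. 3.3: the prior belief that a feature is static is multiplied by
# the per-frame static likelihood and renormalised against the dynamic branch.

def bayes_update(prior_static, lik_static, lik_dynamic):
    s = prior_static * lik_static
    d = (1.0 - prior_static) * lik_dynamic
    return s / (s + d)  # the constant eta makes the two hypotheses sum to one
```

Repeated frames with a low static likelihood quickly drive the belief down, which is the behaviour exploited in Sec. 3.8.3.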
The term P_{n,s}^i can be modeled to incorporate the distribution of the epipolar lines EL_i.
Given an image point p_{n-1}^i in I_{n-1} and its corresponding point p_n^i in I_n, the
epipolar line that passes through p_n^i is determined as l_n^i = e_n × p_n^i. The probability
distribution of the feature point being stationary or moving due to the epipolar constraint is
defined as

P_{EP,s}^i = (1/√(2π|Σ_l|)) exp(−(1/2)(l_n^i − μ_n^i)^T Σ_l^{-1} (l_n^i − μ_n^i))    (3.4)
However, this does not take into account the misclassification arising due to the degenerate
motion explained in the previous sections. To overcome this, the eventual probability is fused
as a combination of the epipolar and flow vector bound constraints:

P_{n,s}^i = α · P_{EP,s}^i + (1 − α) · P_{FVB,s}^i    (3.5)

where α balances the weight of each constraint. A χ² test is performed to detect whether the
epipolar line l_n^i due to the image point satisfies the epipolar constraint. When the epipolar
constraint is not satisfied, α takes a value close to 1, rendering the FVB probability
inconsequential. As the epipolar line l_n^i begins indicating a strong likelihood of satisfying the
epipolar constraint, the role of the FVB constraint is given more importance, which helps
detect the degenerate cases.
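A hedged sketch of the fusion in eq. 3.5; the gate value 3.84 is the 95% χ² threshold for one degree of freedom, while the two α weights below are purely illustrative (the thesis does not specify their values):

```python
# Hypothetical sketch of eq. 3.5: when the epipolar residual fails a chi-square
# gate, alpha -> 1 and the epipolar term dominates; otherwise the FVB term is
# given more weight. The 0.9 / 0.3 weights are illustrative, not from the thesis.

def fused_static_prob(p_ep, p_fvb, epi_residual_sq, gate=3.84):
    alpha = 0.9 if epi_residual_sq > gate else 0.3
    return alpha * p_ep + (1.0 - alpha) * p_fvb
```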
An analogous set of equations characterizes the probability of an image point being dynamic;
these are not delineated here for brevity. In our implementation, the envelope of epipolar
lines [17] is generated by a set of F matrices distributed around the mean R, t transformation
between two frames, as estimated by the visual SLAM module. Hence a set of epipolar lines
corresponding to those matrices is generated and characterized by the sample set
EL_ss^i = (l_1^i, l_2^i, ..., l_q^i) and the associated probability set
P_EL = (w_{l_1^i}, w_{l_2^i}, ..., w_{l_q^i}), where each w_{l_j^i} is the probability of that line
belonging to the sample set EL_ss^i, computed through the usual Gaussian procedures. Then
the probability that an image point p_n^i is static is given by:

P_{n,s}^i = Σ_{j=1}^{q} [α_j · P_{EP,s}^{l_j^i}(p_n^i) + (1 − α_j) · P_{FVB,s}^{l_j^i}(p_n^i)] · w_{l_j^i}    (3.6)

where P_{EP,s}^{l_j^i} and P_{FVB,s}^{l_j^i} are the probabilities of the point being stationary
due to the respective constraints, computed with respect to the epipolar line l_j^i.
3.7 Clustering Unmodeled Motions
Features with high probabilities of being dynamic are either outliers or belong to potential
moving objects. Since these objects are often small and highly dynamic, they are very hard to
reconstruct. Instead, we adopt a simple move-in-unison model for them. Spatial proximity and
motion coherence are used to cluster these feature tracks into independently moving entities.
For motion coherence, we use the heuristic that the variance of the distances between features
belonging to the same object should change slowly in comparison.
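The move-in-unison grouping can be sketched as a connected-components pass over dynamic features that are spatially close and have coherent flow (thresholds and names illustrative):

```python
# Hypothetical sketch of the move-in-unison clustering: dynamic features are
# linked when they are spatially close and their flow vectors agree; connected
# components (via union-find) become candidate moving objects.
import math

def cluster_dynamic(points, flows, dist_thr=30.0, flow_thr=2.0):
    n = len(points)
    parent = list(range(n))

    def find(i):  # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            close = math.dist(points[i], points[j]) < dist_thr
            coherent = math.dist(flows[i], flows[j]) < flow_thr
            if close and coherent:
                parent[find(i)] = find(j)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

The flow-coherence check is what keeps two spatially close but independently moving objects in separate clusters.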
3.8 Computation of Fundamental Matrix from Odometry
If our camera is mounted on a robot, we can make use of the readily available robot odometry
to get the relative rotation and translation of the camera between a pair of captured images.
It is also common to fuse this with the egomotion information obtained from the visual SLAM
module for better accuracy. In our experiments with indoor robots (Pioneer P3DX), we found
robot odometry alone to be good enough for our task. Also, since we only make use of the
relative pose information between a pair of views, the incrementally growing odometry error
does not creep into the system. The following two sections discuss the main issues that come
up when the camera motion is estimated from odometry.
3.8.1 Synchronization
To correctly estimate the camera motion between a pair of frames, it is important to have
the correct odometry information of the robot at the instant when a frame is grabbed by the
camera. However, the images and odometry information are obtained from independent channels
and are not synchronized with each other. For FireWire cameras, an accurate timestamp for
each captured image can be easily obtained. Odometry information from the robot is stored
against time; by interpolating between these records, we can find where the robot was at any
particular point in time. Thus synchronization is achieved by interpolating the robot odometry
to the timestamps of the images obtained from the camera.
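The interpolation step can be sketched as follows (odometry samples as (t, x, y, θ) tuples; names hypothetical):

```python
# Sketch of timestamp synchronisation: odometry samples (t, x, y, theta) are
# linearly interpolated to the image timestamp, with the heading wrapped so
# the shorter angular path is taken.
import math

def interpolate_odometry(samples, t_img):
    """samples: time-sorted list of (t, x, y, theta); returns pose at t_img."""
    for (t0, x0, y0, a0), (t1, x1, y1, a1) in zip(samples, samples[1:]):
        if t0 <= t_img <= t1:
            u = (t_img - t0) / (t1 - t0)
            da = math.atan2(math.sin(a1 - a0), math.cos(a1 - a0))  # shortest arc
            return (x0 + u * (x1 - x0), y0 + u * (y1 - y0), a0 + u * da)
    raise ValueError("image timestamp outside odometry record")
```

Linear interpolation is adequate when the odometry rate is much higher than the frame rate; otherwise a motion-model-based prediction would be preferable.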
3.8.2 Robot-Camera Calibration
The robot motion is transformed to the camera frame to obtain the camera motion between
two views. The transformation from the robot frame to the camera frame was obtained through
a calibration process similar to Procedure A described in [59]. A calibration object such as
a chessboard is used, and a coordinate frame is fixed to it. The transformation of this frame to
the world frame is known and described as T^W_O, where O refers to the object frame and W
to the world frame. Also known are the transformation of the frame fixed to the robot center
with respect to the world frame, T^W_R, and the transformation from the camera frame to the
object frame, T^O_C, obtained through the usual extrinsic calibration routines. Then the
transformation of the camera frame with respect to the robot frame is obtained as
T^R_C = T^R_W T^W_O T^O_C. If the transformation of the calibration object from the world
frame is not easily measurable, the mobility of the robot can be used for the calibration; the
calibration in that case will be similar to the hand-eye calibration [52, 59].
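The chain T^R_C = (T^W_R)^{-1} T^W_O T^O_C can be sketched with 4×4 homogeneous transforms (names hypothetical; the rigid inverse exploits R^{-1} = R^T for rotations):

```python
# Sketch of the calibration chain: compose (T^W_R)^{-1}, T^W_O and T^O_C as
# 4x4 homogeneous transforms to obtain the camera pose in the robot frame.

def mat4_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def invert_rigid(T):
    """Inverse of a rigid transform [R | t]: [R^T | -R^T t]."""
    R = [row[:3] for row in T[:3]]
    t = [row[3] for row in T[:3]]
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]  # transpose
    mt = [-sum(Rt[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [Rt[0] + [mt[0]], Rt[1] + [mt[1]], Rt[2] + [mt[2]], [0, 0, 0, 1]]

def robot_camera_calibration(T_WR, T_WO, T_OC):
    return mat4_mul(invert_rigid(T_WR), mat4_mul(T_WO, T_OC))
```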
3.8.3 Preventing Odometry Noise
The top-left and top-right images of figure set 3.2 show a feature of a static point tracked
between the two images. The feature is highlighted by a red dot. The bottom figure of Fig. 3.2
depicts a set of epipolar lines, in green, generated for this tracked feature as a consequence of
modeling the noise in the camera egomotion estimate, as described in Sec. 3.6. The mean
epipolar line is shown in red. Since the feature lies away from the mean line, it is prone to
being misclassified as dynamic in the absence of a probabilistic framework. However, as it lies
on one of the green lines close to the mean line, its probability of being classified as stationary
is greater than that of being classified as dynamic. This probability increases in subsequent
images through the recursive Bayes filter update if the feature comes closer to the mean
epipolar line while lying on one of the set of lines. Note that an artificial error was induced in
the robot motion for the sake of better illustration. Also note that the two frames are separated
by a relatively large baseline; in general, stationary points do not deviate as much as shown in
the bottom figure of Fig. 3.2.
3.9 Results of Moving Object Detection
This section shows results of the motion segmentation presented in this chapter. The system
has been tested on a number of real image datasets, with varying numbers and types of moving
entities. However, we postpone most of the multibody motion segmentation results to Sec. 6.4.
In this chapter we concentrate on moving object detection from a camera mounted on a robot.
Figure 3.2 Left: A stationary feature shown in red. Middle: The same feature tracked in a subsequent image. Right: Though the feature is away from the mean epipolar line due to odometry noise, it still lies on one of the lines in the set.
Supplementary Video 1: A video showing detection of multiple people and other moving
objects while the robot moves and maneuvers around obstacles.
3.9.1 Robot-mounted Camera Sequence
We show experimental results on various test scenarios on an ActivMedia Pioneer-P3DX
mobile robot. A single IEEE 1394 FireWire camera (Videre MDCS2) mounted on the robot
was the only sensor used for the experiment. Images of resolution 320×240, captured at 30 Hz,
were processed on a standard onboard laptop.
Fig. 3.3 depicts a typical degenerate motion being detected by the system. The left and
right figures of the top row show the P3DX moving behind another robot, called MAX in our
lab. The salient features are shown in red. The left figure of the middle row shows the flow
vectors in yellow; the red dot at the tip of each yellow line is akin to an arrowhead indicating
the direction of the flow. The right figure of the middle row shows the epipolar lines in gray. It
also shows that the flow vectors on MAX move towards the epipole, while the flow vectors of
stationary features move away from it. The left figure of the bottom row shows the features
classified as moving, marked with green dots. All the features classified as moving lie on MAX,
as expected. The bottom-right figure highlights the moving regions with a green shade, formed
by a convex hull over the cluster of moving features.
Fig. 3.4 depicts motion detection when the robot is simultaneously rotating and translating.
The images in the top row were grabbed at two instants separated by 30 frames, as a person
moves in front of the rotating and translating camera. The left figure in the middle row shows
the flow vectors, while the right figure in the middle row shows the epipolar lines in gray and
the perpendicular distances of features from their expected (mean) epipolar lines in cyan.
Longer cyan lines indicate a greater perpendicular distance from the epipolar line. The left
figure in the bottom row depicts the features classified as moving in
Figure 3.3 Top Left: An image with stationary objects and a moving robot, MAX, ahead of the P3-DX. The KLT features are shown in red. Top Right: A subsequent image where MAX has moved further away. Middle Left: The flow vectors shown in yellow. Middle Right: The flow vectors of stationary features move away from the epipole, while MAX's flow vectors move closer to it. Bottom Left: Image with only the dynamic features in green. Bottom Right: Convex hull in green overlaid over the motion regions.
Figure 3.4 Top Left: An image with stationary objects and a moving person as the P3-DX rotates while translating. The KLT features are shown in red. Top Right: A subsequent image after further rotation and translation. Middle Left: The flow vectors shown in yellow. Middle Right: Flow vectors in yellow, epipolar lines in gray and perpendicular distances in cyan. Bottom Left: Features classified as dynamic, shown in green. Bottom Right: Convex hull in green overlaid over motion regions.
green, as they all lie on the moving person. The right figure of the bottom row shows the
convex hull in green formed from the clustered moving features, overlaid on the person.
3.9.2 Handheld Indoor Lab Sequence
This is an indoor sequence taken with an inexpensive hand-held camera. As the camera
moves around, moving persons enter and leave the scene. Fig. 3.5 shows the results for this
sequence. The bottom right picture in Fig. 3.5 shows how two spatially close independent
motions are clustered correctly by the algorithm. This sequence also involves a lot of degenerate
motion, as the camera and the persons move in the same direction. The 3D structure estimate
of the background helps in setting a tighter bound in the FVB constraint: the depth bound is
adjusted on the basis of the depth distribution of the reconstructed background along the
particular frustum, as explained in Sec. 3.5.2.
Figure 3.5 Results from the Indoor Lab Sequence
3.9.3 Detection of Degenerate Motions
Fig. 3.6 shows an example of degenerate motion detection: the flow vectors on the moving
person move almost along the epipolar lines, yet they are still detected thanks to the FVB
constraint. These results verify the system's performance under arbitrary camera trajectories,
degenerate motion and a changing number of moving entities.
Figure 3.6 Epipolar lines in gray; flow vectors after rotation compensation are shown in orange. Cyan lines show the distance to the epipolar line. Features detected as independently moving are shown as red dots. Note the near-degenerate independent motion in the middle and right images; the use of the FVB constraint enables efficient detection of such degenerate motion.
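The two per-feature tests used throughout these experiments, the epipolar constraint and the flow-vector-bound (FVB), can be sketched as follows. This is a minimal illustration, not the system's implementation: the fundamental matrix F, the thresholds and the flow bounds are placeholders, whereas the actual system derives the bounds from odometry and the reconstructed background depth.

```python
import numpy as np

def epipolar_distance(F, x1, x2):
    """Perpendicular distance of x2 (homogeneous pixel in image 2)
    from the epipolar line l = F @ x1 induced by x1 in image 1."""
    l = F @ x1
    return abs(l @ x2) / np.hypot(l[0], l[1])

def is_moving(F, x1, x2, epi_thresh, flow_lo, flow_hi):
    """Classify a tracked feature as independently moving if it either
    violates the epipolar constraint, or its flow magnitude falls
    outside the bound expected for any stationary point in the assumed
    depth range (the FVB test).  All thresholds are illustrative."""
    d = epipolar_distance(F, x1, x2)
    flow = np.linalg.norm(x2[:2] - x1[:2])
    return d > epi_thresh or not (flow_lo <= flow <= flow_hi)
```

The epipolar test alone misses degenerate motion along the epipolar line, which is exactly what the FVB term catches.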
3.9.4 Person detection
Some applications demand that people be explicitly distinguished from other moving objects.
We use "part-based" representations [67, 64] for person detection. The advantage of the
part-based approach is that it relies on body parts and is therefore much more robust to partial
occlusions than the standard approach of considering the whole person. We model our
implementation on [67]. Haar-feature based cascade classifiers were used to detect different
human body parts, namely the upper body, lower body, full body, and head and shoulders. These
detectors often lead to many false alarms and missed detections. The bottom-left image of
Fig. 3.7 depicts the false detections produced by these individual detectors. A probabilistic
combination [67] of the individual detectors gives a more robust person detector. But running
four Haar-feature based detectors over the whole image takes about 400 ms, which is too high
for a realtime implementation. We use knowledge of the motion regions detected by our method
to reduce the search space of the part detectors. This greatly reduces the computation: the time
taken is mostly less than 40 ms, and the detections also have fewer false positives.
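The search-space reduction itself can be sketched as follows. The part detector is injected as a callable (e.g. a Haar-cascade `detectMultiScale` call) so the sketch stays library-agnostic; the data layout and function names are illustrative, and the probabilistic fusion of the four part detectors from [67] is omitted.

```python
def detect_in_motion_regions(gray, motion_boxes, detect_parts):
    """Restrict a costly body-part detector to the bounding boxes of
    regions already segmented as moving.  `gray` is a 2D image (list of
    rows), `motion_boxes` is a list of (x, y, w, h) rectangles around
    the convex hulls of moving-feature clusters, and `detect_parts(roi)`
    returns ROI-relative (x, y, w, h) detections."""
    hits = []
    for (x, y, w, h) in motion_boxes:
        roi = [row[x:x + w] for row in gray[y:y + h]]
        for (dx, dy, dw, dh) in detect_parts(roi):
            # translate ROI-relative detections back to image coordinates
            hits.append((x + dx, y + dy, dw, dh))
    return hits
```

Because the detector now scans only a few small windows instead of the full frame, the cost drops roughly in proportion to the covered area.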
Figure 3.7 TOP LEFT: A scene involving a moving toy car and a person from the indoor sequence. TOP RIGHT: Detected moving regions are overlaid in green. BOTTOM LEFT: Haar classifier based body part detectors. BOTTOM RIGHT: Person detected by part-based person detection over image regions detected as moving.
Chapter 4
Visual SLAM Framework
Both SfM in computer vision and SLAM in mobile robotics address the same problem of
estimating sensor motion and the structure of an unknown static environment. Performing SLAM
with a single video camera, while an attractive prospect, adds its own particular difficulties
to the already considerable general challenges of the problem. In this chapter we put forward
the visual SLAM module used in our system. For each body/object chosen for reconstruction,
the visual SLAM module computes the structure of that body as well as the camera trajectory
w.r.t. that body. SfM has been studied for the last three decades, and most of the mathematical
theory is now textbook material [17, 15]. However, practical SfM and visual SLAM systems
have emerged only in the last decade, and a lot of research is yet to happen.
4.1 Related Works
Visual SLAM methods can be roughly categorized into two different approaches. The filtering
approaches of [11, 9, 49] recursively update a state vector consisting of probability distributions
over features and camera pose parameters. They employ filters like the EKF or the particle filter
to sequentially fuse measurements from all images. The second set of approaches [36, 32, 21, 47, 22]
are real-time, incremental versions of standard batch SfM. To achieve real-time performance,
bundle adjustment style optimization is performed only over a small number of past frames
selected through a sliding window [36, 32, 47], or over spatially distributed keyframes [21, 22].
The filter based approaches conventionally build a very sparse map (about 10-30 features per
frame) of high quality features, and they do not make use of any robust statistics to reject
outliers. The keyframe/bundle adjustment methods, in contrast, extract as much correspondence
information as possible and typically use robust statistics like RANSAC to eliminate
outliers. A detailed comparison between the two approaches can be found in [50].
4.2 Visual SLAM Formulation
Visual SLAM or SfM estimates the camera pose, denoted g^t_CW, and map points X_W ∈ R³
with respect to a certain world frame W, at a time instant t. The structure coordinates X_W are
assumed to be constant, i.e. static in this world frame, evident from the absence of the time
index t in their notation. In the multibody VSLAM scenario, the world frame W can be either the
static world frame S or a rigid moving object O which has been chosen for reconstruction. The
4×4 matrix g_CW contains a rotation and a translation and transforms a map point from the world
coordinate frame to the camera-centred frame C by the equation X_C = g_CW X_W. It belongs to
the Lie group of special Euclidean transformations, SE(3). The tangent space of an element of
SE(3) is its corresponding Lie algebra se(3), so any rigid transformation is minimally
parameterised as a 6-vector in the tangent space at the identity element. We denote this minimal
6-vector as ξ := (vᵀ ωᵀ)ᵀ ∈ R⁶, where the first three elements v represent the translation,
while the last three, ω, are an axis-angle representation of the rotation. The vector ξ ∈ R⁶
gives the twist coordinates of the twist matrix ξ̂ ∈ se(3). Thus a particular twist is a linear
combination of the generators of the SE(3) group, i.e.
    ξ̂ = Σ_{i=1}^{6} ξ_i G_i = [ ω̂  v ; 0  0 ],   ω̂ ∈ so(3), v ∈ R³        (4.1)
Here the ξ_i are the individual elements of ξ and the G_i are the 4×4 generator matrices which
form the basis for the tangent space of SE(3), and ω̂ is the skew-symmetric matrix obtained from
the 3-vector ω. The exponential map exp: se(3) → SE(3) maps a twist matrix to its corresponding
transformation matrix in SE(3) and can be computed efficiently in closed form. Changes in the
camera pose g_CW are obtained by pre-multiplying with a 4×4 transformation matrix in SE(3).
Thus the camera pose evolves with time as:

    g^{t+1}_CW = Δg^t g^t_CW = exp(ξ̂) g^t_CW        (4.2)
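As a concrete sketch, the closed-form exponential map and the incremental pose update of Eq. 4.2 might look as follows. This is a minimal implementation using the standard Rodrigues formula, with the (v, ω) twist ordering defined above; it is illustrative, not the system's actual code.

```python
import numpy as np

def hat(w):
    """Skew-symmetric matrix w^ of a 3-vector w."""
    return np.array([[0, -w[2], w[1]],
                     [w[2], 0, -w[0]],
                     [-w[1], w[0], 0]])

def exp_se3(xi):
    """Closed-form exponential map se(3) -> SE(3) for a twist
    xi = (v, w), v the translational and w the rotational part."""
    v, w = xi[:3], xi[3:]
    theta = np.linalg.norm(w)
    W = hat(w)
    if theta < 1e-10:                        # near-zero rotation
        R, V = np.eye(3), np.eye(3)
    else:
        A = np.sin(theta) / theta
        B = (1 - np.cos(theta)) / theta**2
        C = (1 - A) / theta**2
        R = np.eye(3) + A * W + B * W @ W    # Rodrigues' rotation formula
        V = np.eye(3) + B * W + C * W @ W    # left Jacobian of SO(3)
    g = np.eye(4)
    g[:3, :3], g[:3, 3] = R, V @ v
    return g

# incremental pose update g_{t+1} = exp(xi^) g_t  (Eq. 4.2)
g = np.eye(4)
g = exp_se3(np.array([0.1, 0.0, 0.0, 0.0, 0.0, np.pi / 2])) @ g
```

Because the update is composed in the tangent space at the identity and mapped back by exp, the pose stays on the SE(3) manifold with a minimal, singularity-free parameterisation, as the text above describes.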
The world points X_W are first transformed to the camera frame and then projected onto the
image plane using a calibrated camera projection model CamProj(·). This defines our
measurement function z as:

    z = (u, v)ᵀ = CamProj(g_CW X_W)        (4.3)
In each visual SLAM, the state vector x consists of a set of camera poses and reconstructed 3D
world points. The optimization iteratively improves the state vector x so as to minimize a sum
of squared errors between predictions and the observed data z. The incremental updates in the
optimization are calculated, as in Eq. 4.2, in the tangent space se(3) around the identity and
mapped back onto the manifold. This enables a minimal representation during optimization and
avoids singularities. Also, the Jacobians of the above equations needed in the
optimization process can be readily obtained in closed form. Due to these advantages, the Lie
theory based representation of rigid body motion is becoming popular among recent VSLAM
solutions [23, 51]. We use this Lie group formulation again for tracking the moving objects, as
described in Chapter 5.
The monocular visual SLAM framework is that of a standard bundle adjustment visual
SLAM [21, 32, 51]. A 5-point algorithm with RANSAC is used to estimate the initial epipolar
geometry, and subsequent poses are determined by camera resection. Some of the frames are
selected as keyframes, which are used to triangulate 3D points. The set of 3D points and the
corresponding keyframes are used by the bundle adjustment process to iteratively minimize
reprojection error. Bundle adjustment is initially performed over the most recent keyframes
before attempting a global optimization. Our implementation closely follows that of [21, 32].
While one thread performs tasks like camera pose estimation and keyframe decision and
addition, another back-end thread optimizes the estimate by bundle adjustment. But there are a
couple of important differences from existing SLAM methods, namely the interplay with motion
segmentation, the bearing-only object and feature tracking module, and the reconstruction of
small moving objects. These are discussed next.
4.3 Feedback from Motion Segmentation
Motion segmentation prevents independent motions from entering the VSLAM computation,
which could otherwise result in an incorrect initial SfM estimate and lead the bundle
adjustment to converge to a local minimum. The feedback results in fewer outliers in the SfM
process of a particular object. Thus the SfM estimate is better conditioned and fewer RANSAC
iterations are needed. Apart from improving the camera motion estimate, the knowledge of the
independent foreground objects coming from motion segmentation helps in the data association
of features that are currently occluded by such an object. For the foreground independent
motions, we form a convex hull around the tracked points clustered as an independently moving
entity. Existing 3D points lying inside this region are marked as not visible and are not
searched for a match. This prevents 3D features from unnecessary deletion and re-initialization
just because they were occluded by an independent motion for some time.
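This occlusion feedback can be sketched as follows, using a standard monotone-chain convex hull and a point-in-polygon test. The function names and data layout are illustrative; the real system operates on tracked feature coordinates and the projections of existing map points.

```python
def _cross(o, a, b):
    """2D cross product of OA x OB; > 0 means a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone-chain convex hull, returned counter-clockwise."""
    pts = sorted(set(map(tuple, points)))
    if len(pts) <= 2:
        return pts
    def half(seq):
        h = []
        for p in seq:
            while len(h) >= 2 and _cross(h[-2], h[-1], p) <= 0:
                h.pop()
            h.append(p)
        return h[:-1]
    return half(pts) + half(pts[::-1])

def mark_occluded(projections, moving_cluster):
    """Flag existing map points whose image projection falls inside the
    convex hull of an independently-moving feature cluster; such points
    are skipped during matching rather than deleted."""
    hull = convex_hull(moving_cluster)
    n = len(hull)
    def inside(p):
        return all(_cross(hull[i], hull[(i + 1) % n], p) >= 0
                   for i in range(n))
    return [inside(p) for p in projections]
```

Points flagged here keep their map entries and simply resume matching once the moving object passes, which is the behaviour the paragraph above describes.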
4.4 Dealing with Degenerate Configurations

In dynamic scenes, moving objects are often small compared to the field of view, and often
appear planar or exhibit very little perspective effect. Both relative pose estimation and
camera resection then face ambiguity, resulting in significant instability. During relative pose
estimation from two views, coplanar world points can cause at most a two-fold ambiguity, so
we use the 5-point algorithm over 3 views to resolve this planar degeneracy, exactly as
described in [35]. Though, theoretically, calibrated camera resection from a coplanar set of
points has a unique solution, unlike its uncalibrated counterpart, it still suffers from
ambiguity and instability, as shown in [45]. So for seemingly small and planar objects we
modified the EPnP code as in Sec. 3.4 of [26] to initialize the resection process, which is
then refined by bundle adjustment.
Chapter 5
Moving Object Tracking
A monocular camera is a projective sensor that only provides bearing information about the
scene. Moving object tracking with mono-vision is therefore a bearings-only tracking (BOT)
problem, which aims to estimate the state of a moving target comprising its 3D position and
velocity. A separate BOT filter is employed for each independently moving object. At any time
instant t, the camera only observes the bearing of a tracked feature on the moving object. We
take the moving object state to be g^t_OS ∈ SE(3), representing the 3D rigid body
transformation of the moving object O in the static world frame S. Through visual SLAM on the
static body, we already know the camera pose g^t_CS ∈ SE(3). Due to inherent non-linearity and
observability issues, the particle filter has been the preferred approach [5] for BOT. In this
chapter we develop a formulation of particle filter based BOT that integrates multiple cues
from the static world reconstruction.
We start with the simple BOT framework in the absence of any cues. Reconstruction of the
static world provides various cues which help in constraining the moving object's depth and
velocity. Sec. 5.1.3 describes how those constraints are integrated as the tracker iterates
through time.
5.1 Particle Filter based BOT
The uncertainty in the pose of the object is represented by the poses of a set of particles
g_iS and their associated weights. Each particle's state, denoted g^t_iS ∈ SE(3), represents
its pose w.r.t. S at time instant t. We continue with the Lie group preliminaries discussed in
Chap. 4. We assume an instantaneous constant velocity (CV) motion model, which is considered
the best bet and the most generic model for an unknown motion. The mean velocity between two
instants is represented by the mean twist matrix ξ̃^t_i = (1/Δt) ln(g^t_i (g^{t−1}_i)^{−1}),
where ξ̃ ∈ se(3) is the mean twist matrix associated with the mean six dimensional velocity
vector ξ ∈ R⁶. The motion
model of the particle then generates samples according to the pdf (probability density
function) p(g^{t+1}_iS | g^t_iS, ξ^t_i). Each component of the mean velocity vector has a
Gaussian error with standard deviation σ_j, j ∈ {1, …, 6}. To transform this Gaussian
distribution in R⁶ to the SE(3) space, the following procedure is used. We define a vector
α ∈ R⁶, each component α_j of which is sampled from the Gaussian N(0, σ_j²); α̂ is then the
twist matrix associated with α. The product exp(α̂) exp(ξ̃^t_i) generates samples in SE(3)
corresponding to Gaussian errors centred at the mean velocity. The dynamic model of the
particle thus generates samples that approximate the pdf given before as

    g^{t+1}_iS = exp(α̂) exp(ξ̃^t_i Δt) g^t_iS        (5.1)
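The sampling step of Eq. 5.1 can be sketched with a generic matrix exponential (here `scipy.linalg.expm`; a closed-form se(3) exponential would be used in practice). The noise standard deviations are illustrative placeholders.

```python
import numpy as np
from scipy.linalg import expm

def hat6(xi):
    """4x4 twist matrix xi^ of a 6-vector xi = (v, w)."""
    v, w = xi[:3], xi[3:]
    T = np.zeros((4, 4))
    T[:3, :3] = [[0, -w[2], w[1]], [w[2], 0, -w[0]], [-w[1], w[0], 0]]
    T[:3, 3] = v
    return T

def propagate_particle(g, xi_mean, sigma, dt, rng):
    """One constant-velocity motion-model step (Eq. 5.1): perturb the
    mean twist by per-component Gaussian noise alpha ~ N(0, sigma_j^2)
    and compose both factors through the matrix exponential."""
    alpha = rng.normal(0.0, sigma)           # Gaussian error in R^6
    return expm(hat6(alpha)) @ expm(hat6(xi_mean) * dt) @ g
```

Drawing `alpha` in the tangent space and mapping it through exp keeps every sampled pose exactly on SE(3), which is the point of the construction in the text.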
The measurement model predicts the location in the image of a particle with SE(3) pose
g^{t+1}_i as

    z^{t+1}_i = (u^{t+1}_i, v^{t+1}_i)ᵀ = CamProj(Trans(g^{t+1}_CS g^{t+1}_i))        (5.2)

Here the Trans(·) operator extracts the translation vector associated with the SE(3) pose of
the particle and CamProj(·) is the camera projection of Eq. 4.3. The weight w_i of the particle
is updated as w^{t+1}_i = (1/(√(2π) η)) exp(−(z − z_i)ᵀ(z − z_i)/(2η²)), where z is the actual
image coordinate of the feature being tracked. The particles then undergo resampling in the
usual particle filter way: particles with a higher weight have a higher probability of being
resampled.
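A sketch of this measurement update and resampling step follows; the pinhole intrinsics, the noise scale eta, and the multinomial resampling scheme are illustrative choices, not necessarily those of the implemented system.

```python
import numpy as np

def weight_update(particles_xyz, K, g_CS, z, eta):
    """Project each particle's 3D position into the image with a pinhole
    model and weight it by a Gaussian on the pixel reprojection error.
    K is the 3x3 intrinsic matrix, g_CS the 4x4 camera pose,
    z the observed pixel, eta the measurement noise scale."""
    w = np.empty(len(particles_xyz))
    for i, X in enumerate(particles_xyz):
        Xc = g_CS[:3, :3] @ X + g_CS[:3, 3]   # world -> camera frame
        u = K @ (Xc / Xc[2])                  # pinhole projection
        e = u[:2] - z
        w[i] = np.exp(-e @ e / (2 * eta**2))
    return w / w.sum()                        # normalized weights

def resample(particles, weights, rng):
    """Multinomial resampling: higher-weight particles are more likely
    to be duplicated into the next generation."""
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return [particles[i] for i in idx]
```

Particles whose projected position lands far from the tracked feature receive near-zero weight and are culled at the resampling step, concentrating the particle set along the true bearing.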
5.1.1 Ground Plane, Depth Bound & Size Bound
The structure estimate of the static world from the visual SLAM module helps in reducing the
possible bound on depth. Instead of setting the maximum depth to infinity, the known depth of
the background allows us to limit the depth of a foreground moving object. The depth bound
(DB) is adjusted on the basis of the depth distribution of static world map points along the
particular frustum of the ray. This bound gets updated as the camera moves around in the
static world. The 3D point cloud of the static world is also used to estimate the ground plane
(GP). Using the fact that most real world objects move over the ground plane, we can constrain
the velocity vector so that the object's height above the ground plane stays constant. Both
the above cues ignore the fact that we are able to track multiple features of the object. At
wrong depths, these points may be reconstructed to lie below the ground plane or unrealistically
high above it. This criterion of size and unrealistic reconstruction is used to obtain an
additional depth constraint. All these cues constrain the possible depth or velocity space.
The integration of these depth and velocity constraints into the BOT filter is discussed in
Sec. 5.1.3.
5.1.2 Initialization
Initialization is an important step for the performance of a particle filter in BOT. For a
moving object which enters the scene for the first time, particles are initialized all along
the ray starting from the camera and passing through the image point which is the projection
of a point on the dynamic object being considered. Uniform sampling is then used to initialize
the particles at various depths inside the bound [d_min, d_max] computed from the depth bound
cue described previously in Sec. 5.1.1. The velocity components are initialized in a similar
manner: at each depth, a number of particles with various velocities are uniformly sampled so
that the speeds lie inside a predetermined range [s_min, s_max] along all possible directions.
When a previously static object starts moving independently, we can do better than uniform
sampling: we initialize the depth with a normal distribution N(d, σ²), where d is the depth
estimate obtained from the point's reconstruction as part of the original body.
5.1.3 Integrating Depth and Velocity Constraints
Depth and velocity constraints play a very important role in improving tracker performance,
even in scenarios which are otherwise unobservable for a bearings-only tracker. They reduce
the space of state vectors to a constrained set denoted ψ. This can be implemented in the
motion model by sampling from a truncated density function p_s, defined as:

    p_s = { p(g^{t+1}_iS | g^t_iS, ξ^t_i)   if g^{t+1}_iS ∈ ψ
          { 0                                otherwise        (5.3)

Here the non-truncated pdf of the motion model, p(g^{t+1}_iS | g^t_iS, ξ^t_i), is evaluated
from Eq. 5.1. To draw samples from this truncated distribution, we use rejection sampling over
the distribution until the condition g^{t+1}_iS ∈ ψ is satisfied. Rejection sampling can be
inefficient, so in our implementation we restrict the number of trials; if the sample still
does not lie inside ψ, we flag the particle for a lower weight in the measurement update step.
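The capped rejection-sampling scheme for Eq. 5.3 can be sketched generically; the proposal and constraint test are injected callables, and the trial cap is an illustrative value.

```python
def sample_constrained(propose, in_psi, max_trials=20):
    """Draw from the truncated motion model (Eq. 5.3) by rejection
    sampling: re-propose until the new pose lies in the constraint set
    psi, but cap the number of trials.  If the cap is reached, keep the
    last proposal and flag it so the measurement step can down-weight
    it.  `propose()` samples the unconstrained motion model (Eq. 5.1);
    `in_psi(g)` tests membership in psi."""
    g = None
    for _ in range(max_trials):
        g = propose()
        if in_psi(g):
            return g, False      # valid constrained sample
    return g, True               # cap hit: flag for lower weight
```

Returning a flagged sample rather than looping forever keeps the per-frame cost bounded, at the price of occasionally carrying a particle that violates the constraints until resampling removes it.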
Chapter 6
Unification: putting everything together
In this chapter we discuss how the different modules of feature tracking, motion segmentation,
visual SLAM and moving object tracking are put together to build a unified multibody
reconstruction of the dynamic scene. We aim to build a unified 3D map of the dynamic world
which changes in time, and thus provides information about the moving objects in addition to
the static world. We primarily discuss the most important problem, namely the "relative scale
problem", that hinders such a unified multibody reconstruction from monocular vision. We also
present the final results of our multibody reconstruction system in this chapter.
6.1 Relative Scale Problem
By performing visual SLAM on the moving object, we obtain the camera pose g^t_CO ∈ SE(3) and
object points X_O ∈ R³ with respect to the object frame O. We also obtain the camera pose g_CS
in the static world frame S. Thus the configuration of the moving object O w.r.t. the static
world S can be obtained as g_OS = g_CS^{-1} g_CO. Expanding this equation in the homogeneous
representation we obtain:

    [ R_OS  t_OS ]   [ R_CSᵀ  −R_CSᵀ t_CS ] [ R_CO  t_CO ]
    [  0      1  ] = [  0          1      ] [  0      1  ]        (6.1)

Equating the rotation and translation parts of Eq. 6.1, we obtain R_OS = R_CSᵀ R_CO and
t_OS = R_CSᵀ t_CO − R_CSᵀ t_CS. We can obtain R_OS exactly, but from monocular SfM we can only
obtain t_CO and t_CS up to some unknown scales [17]. We can fix the scale for t_CS, i.e. for
the static background, as 1, and denote the scale for t_CO by the unknown relative scale
parameter s. Then the trajectory of the moving object is a 1-parameter family of possible
trajectories given by

    t_OS = s R_CSᵀ t_CO − R_CSᵀ t_CS        (6.2)
All of these trajectories satisfy the image observations, i.e. the projections of the world
points on the moving object are the same for all of them. This is a direct consequence of the
depth unobservability of a monocular camera. Thus, even after reconstructing a moving car, we
are not able to say whether it is a toy car moving in front of the camera or a standard car
moving on the road. So we need to estimate this relative scale, and only when the estimated
scale is close to the true scale will the reconstruction be meaningful. As in bearings-only
tracking of a moving point from a monocular camera, it is impossible to estimate the true
scale without any assumptions about the way the object moves. Ozden et al. [12] exploited the
increased coupling between camera and object translations that tends to appear at false scales,
and the resulting non-accidentalness of the object trajectory. However, their approach is
essentially batch processing: trajectory data over time is reconstructed for all possible
scales, and the trajectory which is, say, most planar is chosen by virtue of being unlikely to
occur accidentally. Moreover, the method only works when we are able to reconstruct the moving
object.
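The one-parameter family of Eqs. 6.1-6.2 can be evaluated directly from the two SfM outputs; a minimal sketch (the pose layout is the 4×4 homogeneous convention used throughout):

```python
import numpy as np

def object_pose_family(g_CS, g_CO, s):
    """Object pose in the static frame for a candidate relative scale s
    (Eqs. 6.1-6.2): the rotation R_OS is exact, while the translation
    depends linearly on the unknown scale s of the object-frame SfM."""
    R_CS, t_CS = g_CS[:3, :3], g_CS[:3, 3]
    R_CO, t_CO = g_CO[:3, :3], g_CO[:3, 3]
    g_OS = np.eye(4)
    g_OS[:3, :3] = R_CS.T @ R_CO                       # exact rotation
    g_OS[:3, 3] = s * R_CS.T @ t_CO - R_CS.T @ t_CS    # scale-dependent
    return g_OS
```

Sweeping s traces out the whole family of trajectories that project identically into the images, which is exactly the ambiguity the relative-scale estimation must resolve.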
Unlike Ozden et al. [12], we take a different approach, employing the particle filter based
BOT on a point of the moving object to solve the relative scale problem. The state of the
moving object (i.e. position and velocity) and the associated uncertainty are continuously
estimated by the tracker and are completely represented by its set of particles. The mean of
the particles is thus the best estimate of the moving point from the filtering point of view,
given the assumptions (state transition model) made in the design of the filter. When the BOT
is able to estimate the depth of a moving point with reasonable certainty, we can use this
depth to fix the relative scale and obtain a realistic multibody reconstruction. Apart from
the online nature of this solution, the BOT can also estimate the state of an object for which
reconstruction is not possible. Denote the posterior depth estimate of a point on the moving
object obtained by BOT as d_BOT, and the depth of the same point computed by the visual SLAM
on that object as d_SFM. The map points X_O and camera poses g_CO are then scaled by
s = d_BOT / d_SFM before being added to the integrated map.
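The scale-fixing step itself is a simple rescaling of the object-frame reconstruction; a sketch (function names illustrative):

```python
import numpy as np

def fix_relative_scale(points_O, poses_CO, d_bot, d_sfm):
    """Once the BOT filter's depth estimate d_bot of a tracked point is
    confident, rescale the object's SfM reconstruction so its depths
    agree: s = d_bot / d_sfm is applied to the map points and to the
    translation of every camera-in-object pose, before merging into
    the unified map."""
    s = d_bot / d_sfm
    scaled_points = [s * X for X in points_O]
    scaled_poses = []
    for g in poses_CO:
        g2 = g.copy()
        g2[:3, 3] *= s          # scale only the translation; R is exact
        scaled_poses.append(g2)
    return scaled_points, scaled_poses
```

Scaling points and pose translations by the same factor leaves all reprojections unchanged, so the rescaled object remains consistent with the images while now sitting at a metrically plausible depth.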
6.2 Feedback from SfM to BOT
For the objects chosen for reconstruction, a successful reconstruction of the moving object
from the visual SLAM module can in turn improve the bearings-only tracking (BOT). As described
in Sec. 6.1, there exists a 1-parameter family of possible solutions for the trajectory of a
moving point. Let d_SFM denote the depth of the tracked moving point from the camera in the
object frame, and d_iS the depth of the i-th particle from the camera pose in the static world
frame. Using Eq. 6.2,

    t_iS = s_i R_CSᵀ t_CO − R_CSᵀ t_CS        (6.3)
where s_i = d_iS / d_SFM. Thus for a particle at a particular depth, SfM on the moving object
gives a unique estimate of the particle translation. This information can be used during the
measurement update, and also to set the motion model for the next state transition: when SfM
estimates are available they act as a secondary observation, with observation function given by
Eq. 6.3. The measurement update computes a distance measure between the particle position
estimated from Eq. 6.3 and the position predicted by the motion model. Thus particles whose
velocity differs from that estimated by SfM, but which still lie on the projected ray, can be
assigned lower weights or rejected. For the particles which survive the resampling after this
measurement update, the motion models are set in accordance with the estimate of Eq. 6.3. Let
the twist matrix corresponding to this transformation estimate given by SfM for a particle i
be denoted ξ̂^t_{i,SFM}. The particle i is then sampled from the motion model pdf
p(g^{t+1}_iS | g^t_iS, ξ^t_{i,SFM}), which essentially generates a particle with mean

    g^{t+1}_iS = exp(ξ̂^t_{i,SFM} Δt) g^t_iS        (6.4)
Between two views, the SfM estimate obtained from the visual SLAM module thus reduces the set
of possible trajectories from all trajectories lying along the two projection rays to the
one-parameter family given by Eq. 6.2.
6.3 Dynamic 3D Occupancy Map
The output of our multibody visual SLAM system can be used to generate stochastic 3D
occupancy maps, useful for applications like robot navigation and path planning. Solving the
relative scale problem enables us to create a dynamic map of the scene containing the
structure and trajectory of both static and moving objects. Using the current state and
uncertainty estimate of a moving object as given by the BOT module, we can find its current
position with a certain probability and also predict the most likely space to be occupied by
the object in the next instant. To realize this module, we have made use of OctoMap [65], a
probabilistic 3D volumetric mapping library. While creating this volumetric map, we assume
that the camera poses are perfect, so the underlying math is the same as discussed in Chapter 9
of [55]. With each sensor observation, we update both the occupancy and non-occupancy
probability of a voxel. Whenever a new 3D feature has been triangulated and added to the 3D
point cloud, a range ray measurement with range equal to the depth of this 3D feature is added
to the probabilistic occupancy map. All voxels up to the end of this ray are considered
unoccupied. Probabilistic mapping consumes a lot of computation, but it is fundamental to any
robust solution that builds on the map.
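A sketch of one such ray update in the spirit of OctoMap's log-odds scheme follows; the voxel keying, step size and update constants are illustrative, not OctoMap's actual API or values.

```python
import numpy as np

def update_ray(log_odds, origin, endpoint, voxel_size,
               l_occ=0.85, l_free=-0.4):
    """Log-odds occupancy update for one range ray: step along the ray
    from the camera to the triangulated feature, decreasing the
    log-odds of every traversed voxel (free space) and increasing it
    for the endpoint voxel (occupied hit).  `log_odds` is a dict keyed
    by integer voxel indices."""
    direction = endpoint - origin
    n_steps = int(np.linalg.norm(direction) / voxel_size)
    for i in range(n_steps):
        p = origin + direction * (i / max(n_steps, 1))
        key = tuple((p // voxel_size).astype(int))
        log_odds[key] = log_odds.get(key, 0.0) + l_free   # traversed: free
    end_key = tuple((endpoint // voxel_size).astype(int))
    log_odds[end_key] = log_odds.get(end_key, 0.0) + l_occ  # hit: occupied
    return log_odds
```

Working in log-odds makes each observation an additive update, so repeated rays through the same voxel accumulate evidence for or against occupancy.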
6.4 Multibody Reconstruction Results
The system has been tested on a number of publicly available real image datasets with varying
numbers and types of moving entities. Details of the image sequences used in the experiments
are listed in Table 6.1. The system is implemented as threaded processes in C++. The open
source libraries TooN, OpenCV and SBA (for bundle adjustment) are used throughout the system.
The runtime of the algorithm depends on many factors, like the number of bodies being
reconstructed, the total number of independent motions being tracked by the BOT, the image
resolution and the bundle adjustment rules. The system runs in realtime at an average of 14 Hz
on a standard laptop (Intel Core i7), compared to 1 minute per frame for [37], with up to two
moving objects being simultaneously tracked and reconstructed.
Dataset Image Resolution Trajectory Length Avg. Runtime
Moving Box [62] 320x240 718 images 20Hz
Versailles Rond [10] 760x578 700 images (400m) 7Hz
New College [48] 512x384 1500 images 13Hz
CamVid [6] 480x360 (Resized) 1600 images (0.7km) 11Hz
Table 6.1 Details of the datasets
Note: the legends (Fig. 6.1) used here and in the accompanying video are slightly different
from the figures shown in the main paper. Motion segmentation results are shown by shading in
the corresponding color of the convex hull formed from the feature points segmented as
independently moving. Reconstructed 3D static world points are colored by their height above
the estimated ground plane. Trajectories and structure of moving objects are shown in a
distinct color (red/blue). Particles of the BOT filter are shown in green. The images are best
viewed on screen.
6.4.1 CamVid Sequence
We tested our system on some dynamic parts of the CamVid dataset [6]. This is a road sequence
captured by a camera mounted on a moving car. The results are highlighted in Fig. 6.2, which
shows the camera trajectory and the 3D structure of the static background. The reconstruction
and 3D trajectory of a moving car in the scene, as produced by the system, are also shown.
Note the high degree of correlation between the camera and car trajectories, which makes this
sequence challenging for both motion segmentation and relative scale estimation.
Figure 6.1 Legends used in the figures
Figure 6.2 Results on the CamVid dataset. The top image shows the output of motion segmentation. The bottom left image shows the reconstruction of the static world and a moving car at a certain instant. Particles of the BOT are shown in green and the camera trajectory is colored red. The bottom right image also shows the estimated 3D trajectory of the moving car.
6.4.2 New College Sequence
We tested our system on some dynamic parts of the New College dataset [48]. Only the left
image of each stereo pair has been used. In this sequence, the camera moves along a roughly
circular campus path, and three moving persons pass through the scene. The results are
highlighted in Fig. 6.3, which shows the map and camera trajectory with respect to the static
world and the final depth estimates from BOT. It is to be noted that this sequence, along with
most of the sequences in these experiments, is generally unobservable; it is only after
integrating the different cues that we obtain a decent estimate of the moving object locations.
Figure 6.3 Results on the New College dataset sub-sequence. The top image shows the output of motion segmentation. The bottom image shows the reconstructed map of the static world and the final estimate of the positions of the three detected moving persons. Particles of the BOT are shown in green.
6.4.3 Versailles Rond Sequence
This is an urban outdoor sequence [10] taken from a fast moving car, with multiple moving
objects entering and leaving the scene. Only the left image of each stereo pair has been used.
Fig. 6.4 shows the results of the integrated map produced by the algorithm. The middle image
shows an instance of the online occupancy map, consisting of the 3D reconstruction of two
moving cars, the corresponding BOT trackers and the most likely occupancy of the moving
objects in the next instant. The bottom of Fig. 6.4 shows the reconstructed trajectories of
the two moving cars, in red and blue.
6.4.4 Moving Box Sequence
This is the same sequence as used in [62]. A previously static box is moved in front of the
camera, which itself moves arbitrarily. Unlike [62], our method does not use any 3D model, and
thus works for any previously unseen object. As shown in Fig. 6.5, our algorithm reliably
detects the moving object purely on the basis of motion constraints. The foreground moving
box, however, is nearly white and thus provides very few features for reconstruction. This
sequence also highlights the detection of previously static moving objects: upon detection, 3D
map points lying on the moving box are deleted and their 3D coordinates are used to initialize
the BOT as described in Sec. 5.1.2.
6.5 Discussion
We have shown results for multibody visual SLAM under unobservable motion, degenerate motion,
arbitrary camera trajectories and a changing number of moving entities. This is made possible,
even in unobservable cases, by integrating multiple cues from the reconstruction pipeline. The
algorithm is also online (causal) in nature and scales to arbitrarily long sequences.
6.5.1 Comparison of different cues to BOT
Fig. 6.6 shows the improvement in bearing-only tracking for different cues. The left graph
shows the depth variance obtained for a moving car in the CamVid sequence; since it is only
tracked through the BOT, no SfM cue is available. The right graph compares the performance
for the 3rd moving car in the Versailles Rond sequence. As seen in Fig. 6.6, the feedback from
SfM has the greatest effect in reducing the uncertainty among all cues. For a particular particle
of the BOT filter, the ground-plane (GP) cue constrains the possible velocities to lie parallel
to the plane, whereas the SfM cue restricts it to a unique velocity vector for each particle
depth. The depth and size bounds perform well even for highly correlated motions.
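The geometric content of the GP and SfM cues can be sketched as follows. Function names, frames and the constant-depth assumption in the SfM cue are illustrative, not the thesis implementation: the GP cue projects a particle's velocity hypothesis onto the ground plane, while, for a hypothesized particle depth, two consecutive bearings together with the known camera translation pin down a single object velocity.

```python
import numpy as np

def apply_ground_plane_cue(velocities, plane_normal):
    """GP cue: project each particle's velocity hypothesis onto the ground
    plane, since feasible object velocities lie parallel to the plane.
    velocities : (N, 3) array of particle velocity hypotheses."""
    n = plane_normal / np.linalg.norm(plane_normal)
    return velocities - np.outer(velocities @ n, n)

def sfm_cue_velocity(bearing_prev, bearing_cur, depth, cam_translation, dt):
    """SfM cue (sketch): for a hypothesized particle depth, two consecutive
    unit bearing rays (world frame, from the respective camera centers) plus
    the known camera translation determine a unique object velocity.
    Assumes the depth is constant over the short interval dt."""
    p_prev = depth * bearing_prev                   # object position at t-1
    p_cur = cam_translation + depth * bearing_cur   # position at t
    return (p_cur - p_prev) / dt
```

This illustrates why the SfM feedback shrinks the depth variance fastest in Fig. 6.6: it removes all velocity freedom per depth hypothesis, whereas the GP cue only removes one degree of freedom.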
6.5.2 Smooth Camera Motions
Moving object tracking from a smoothly moving camera is very challenging. The motion
becomes unobservable for a naive BOT and results in very high correlation, rendering the
Figure 6.4 Results on the Versailles Rond sequence. Top: sample segmentation results from the sequence. Middle: an instance of the online occupancy map; the shaded region shows the space most likely to be occupied in the next 16 frames (around 1 s). Bottom: the reconstruction and trajectories of two moving cars.
Figure 6.5 Results for the Moving Box sequence
Figure 6.6 Comparison of different cues to the BOT, namely Depth Bound (DB), Ground Plane (GP) and SfM feedback.
methods of [38, 12] unsuitable. The left of Fig. 6.7 shows the trajectory of the 5th moving car
of the Versailles Rond sequence at three different scales. Contrary to [12], the trajectories at
wrong scales do not show any accidentalness or violation of the heading constraint, which
demonstrates the ineffectiveness of that approach for relative scale estimation from smoothly
moving cameras. Typical road scenes also involve frequent degenerate motions, which make
moving objects hard even to detect. The right image of Fig. 6.7 shows an example of degenerate
motion detection: the flow vectors on the moving person move almost along the epipolar lines,
but the features are still detected owing to the FVB constraint (Sec. 3), which is further
improved by incorporating feedback from the static world reconstruction.
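The two-view tests behind this detection can be sketched as follows. The function, thresholds, and the way the flow vector bound (FVB) is passed in are illustrative assumptions, not the thesis implementation: a feature is flagged as moving either when it clearly violates the epipolar constraint, or, in the degenerate case where it moves along the epipolar line, when it falls outside the segment of the line predicted from the known camera motion and the admissible depth range.

```python
import numpy as np

def is_moving_feature(x_prev, x_cur, F, fvb_pt_min, fvb_pt_max,
                      epi_thresh=1.0, fvb_margin=0.05):
    """Sketch of moving-feature detection in two views.

    x_prev, x_cur : homogeneous image points (3,) in the two views
    F             : fundamental matrix mapping view-1 points to lines in view 2
    fvb_pt_min/max: image points where the minimum and maximum admissible
                    depth project along the epipolar line (the FVB segment)
    """
    # Distance of x_cur from the epipolar line of x_prev.
    l = F @ x_prev
    d_epi = abs(l @ x_cur) / np.hypot(l[0], l[1])
    if d_epi > epi_thresh:
        return True  # ordinary epipolar-constraint violation
    # Degenerate case: flow lies along the line, so test the position along
    # it against the FVB segment (t in [0, 1] means inside the bound;
    # fvb_margin is a small normalized slack for measurement noise).
    a, b = fvb_pt_min[:2], fvb_pt_max[:2]
    t = (x_cur[:2] - a) @ (b - a) / ((b - a) @ (b - a))
    return t < -fvb_margin or t > 1 + fvb_margin
```

The feedback from the static-world reconstruction tightens the depth bounds, which shortens the FVB segment and makes the degenerate test more discriminative.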
Figure 6.7 LEFT: Moving object trajectory for three different scales of 0.04, 0.11 and 0.18, with 0.11 (red) being the correct scale. RIGHT: Degenerate motion detection. Epipolar lines are shown in grey; flow vectors after rotation compensation in orange. Cyan lines show the distance to the epipolar line. Detected moving features are shown as red dots.
Chapter 7
Conclusion and Future Directions
7.1 Conclusion
In this thesis, we worked towards a practical vision based Simultaneous Localization and
Mapping (SLAM) system for highly dynamic environments. Knowledge of the moving objects
in the scene is of paramount importance for any autonomous mobile robot working in real-world
scenarios. However, as discussed in the related works, there has been very little progress in that
regard. We presented a multibody visual SLAM system that adapts multibody SfM theory
along the same lines as visual SLAM adapts standard offline batch SfM. We were able to
obtain a fast incremental multibody reconstruction across long real-world sequences. Many
real-world dynamic scenes involving a smoothly moving monocular camera, and scenes with
high correlation between the camera and moving object motion (e.g. road scenes with moving
cars), which are considered unobservable even in the latest state-of-the-art systems, can now
be reconstructed with reasonable accuracy. We introduced a novel two-view geometric
constraint, capable of detecting moving objects followed by a moving camera in the same
direction, a so-called degenerate configuration where the commonly used epipolar constraint
fails. This is made possible by exploiting the knowledge of the camera motion to estimate a
bound on the image feature position along the epipolar line. A probabilistic framework
propagates the uncertainties in the system and recursively updates the probability of a feature
being stationary or dynamic. The different modules of motion segmentation, visual SLAM and
moving object tracking were integrated, and we showed how each module helps the others. We
presented a particle filter based BOT algorithm which integrates multiple cues from the
reconstruction pipeline. The integrated system can simultaneously perform real-time multibody
visual SLAM, track multiple moving objects and maintain a unified representation of them,
using only a single monocular camera. The work presented here can find immediate use in
various robotics applications involving dynamic scenes.
7.2 Future Work
I feel that the work on “multibody Visual SLAM” is still in its nascent stages, and a lot of
work needs to be done in this area by the robot vision community. I hope to move in that
direction during my PhD studies. Some of the immediate improvements that can be applied to
the system are as follows.
In this thesis, we have limited ourselves to multiview geometric constraints. These cues are
complementary to semantic information such as appearance and object detection/categorization
cues, which can be exploited for richer and better information about the scene. For example, if
object detection were performed on top of this system, more complex or object-specific motion
models (e.g. non-holonomic constraints for cars) could be used. Object categories can also
provide a priori known object sizes, which can reduce other uncertainties.
Ess et al. [14, 13] describe a mobile vision system based on a stereo camera which makes use
of appearance-based object detection in a tracking-by-detection framework to track multiple
pedestrians in a highly dynamic and challenging environment. More recently, [3] combined
semantic information with traditional SfM to obtain a better description of the environment.
One important extension is a more elegant and compact representation of the whole multibody
Visual SLAM problem in a graphical model framework. A proper model will allow us to
integrate the different modules and cues in a more elegant and efficient manner, and can also
benefit from recent advances in inference algorithms from the machine learning community.
There is also a need for improvement in the reconstruction of small moving objects like cars
or people. Standard feature tracking often does not provide enough feature tracks for proper
reconstruction. Adaptive model based tracking like [25] can lead to better results.
Bibliography
[1] S. Avidan and A. Shashua. Trajectory triangulation: 3D reconstruction of moving points
from a monocular image sequence. PAMI, 22(4):348–357, 2002.
[2] T. Bailey and H. Durrant-Whyte. Simultaneous localization and mapping (SLAM): Part
II. IEEE Robotics & Automation Magazine, 13(3):108–117, 2006.
[3] S. Bao and S. Savarese. Semantic structure from motion. CVPR, 2011.
[4] H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust features. In ECCV,
pages 404–417, 2006.
[5] T. Brehard and J. Le Cadre. Hierarchical particle filter for bearings-only tracking. IEEE
TAES, 43(4):1567–1585, 2008.
[6] G. Brostow, J. Fauqueur, and R. Cipolla. Semantic object classes in video: A high-
definition ground truth database. PRL, 30(2):88–97, 2009.
[7] T. Brox, C. Bregler, and J. Malik. Large displacement optical flow. In CVPR, 2009.
[8] M. Calonder, V. Lepetit, C. Strecha, and P. Fua. BRIEF: Binary Robust Independent
Elementary Features. In ECCV. Springer, 2010.
[9] J. Civera, A. Davison, and J. Montiel. Inverse depth parametrization for monocular SLAM.
IEEE Transactions on Robotics, 24(5):932–945, 2008.
[10] A. Comport, E. Malis, and P. Rives. Real-time Quadrifocal Visual Odometry. IJRR,
29(2-3):245, 2010.
[11] A. Davison, I. Reid, N. Molton, and O. Stasse. MonoSLAM: Real-time single camera
SLAM. PAMI, 29(6):1052–1067, 2007.
[12] K. Egemen Ozden, K. Cornelis, L. Van Eycken, and L. Van Gool. Reconstructing 3D
trajectories of independently moving objects using generic constraints. CVIU, 96(3):453–
471, 2004.
[13] A. Ess, B. Leibe, K. Schindler, and L. V. Gool. Robust multi-person tracking from a
mobile platform. PAMI, 31(10):1831–1846, 2009.
[14] A. Ess, B. Leibe, K. Schindler, and L. Van Gool. Moving obstacle detection in highly
dynamic scenes. In ICRA, 2009.
[15] O. Faugeras, Q. Luong, and T. Papadopoulo. The geometry of multiple images. MIT press,
2001.
[16] M. Fischler and R. Bolles. Random sample consensus: A paradigm for model fitting with
applications to image analysis and automated cartography. Communications of the ACM,
24(6):381–395, 1981.
[17] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge
University Press, 2004.
[18] M. Irani and P. Anandan. A unified approach to moving object detection in 2D and 3D
scenes. PAMI, 20(6):577–589, 1998.
[19] M. Irani, B. Rousso, and S. Peleg. Recovery of ego-motion using region alignment. PAMI,
19(3):268–272, 1997.
[20] B. Jung and G. Sukhatme. Real-time motion tracking from a mobile robot. International
Journal of Social Robotics, 2(1):63–78, 2010.
[21] G. Klein and D. Murray. Parallel tracking and mapping for small AR workspaces. In
ISMAR, 2007.
[22] K. Konolige and M. Agrawal. Frameslam: From bundle adjustment to real-time visual
mapping. IEEE Transactions on Robotics, 24(5):1066–1077, 2008.
[23] J. Kwon and K. Lee. Monocular SLAM with Locally Planar Landmarks via Geometric
Rao-Blackwellized Particle Filtering on Lie Groups. In CVPR, 2010.
[24] J.-P. Le Cadre and O. Tremois. Bearings-only tracking for maneuvering sources. IEEE
TAES, 34(1):179 –193, 1998.
[25] M. Leotta and J. Mundy. Vehicle surveillance with a generic, adaptive, 3d vehicle model.
PAMI, 33(7):1457 –1469, 2011.
[26] V. Lepetit, F. Moreno-Noguer, and P. Fua. Epnp: An accurate o (n) solution to the pnp
problem. IJCV, 81(2):155–166, 2009.
[27] K. Lin and C. Wang. Stereo-based Simultaneous Localization, Mapping and Moving Object
Tracking. In IROS, 2010.
[28] M. Lourakis, A. Argyros, and S. Orphanoudakis. Independent 3D Motion Detection Using
Residual Parallax Normal Flow. In ICCV, 1998.
[29] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal
of Computer Vision (IJCV), 60(2):91–110, 2004.
[30] Y. Ma, S. Soatto, and J. Kosecka. An invitation to 3-d vision: from images to geometric
models. Springer Verlag, 2004.
[31] D. Migliore, R. Rigamonti, D. Marzorati, M. Matteucci, and D. G. Sorrenti. Avoiding
moving outliers in visual SLAM by tracking moving objects. In ICRA’09 Workshop on
Safe navigation in open and dynamic environments, 2009.
[32] E. Mouragnon, M. Lhuillier, M. Dhome, F. Dekeyser, and P. Sayd. Real time localization
and 3d reconstruction. In CVPR, 2006.
[33] J. Neira, A. Davison, and J. Leonard. Guest editorial, special issue in visual slam. IEEE
T-RO, 24(5):929–931, 2008.
[34] R. Newcombe and A. Davison. Live dense reconstruction with a single moving camera. In
CVPR, 2010.
[35] D. Nister. An efficient solution to the five-point relative pose problem. PAMI, 26(6):756–
770, 2004.
[36] D. Nister, O. Naroditsky, and J. Bergen. Visual odometry. In CVPR, 2004.
[37] K. E. Ozden, K. Schindler, and L. V. Gool. Multibody structure-from-motion in practice.
PAMI, 32:1134–1141, 2010.
[38] H. S. Park, I. Matthews, and Y. Sheikh. 3d reconstruction of a moving point from a series
of 2d projections. In ECCV, 2010.
[39] S. Pundlik and S. Birchfield. Motion segmentation at any speed. In Proceedings of British
Machine Vision Conference (BMVC), 2006.
[40] S. Rao, A. Yang, S. Sastry, and Y. Ma. Robust Algebraic Segmentation of Mixed Rigid-
Body and Planar Motions from Two Views. IJCV, 2010.
[41] E. Rosten, R. Porter, and T. Drummond. Faster and better: A machine learning approach
to corner detection. PAMI, 32:105–119, 2010.
[42] H. Sawhney. 3D geometry from planar parallax. In Computer Vision and Pattern Recog-
nition, 1994.
[43] H. Sawhney, Y. Guo, and R. Kumar. Independent motion detection in 3D scenes. PAMI,
22(10):1191–1199, 2000.
[44] K. Schindler and D. Suter. Two-view multibody structure-and-motion with outliers
through model selection. PAMI, 28(6):983–995, 2006.
[45] G. Schweighofer and A. Pinz. Robust pose estimation from a planar target. PAMI, pages
2024–2030, 2006.
[46] J. Shi and C. Tomasi. Good features to track. In CVPR, pages 593–600, 1993.
[47] G. Sibley, L. Matthies, and G. Sukhatme. A Sliding Window Filter for Incremental SLAM.
Unifying Perspectives in Computational and Robot Vision, pages 103–112, 2008.
[48] M. Smith, I. Baldwin, W. Churchill, R. Paul, and P. Newman. The new college vision and
laser data set. IJRR, 28(5):595, 2009.
[49] J. Sola. Towards visual localization, mapping and moving objects tracking by a mobile
robot: a geometric and probabilistic approach. PhD thesis, LAAS, 2007.
[50] H. Strasdat, J. Montiel, and A. Davison. Real-Time Monocular SLAM: Why Filter? In
ICRA, 2010.
[51] H. Strasdat, J. Montiel, and A. Davison. Scale Drift-Aware Large Scale Monocular SLAM.
In RSS, 2010.
[52] K. Strobl and G. Hirzinger. Optimal hand-eye calibration. In IROS, 2006.
[53] D. Sun, S. Roth, and M. J. Black. Secrets of optical flow estimation and their principles.
In CVPR, 2010.
[54] S. Thrun. Robotic mapping: A survey. In Exploring Artificial Intelligence in the New
Millenium. Morgan Kaufmann, 2002.
[55] S. Thrun, W. Burgard, and D. Fox. Probabilistic Robotics. MIT Press, 2005.
[56] C. Tomasi and T. Kanade. Detection and tracking of point features. Technical Report
CMU-CS-91-132, Carnegie Mellon University, 1991.
[57] L. Valgaerts, A. Bruhn, M. Mainberger, and J. Weickert. Dense versus sparse approaches
for estimating the fundamental matrix. International Journal of Computer Vision (IJCV),
pages 1–23.
[58] R. Vidal, Y. Ma, S. Soatto, and S. Sastry. Two-view multibody structure from motion.
IJCV, 68(1):7–25, 2006.
[59] C. Wang. Extrinsic calibration of a vision sensor mounted on a robot. IEEE Trans. Robotics
and Automation, 8(2):161–175, 1992.
[60] C. Wang, C. Thorpe, S. Thrun, M. Hebert, and H. Durrant-Whyte. Simultaneous local-
ization, mapping and moving object tracking. IJRR, 26(9):889–916, 2007.
[61] J. Wang and E. Adelson. Layered representation for motion analysis. In CVPR, 1993.
[62] S. Wangsiripitak and D. Murray. Avoiding moving outliers in visual SLAM by tracking
moving objects. In ICRA, 2009.
[63] M. Werlberger, W. Trobin, T. Pock, A. Wedel, D. Cremers, and H. Bischof. Anisotropic
Huber-L1 optical flow. In Proceedings of the British Machine Vision Conference (BMVC),
September 2009.
[64] B. Wu and R. Nevatia. Detection of multiple, partially occluded humans in a single image
by bayesian combination of edgelet part detectors. In ICCV, 2005.
[65] K. M. Wurm, A. Hornung, M. Bennewitz, C. Stachniss, and W. Burgard. OctoMap: A
probabilistic, flexible, and compact 3D map representation for robotic systems. In ICRA
2010 Workshop on Best Practice in 3D Perception and Modeling for Mobile Manipulation,
2010.
[66] C. Yuan, G. Medioni, J. Kang, and I. Cohen. Detecting motion regions in the presence
of a strong parallax from a moving camera by multiview geometric constraints. PAMI,
29(9):1627–1641, 2007.
[67] Z. Zivkovic and B. Krose. Part based people detection using 2D range data and images.
In IROS, 2007.