A markerless augmented reality system using one-shot structured light
Rowan University
Rowan Digital Works
Theses and Dissertations
12-3-2015
A markerless augmented reality system using one-shot structured light
Bingyao Huang
Follow this and additional works at: https://rdw.rowan.edu/etd
Part of the Electrical and Computer Engineering Commons
Recommended Citation
Huang, Bingyao, "A markerless augmented reality system using one-shot structured light" (2015). Theses and Dissertations. 4. https://rdw.rowan.edu/etd/4
This Thesis is brought to you for free and open access by Rowan Digital Works. It has been accepted for inclusion in Theses and Dissertations by an authorized administrator of Rowan Digital Works. For more information, please contact [email protected].
A MARKERLESS AUGMENTED REALITY SYSTEM USING ONE-SHOT STRUCTURED LIGHT
by
Bingyao Huang
A Thesis
Submitted to the
Department of Electrical & Computer Engineering
College of Engineering
In partial fulfillment of the requirement
For the degree of
Master of Science in Electrical & Computer Engineering
at
Rowan University
June 30, 2015
Thesis Chair: Ying Tang
© 2015 Bingyao Huang
Acknowledgments
I would like to express my sincere gratitude to my advisor Dr. Ying Tang for the
continuous support of my Masters study and research, for her patience, motivation, enthu-
siasm, and immense knowledge. She taught me how to think critically, question thoughts
and express ideas. Her guidance helped me in all the time of research and writing of this
thesis. Besides the care in study and research, her generous support helped me pass through
the hard times when I was trying to adapt to a new life in this country. I could not have
imagined having a better advisor and mentor for my Masters study. I hope that one day I
would become as good an advisor to my students as Dr. Tang has been to me.
Besides my advisor, I would like to thank the rest of my thesis committee: Dr.
Haibin Ling and Dr. Ravi Ramachandran for their encouragement, insightful comments,
and practical advice.
My sincere thanks also goes to Mr. Karl Dyer for the generous help on the 3D-
printed camera mount and slider, without his help I would have not finished the system test
and data collection.
I am also grateful to my friend Christopher Franzwa, whose support and care helped me
overcome setbacks when I first came to this country. I greatly value his friendship.
I would also like to thank my family for the support they provided me throughout;
without your love, encouragement and editing assistance, I would not have finished
this thesis.
Abstract
Bingyao Huang
A MARKERLESS AUGMENTED REALITY SYSTEM USING ONE-SHOT STRUCTURED LIGHT
2014-2015
Ying Tang, Ph.D.
Master of Science in Electrical & Computer Engineering
Augmented reality (AR) is a technology that superimposes computer-generated 3D
and/or 2D information on the user's view of the surrounding environment in real time,
enhancing the user's perception of the real world. Regardless of the field in which an
application is deployed, or its primary purpose in the scene, many AR pipelines share
what can be distilled into two specific goals: first, range-finding the environment
(whether by knowing a depth, precise 3D coordinates, or a camera pose estimate), and
second, registration and tracking of the 3D environment, so that an environment moving
with respect to the camera can be followed. Both range-finding and tracking can be done
using a black-and-white fiducial marker (i.e., marker-based AR) or some known parameters
about the scene (i.e., markerless AR) in order to triangulate corresponding points. To
meet users' needs, range-finding, registration and tracking must satisfy certain standards
of speed, flexibility, robustness and portability. In the past few decades, AR has been
studied extensively and developed to be robust and fast enough for real-time applications.
However, most existing systems are limited to certain environments or require complicated
offline training. With the advancement of mobile technology, users expect AR to be more
flexible and portable, applicable in any uncertain environment. Based on these remarks,
this study focuses on markerless AR in mobile applications and proposes an AR system
using one-shot structured light (SL). The markerless AR system is validated in terms of
its real-time performance and ease of use in unknown scenes.
Table of Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Research Background and Objectives . . . . . . . . . . . . . . . . . . . . . . . . 1
Research Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Chapter 2: Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Augmented Reality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Display devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Camera pose estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Virtual contents registration and rendering . . . . . . . . . . . . . . . . . . . 8
Marker-based AR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Markerless-based AR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3D Object Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
System calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Acquiring 2D images from different views . . . . . . . . . . . . . . . . . . . 12
Finding correspondence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Registration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Surface construction and optimization . . . . . . . . . . . . . . . . . . . . . 12
Stereo Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Active range sensing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Structured Light . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Markerless Tracking and Pose Estimation . . . . . . . . . . . . . . . . . . . . . 18
Feature detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Feature description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Feature matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Pose estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Chapter 3: Fast 3D Reconstruction using One-Shot Spatial Structured Light . . . . . 32
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Overview of the Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3D Reconstruction Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Color-coded structured light . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Image processing for grid extraction . . . . . . . . . . . . . . . . . . . . . . 36
Color labeling and correction . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Neighbor completion using hamming distance . . . . . . . . . . . . . . . . . 39
Chapter 4: Markerless Tracking and Camera Pose Estimation . . . . . . . . . . . . . 42
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
System Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Feature Based Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Camera Tracking and Pose Estimation Using Perspective-n-Points . . . . . . . . 49
Chapter 5: Experiment and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3D Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Markerless Tracking and Pose Estimation . . . . . . . . . . . . . . . . . . . . . 55
Processing Speed Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Chapter 6: Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . 62
Discussion and Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Appendix A: Hamming-Distance-Based Amendment . . . . . . . . . . . . . . . . . . . 77
List of Figures
Figure Page
Figure 1. HMD displays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Figure 2. Marker-based AR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Figure 3. Block Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Figure 4. Range data scanning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Figure 5. Corner definition by Harris [44] . . . . . . . . . . . . . . . . . . . . . . . 22
Figure 6. Feature matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Figure 7. RANSAC used to filter matching outliers . . . . . . . . . . . . . . . . . . 31
Figure 8. System architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Figure 9. Color Grid Pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Figure 10. The proposed markerless AR system . . . . . . . . . . . . . . . . . . . . 42
Figure 11. Converting from object to camera coordinate systems . . . . . . . . . . . 44
Figure 12. System setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Figure 13. SL reconstruction process . . . . . . . . . . . . . . . . . . . . . . . . 54
Figure 14. Reconstructed point cloud and mesh . . . . . . . . . . . . . . . . . . . . 55
Figure 15. Camera tracking and pose estimation using PnP . . . . . . . . . . . . . . 57
Figure 16. Accuracy of the proposed markerless pose estimation . . . . . . . . . . . 59
Figure 17. Frame rate (fps) of the proposed method presented . . . . . . . . . . . . 60
Figure 18. Overall performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
List of Tables
Table Page
Table 1. Camera and projector intrinsic parameters . . . . . . . . . . . . . . . . . . 52
Table 2. Camera-projector pair rotation matrix . . . . . . . . . . . . . . . . . . . . 52
Table 3. Camera-projector pair translation vector . . . . . . . . . . . . . . . . . . . 54
Table 4. Accuracy of the markerless pose estimation . . . . . . . . . . . . . . . . . 58
Chapter 1
Introduction
Research Background and Objectives
We live in a three-dimensional spatial world that is perceived by our vision system, where
the objects we see are tangible and real. With the development of technology, ever more
digital information is brought to humans; however, this kind of intangible virtual
information is less intuitive to understand due to the limitations of the human vision
system and brain. Interfaces and tools have been invented and developed in an attempt to
bring us closer to this exponentially increasing digital information, one of which is
Augmented Reality (AR), which brings the real and the virtual (i.e., computer-generated
graphics/video) together. AR
has many applications in our life. Regardless of the field in which an application is
deployed, or its primary purpose in the scene, many AR pipelines share what can be
distilled into two specific goals: first, range-finding the environment (whether by
knowing a depth, precise 3D coordinates, or a camera pose estimate), and second,
registration and tracking of the 3D environment, so that an environment moving with
respect to the camera can be followed. To meet users' needs, the design of range-finding,
registration and tracking must consider speed, flexibility, robustness and portability.
In the past few decades, AR has been studied extensively and developed to be robust and
fast enough for real-time applications. But its flexibility and portability face a big
challenge, as most AR development is either limited to a certain environment or requires
complicated offline training. As mobile devices become popular, users expect a more
flexible and portable AR
that can be applied in any uncertain environment. Based on these remarks, this thesis
focuses on AR in mobile applications. In particular, we propose a 3D reconstruction
(range-finding) method using structured light (SL). We then apply this method to our
proposed AR system, which allows fast one-time range-finding, accurate real-time
tracking, and ease of use in arbitrary unknown scenes without any offline training or
add-on sensors.
Research Contribution
The goal of this research is to provide a proof of concept of a markerless AR system
that can run in real time (30 Hz) on mobile devices. In particular, the system attempts
to accomplish the range-finding component without using an a priori or comparative data
set of the scene, but rather a known projection of structured light (SL). By doing this,
range-finding can be accomplished for any unknown environment, requiring only one
instance of data regarding the scene. In addition, the proposed method can be easily
embedded on camera-equipped mobile devices (with a flash light). Our specific
contributions are as follows:
• We propose a flexible and portable AR system that can be used in any unknown
scene with no offline training.
• We propose an SL-based range-finding method to reconstruct exact 3D information
of the real world.
• We demonstrate that a mobile device with a camera and a flash light shining through
a film suffices to run our real-time markerless AR system.
Organization
This thesis is organized as follows. Chapter 2 describes the related work and techniques
in AR with a focus on markerless (mobile) AR. The general 3D reconstruction process
of our proposed AR system is presented in Chapter 3, followed by details of tracking,
camera pose estimation and graphics rendering in Chapter 4. Chapter 5 gives the testing
and evaluation of the proposed system in comparison with the ground truth. Finally, the
conclusions and future research directions are presented in Chapter 6.
Chapter 2
Literature Review
Augmented Reality
Augmented reality (AR) superimposes computer-generated 3-D graphics or 2-D informa-
tion on the user’s view of a surrounding environment in real-time, enhancing the user’s
perception of the real world. AR is both interactive and registered in 3D, and combines
real and virtual objects [17]. Milgram's Reality-Virtuality Continuum [71], defined by
Paul Milgram and Fumio Kishino, spans between the real environment and the virtual
environment, with Augmented Reality and Augmented Virtuality (AV) in between, where AR
is closer to the real world and AV is closer to a pure virtual environment.
Another definition, given by Azuma et al. [17], states that AR is not restricted to a
particular display technology such as a head-mounted display (HMD), nor limited to the
sense of sight; AR can potentially apply to all senses, augmenting smell, touch, hearing,
etc. Within this context, this thesis focuses only on computer vision-based AR, which
estimates the camera pose (location and orientation) using visual tracking and
view-geometry constraints without any add-on sensors other than a camera. 3D virtual
objects are then rendered from the same viewpoint from which the images of the real scene
are being taken by the tracking camera. In the rest of this thesis, the term AR refers to
computer vision-based Augmented Reality.
The goal of an AR application could be to complement existing information in the
scene, such as overlaying augmented text on an historic building [54, 114] or giving more
Figure 1. HMD display.
educational information in a museum [6, 18, 19] or a book [23, 55]. AR also finds
applications in image-guided surgery [33, 43, 58], reverse engineering [32], navigation
[47], advertising and commerce [2, 48]. Another purpose could be simply to draw attention
to the virtual data, as in entertainment games [11, 96]. Among all these applications,
three essential key factors account for a reliable and flexible AR system: display
devices on which virtual contents are shown to users; a tracking and camera pose
estimation mechanism such that an environment moving with respect to the camera can be
followed; and virtual content registration and rendering for proper display of the
augmented information on devices.
Display devices. The display technologies can be grouped into three categories:
head-mounted display (HMD), projective and handheld.
HMDs are display devices worn on the head or as part of a helmet that place images of a
virtual environment over the user's view of the real world (Fig. 1). An HMD can be
either video see-through or optical see-through, with a monocular or binocular display.
Video see-through HMDs provide users with a video view of both the real world and
virtual graphics. This technology requires at least one digital camera that captures the
real scene and passes it through the video display to users. Given that the real and virtual
contents are both processed as digital images, there are no synchronization issues.
Optical see-through HMDs employ an optical mirror that allows a view of the physical
world to pass through the lens while graphical overlay information is reflected into the
user's eyes, e.g., Google Glass. Since optical see-through displays combine the
physically viewed real scene with digitally displayed virtual contents, synchronizing
the two is a technical challenge.
Projective displays project virtual information directly onto a real object using video
projectors (for example, ARBlocks [86]), optical elements, holograms, radio-frequency
tags, or other tracking technologies. One potential advantage of projective displays is
that they can be used without a wearable display device. The downside is that the
necessary setup makes them inconvenient for outdoor use. Another shortcoming of this
type of display is that cameras and projectors are difficult to synchronize under
varying lighting conditions.
Handheld displays are typically mobile devices such as smart phones, tablets and watches.
These devices usually have an attached camera or are paired with other image acquisition
and processing devices to stream video to the display. Owing to recent developments in
mobile computing technology, handheld displays are highly suitable for mobile AR
applications due to their portability, mobility, ubiquity and low cost [24]. They use
video see-through techniques to superimpose virtual contents onto the real environment
and employ built-in sensors, such as digital compasses, accelerometers, gyroscopes,
barometers and/or GPS units, to enhance their six degree-of-freedom (DoF) tracking. With
the development of digital cameras, high-definition screens such as the Retina display
[111], and powerful CPUs, GPUs and RAM, smart phones and tablets are capable of real-time
AR applications with high-definition displays. However, their small displays limit
interactive 3D user experiences. A promising solution is to combine HMDs with handheld
devices, so that the HMD serves for image acquisition and display while the handheld
device serves as the computing and processing server.
Camera pose estimation. Precise geometric registration of virtual contents convinces
observers to accept the virtual content as part of the real environment rather than as a
separate or overlaid illusion of information. In most AR applications, this registration
requires a correct, real-time estimate of the user's viewpoint with respect to the
coordinate system of the virtual contents [39]. Namely, the camera pose with respect to
the real scene must be known for each frame to be rendered. The camera pose can be
estimated using an appropriate 6-DoF tracking system, where a 3D rigid transformation of
the camera localization in the scene is calculated, including 3D rotation (3-DoF) and 3D
translation (3-DoF) in the world coordinate system. This process usually involves 3D
reconstruction (or range-finding) and visual tracking. Given that the camera pose with
respect to the real scene changes over time, camera pose estimation must be performed
fast enough to support real-time tracking.
In sensor-based AR, optical (laser, IR), accelerometer, gyroscope, GPS, magnetic and
ultrasonic techniques are applied or combined to enable a reliable tracking mechanism.
In computer vision-based AR, however, the relative localization between the camera and
the real scene changes in real time, so the camera pose estimation algorithm must run
fast enough to meet real-time tracking requirements.
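The rigid transformation described above can be made concrete with a short sketch. This is not the thesis's code; the intrinsic matrix and pose values below are hypothetical, chosen only to illustrate how a 6-DoF pose [R|t] plus pinhole intrinsics K maps a 3D world point to pixel coordinates.

```python
# Minimal pose-projection sketch (illustrative values, not the thesis's system).
import numpy as np

def project_point(K, R, t, X_world):
    """Project a 3D world point into the image: x ~ K (R X + t)."""
    X_cam = R @ X_world + t          # world frame -> camera frame (rigid transform)
    x = K @ X_cam                    # apply pinhole intrinsics
    return x[:2] / x[2]              # perspective divide -> pixel coordinates

# Hypothetical intrinsics: focal length 800 px, principal point (320, 240).
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)                        # identity rotation (camera aligned with world)
t = np.zeros(3)                      # no translation
uv = project_point(K, R, t, np.array([0.0, 0.0, 2.0]))
print(uv)                            # a point on the optical axis lands at (320, 240)
```

Estimating the pose is the inverse problem: given several such 3D-2D correspondences, solve for the R and t that best explain the observed pixels.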
Virtual contents registration and rendering. This process consists of generating and
superimposing virtual contents using the estimated camera pose in real time. The virtual
contents may be modeled offline, for example characters in an AR game, or generated
online, as in the wearable outdoor navigation application [70].
Of the aforementioned three key factors, this thesis primarily focuses on tracking and
camera pose estimation in unknown circumstances. Range-finding and tracking can be done
using a fiducial marker, or using natural features or known parameters of the scene, in
order to triangulate corresponding points. The former method is referred to as
marker-based AR while the latter is known as markerless AR.
Marker-based AR. The use of fiducial markers, although important for providing a robust
tracking mechanism, presents several drawbacks. The major restriction is that markers
often cannot be attached to the objects that are to be augmented. While this problem
might be solved using two or more cameras (at least one for marker tracking and another
for image capture), doing so significantly reduces the overall tracking and registration
quality, and such an approach is not suitable for most mobile devices such as mobile
phones. Moreover, markers may place unwanted objects in the output video frames, which
can reduce visual aesthetics or get in the way of a desired view, particularly on mobile
devices where resolution and/or field of view is limited. Finally, applying markers at
all target locations to be augmented can be time-consuming in a large-scale scene, or
even impossible, for example in outdoor intangible scenarios. Thus, in a mobile scenario,
markerless tracking approaches that use natural features of the environment to be
augmented are preferable.
Figure 2. Marker-based AR example generated using ARtoolkit [51].
Markerless-based AR. Existing markerless AR applications extract pre-stored spatial
information from objects and then map invariant feature points to calculate the pose
[26, 65, 99]. Strong feature detectors are usually used to identify target points, and
the random sample consensus (RANSAC) [38] algorithm is then employed to validate the
matching of feature points. Once the features are matched, a perspective transformation
is used to estimate the camera's translation and rotation with respect to the object
from frame to frame. Markerless pose trackers rely mainly on natural feature points
(sometimes called keypoints or interest points) visible in the user's environment, which
makes real-time camera pose estimation challenging.
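The RANSAC validation step mentioned above can be sketched compactly. This is not the thesis's implementation: real systems hypothesize a homography or essential matrix from minimal samples, but the loop below uses a toy 2D-translation model (with synthetic matches) to show the hypothesize-and-verify structure.

```python
# Illustrative RANSAC sketch: separating inlier feature matches from outliers
# under a simple 2D translation model (a stand-in for a homography fit).
import numpy as np

def ransac_translation(src, dst, iters=200, thresh=2.0, seed=0):
    """src, dst: (N, 2) arrays of matched keypoint locations."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        i = rng.integers(len(src))                # minimal sample: one match
        shift = dst[i] - src[i]                   # hypothesize a translation
        err = np.linalg.norm(dst - (src + shift), axis=1)
        inliers = err < thresh                    # consensus set
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # refine on the consensus set
    shift = (dst[best_inliers] - src[best_inliers]).mean(axis=0)
    return shift, best_inliers

src = np.array([[0, 0], [10, 0], [0, 10], [5, 5], [3, 7]], float)
dst = src + np.array([4.0, -2.0])                 # true motion of the scene
dst[3] = [99.0, 99.0]                             # one gross mismatch
shift, inliers = ransac_translation(src, dst)
print(shift, inliers.sum())                       # recovers (4, -2); 4 inliers
```

The same loop scales to perspective models: only the minimal sample size and the model-fitting step change.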
A perusal of current literature provides a number of prominent markerless AR tech-
niques, which can be categorized into three types: structure from motion (SfM), Parallel
Tracking and Mapping (PTAM) and model-based systems.
SfM attempts to work on a completely unknown scene, using motion analysis and some known
camera parameters to calculate the camera pose [4, 31]. MonoSLAM [31], for example,
applies SfM to perform simultaneous localization and mapping (SLAM), where a
probabilistic 3D map is initialized using a planar target containing four known features
and then dynamically updated using an Extended Kalman Filter (EKF).
PTAM system [56] uses a stereo pair of images of a planar scene for initialization,
and bundle adjustment is then applied to determine and update the 3D locations of interest
points.
Model-based methods, considered robust against noise and efficient, make use of models
of the tracked objects, such as CAD models or 2D templates of items with distinguishable
features [120] (e.g., lines, circles, cylinders and spheres) [27, 119]. Once a
connection is made between the 2D image and the 3D world frame, the camera pose can be
calculated by projecting the feature coordinates in 3D to 2D image coordinates. An
optimal pose estimate is found by minimizing the distance between the projected 2D
coordinates and the corresponding observed 2D feature coordinates. [27] describes a
real-time 3D model-based tracking algorithm for a "video see-through" monocular vision
system; it introduces a tracking algorithm that uses implicit 3D information to predict
hidden movement of the object, improving robustness and performance. [115] proposes an
illumination-insensitive model-based 3D object tracking system, where a Kalman filter is
used to recursively refine the position and orientation. [85] applies a hybrid
(edge-based and texture-based) tracking system for outdoor augmented reality, where the
system operates at about 15-17 frames per second on a handheld device. Recently, [106]
showed that deformable surfaces such as paper sheets, t-shirts and mugs can also serve
as potential augmentable surfaces. However, there is still room to improve robustness
and minimize computational costs.
These methods accomplish tracking in parallel with range-finding. They either require an
initialization stage to build the 3D model of the target object or require additional
hardware for image processing to achieve real-time AR. For instance, [27] acquires the
3D model of the target object a priori, making it impossible to work in uncertain
environments; [4] uses a server for online image processing and 3D point cloud
reconstruction, with Wi-Fi communication to the mobile device. In short, all of these
methods need at least some amount of spatial environment information to be known, either
a priori or over a sequence of video frames, to accomplish the actual range-finding.
Several attempts have been made to improve current markerless AR techniques for mobile
applications, but performance is still insufficient. A modified version of SURF is
proposed in [22], but it is far from real-time on mobile devices. Wagner et al. propose
a feature detection and matching system for mobile AR [109], where a modified version of
SIFT and outlier rejection algorithms speed up the computation to approximately 26 FPS.
An AR application has been adapted to the iPhone [57] using a reduced version of PTAM;
however, the method shows severely reduced accuracy and low execution speed. [103]
proposes a markerless augmented reality system for mobile devices with only a monocular
camera; to enable correct 3D-2D matching, the EPnP [61] technique is used inside a
RANSAC loop, and inlier tracking uses optical flow.
3D Object Reconstruction
A prerequisite for estimating the camera pose is knowledge of the 3D coordinates of a
reference object. This can be obtained through 3D reconstruction (range-finding), the
process of capturing the real shape and appearance of an object. 3D reconstruction has
wide applications in medical and industrial fields, and has also been used extensively
to generate models for virtual and augmented reality. The process usually involves five
major steps:
System calibration. A 3D reconstruction system consists of at least one digital cam-
era. To properly register the point cloud, the transformation between image frame, camera
frame and world frame must be known, which is expressed as camera intrinsic and extrin-
sic parameters. The process of obtaining those camera intrinsic and extrinsic parameters is
called calibration.
Acquiring 2D images from different views. To completely reconstruct the surface
of the entire object, often 8-10 images of the object from different views are needed.
Finding correspondence. A common method to get range data is to compare the 2D
images and find the correspondences between each feature point through feature matching
under camera poses constraints.
Registration. Given the perspective transformation from each view to the object, the 3D
coordinates of a point are uniquely determined by its 2D correspondences. This is often
accomplished by triangulation. The range data from each view are then transformed and
combined to form a single point cloud in the 3D world coordinate system. This whole
process is called registration.
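The triangulation step can be illustrated with a linear (DLT) solve. This is a hedged sketch with made-up camera parameters, not the thesis's calibration: given two projection matrices and one matched pixel pair, the 3D point is the null-space solution of a small linear system.

```python
# DLT triangulation sketch: recover a 3D point from one 2D-2D correspondence
# seen by two calibrated views (toy rig with assumed parameters).
import numpy as np

def triangulate(P1, P2, x1, x2):
    """P1, P2: (3, 4) projection matrices; x1, x2: (u, v) pixel correspondences."""
    A = np.vstack([x1[0] * P1[2] - P1[0],     # each view contributes two
                   x1[1] * P1[2] - P1[1],     # linear constraints on X
                   x2[0] * P2[2] - P2[0],
                   x2[1] * P2[2] - P2[1]])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                                 # null-space vector of A
    return X[:3] / X[3]                        # dehomogenize

# Toy rig: identical intrinsics, second camera shifted 1 unit along x.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.2, -0.1, 4.0])
x1 = P1 @ np.append(X_true, 1); x1 = x1[:2] / x1[2]
x2 = P2 @ np.append(X_true, 1); x2 = x2[:2] / x2[2]
print(triangulate(P1, P2, x1, x2))             # recovers the 3D point
```

With noisy correspondences the same system is solved in a least-squares sense, which is exactly what the SVD provides.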
Surface construction and optimization. Once the point cloud is created, a connected mesh
can be produced by connecting the 3D points to form edges. There are many
well-established algorithms to do so, such as the marching cubes algorithm [67], Poisson
Surface Reconstruction [53], or the method in [46] that uses the neighborhood
relationships of existing points.
In the rest of this section, three main 3D reconstruction methods are reviewed: stereo
vision, time of flight and structured light.
Stereo Vision. Stereo vision system consists of two or more calibrated cam- eras
capturing digital images of objects from different views. The captured images share some
common object information, whose sparse correspondences or dense correspondences are
used to match the prominent parts, namely corners, edges, or patch of the object, for tri-
angulation. The former one is called feature-based matching, which first extracts a set
of potentially matchable image locations using either interest operators or edge detectors,
and then searches for corresponding locations in other images using patch-based metrics.
More recent work in this area has focused on the use of extracted, highly reliable features as
seeds to grow additional matches [63]. Similar approaches have also been extended to wide
baseline multi-view stereo problems and combined with 3D surface reconstruction [42,64].
While sparse matching is still occasionally used, most stereo vision today focuses on dense correspondence, since it is required for many applications such as image-based rendering or modeling. Such dense correspondence, often called block matching, consists of the following four steps: matching cost/score computation for each block pair, cost minimization (score optimization), disparity assignment according to the minimal-cost match, and disparity refinement. Usually the matching cost is the sum of squared differences (SSD) or the correlation between two blocks from two images (Fig. 3). To obtain better results, global optimization algorithms have been studied, such as probabilistic (mean-field) diffusion [92], expectation maximization (EM) [10], graph cuts [12], and loopy belief propagation [35, 37, 104, 110]. These algorithms view a digital image as a graph model and make explicit smoothness assumptions, namely that the disparity varies smoothly between adjacent pixels. The matching is then solved as a global optimization problem.
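As a concrete illustration, the four block-matching steps can be sketched as a winner-take-all SSD matcher. This is a minimal sketch, not a production implementation: it assumes rectified grayscale images stored as NumPy arrays, and it omits the disparity refinement step and any global optimization.

```python
import numpy as np

def ssd_block_match(left, right, block=5, max_disp=16):
    """Dense disparity by winner-take-all SSD block matching.

    Assumes rectified grayscale images (corresponding rows aligned),
    with a point at column x in the left image appearing at column
    x - d in the right image. No sub-pixel refinement.
    """
    h, w = left.shape
    r = block // 2
    disp = np.zeros((h, w), dtype=np.int32)
    for y in range(r, h - r):
        for x in range(r, w - r):
            patch = left[y - r:y + r + 1, x - r:x + r + 1].astype(np.float64)
            best_cost, best_d = np.inf, 0
            for d in range(min(max_disp, x - r) + 1):
                cand = right[y - r:y + r + 1,
                             x - d - r:x - d + r + 1].astype(np.float64)
                cost = np.sum((patch - cand) ** 2)  # matching cost (SSD)
                if cost < best_cost:                # cost minimization
                    best_cost, best_d = cost, d
            disp[y, x] = best_d                     # disparity assignment
    return disp
```

A global method would replace the per-pixel winner-take-all step with an energy minimization over the whole disparity image.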
Figure 3. Block matching of Tsukuba dataset. Tsukuba dataset obtained from [93].
Active range sensing. Stereo vision uses passive video cameras, so it is also called passive 3D reconstruction. When applied to a scene with poor lighting conditions or with textureless objects, for example a plaster bust, this method usually has difficulty finding and matching features, even with the epipolar constraint. Active 3D reconstruction, on the other hand, tackles this problem well. It uses a light source to add an artificial pattern or texture to objects, thus greatly improving the precision of feature matching. This active illumination technique has been successfully applied to 3D laser scanning [34] and structured light scanning [104] with high reliability.
One of the active methods, called time-of-flight, measures the time a light signal travels between the light sensor and objects of interest. The theory behind the method is simple. The illumination source is switched on for a very short time and sends out one or more light pulses; the resulting light reaches the scene and is reflected by the objects in the field of view. The sensor then captures the reflected light. Depending on the distance between the illumination source and the object, there is a delay, T, before the camera captures the reflected light, which is a function of the distance, D, and the speed of light, c:

T = 2D/c

Thus, the distance depends only on the delay and the constant c:

D = Tc/2

where T is the delay and D is the distance from the illumination source to the object.
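The conversion is a one-liner; the only assumption in the sketch below is that the delay T is given in seconds.

```python
C_LIGHT = 299_792_458.0  # speed of light in vacuum, m/s

def tof_distance(delay_s: float) -> float:
    """One-way distance from a round-trip time-of-flight delay: D = c*T/2."""
    return C_LIGHT * delay_s / 2.0
```

For example, a 10 ns round-trip delay corresponds to roughly 1.5 m of range, which shows why time-of-flight sensors need picosecond-scale timing precision for millimeter accuracy.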
A good example of such a system is LIDAR (LIght Detection And Ranging) [7]. Similar to RADAR (RAdio Detection And Ranging), it is a remote sensing method that uses a spatially narrow laser beam to measure ranges. To account for sensor noise, [28] describes an integrated method that combines 3D super-resolution with a probabilistic scan alignment approach.
Velten et al. [107] propose a time-of-flight 3D reconstruction method that uses diffusely reflected light; the authors claim it achieves sub-millimeter depth precision and centimeter lateral precision over a 40 cm × 40 cm × 40 cm hidden volume.
Another popular method uses a laser or light stripe sensor to sweep a plane of light across the scene or objects while observing it from an offset viewpoint [29]. A stripe is observed as the intersection between the light plane and the object surface. As the stripe sweeps across the object, it deforms according to the object's surface shape. It is then a simple matter of using optical triangulation to estimate the 3D locations of all the points seen in a particular stripe (Fig. 4).
Figure 4. Range data scanning. Figure obtained from [25].

Structured Light. The structured light 3D reconstruction method is known for its robustness against noise and its unambiguous correspondences. Instead of using a single video camera, structured light uses an active illumination sensor (e.g., a projector) to project encoded stripe patterns onto objects, creating synthetic texture. Structured light follows the same routine as a stereo system if the active illumination sensor is considered an inverse camera. With respect to the correspondence problem in stereo-vision-based reconstruction, the synthetic texture helps match textureless surfaces, thus producing better correspondences between the captured features.
Projecting known encoded patterns makes the correspondence between pixels unam-
biguous and is robust against occlusions [94]. This technique has been used to produce
large numbers of highly accurate registered multi-image stereo pairs and depth maps for
the purpose of evaluating stereo correspondence algorithms and learning depth map priors
and parameters [98].
3D reconstruction using structured light has been well studied in the past few decades due to its wide applications in reverse engineering [82], augmented reality [100], medical imaging [78] and archaeological finds [69]. In terms of codification strategy, structured light can be classified into four general types: discrete spatial multiplexing, time multiplexing, continuous frequency multiplexing and continuous spatial multiplexing [89]. In terms of projection, structured light can be sequential (multiple-shot) or single-shot. More details can be found in [41].
Most of the early work focuses on a temporal approach that requires multiple patterns to be projected consecutively onto stationary objects. Obviously, such a requirement makes temporal approaches unsuitable for mobile and real-time applications.
Recently, researchers have devoted considerable effort to speeding up the data acquisition process by designing techniques that need only a handful of input images or even a single one, so-called "one-shot" 3D image acquisition. For example, Zhang et al. [117] propose a multi-pass dynamic programming algorithm to solve the multiple-hypothesis code matching problem and successfully apply it to both one-shot and spacetime methods. Ali et al. [104] model a spatial structured light system using a probabilistic graphical formulation with epipolar, coplanarity and topological constraints. They then solve the correspondence problem by finding a maximum a posteriori estimate using loopy belief propagation. Kawasaki et al. [52] use a bicolor grid and local connectivity information to achieve dense shape reconstruction. Given that the proposed technique does not involve encoding positional information into multiple pixels or color space, it provides good results even when discontinuities and/or occlusions are present. Similarly, Chen et al. [21] present a specially coded vision system in which a uniquely color-encoded pattern ensures reconstruction efficiency using local neighborhood information. To improve color detection, Fechteler and Eisert [36] propose a color classification method where the classification is made adaptive to the characteristics of the captured image, so that distortion due to environmental illumination, color cross-talk, and reflectance is well compensated.
In spite of the aforementioned developments, there is still room to further improve one-shot methods, particularly in terms of speed. For example, the method reported in [104] takes 10 iterations to converge, which costs about 3 minutes to recover a thousand intersections. Similarly, the approach in [36] takes a minute to reconstruct an object with 126,544 triangles and 63,644 vertices.
Markerless Tracking and Pose Estimation
Given that tracking and pose estimation are vital to our project, this section provides an overview of the current development in those areas and explains in detail the key steps involved in creating an appropriate computer-vision-based markerless tracking system. Unlike most marker-based tracking methods, markerless tracking is designed for uncontrolled and unknown environments, where the tracking process works without adjusting the object or the environment to be tracked, for example by placing fiducial landmarks or references. The natural features of the scene, such as textures, blobs, corners, edges, lines and circles, are used to uniquely identify the objects. In general, this markerless tracking process involves four main mechanisms: feature detection, description, matching and pose estimation. The first three steps find strong features of the object and produce description/feature vectors of the target object. The target is then tracked over the frames by matching descriptors/feature vectors. Provided the tracking information and the camera's initial pose in a 3D coordinate system, the camera's new pose with respect to the object is determined. To achieve robust pose estimation, several conditions must hold:
• Robustness: the tracking must be robust against noise and able to recover from failure.
• Performance: the tracking must be fast enough to ensure AR is performed in real time.
• Set-up time: the initialization time must be as short as possible for a practical AR
application.
To address the problems above, many markerless tracking and pose estimation methods have been proposed, all of which can be categorized into two classes: recursive tracking [103] and tracking by detection [5]. Recursive techniques start the tracking process with an initial guess or a rough estimate, and then refine or update it over time. They are called recursive because the next estimate relies on the previous one; for example, feature matching is performed on two consecutive frames. In contrast, tracking-by-detection techniques can perform frame-by-frame computation independently of previous estimates. In this case, a priori information about the environment or the objects to be tracked is needed, for example a computer-generated 3D model of the object. The feature matching is only between the current frame and the initial frame. The advantage of recursive tracking is smoothness and robustness against sudden changes, but it may result in error accumulation over the entire tracking process. While tracking by detection is free from error accumulation [112], it is much more computationally expensive, given that a classifier [3, 50] or object 3D model training [27] is needed.
Feature detection. To allow for accurate pose determination, feature detection must meet several requirements [39]:

Fast computation time. It must be possible to calculate a rather large number of feature points and associated descriptors in real time in order to allow for pose estimation at an acceptable frame rate.
Robustness against lighting, viewing angles and noise. The set of feature points to be selected and their calculated descriptors shall not vary significantly under different lighting, noise and viewing angles. Some feature detection algorithms are highly sensitive to changes in lighting and to sensor noise, since they use image intensity as features. In most AR applications, the user's position and orientation are usually not fixed. Thus, any feature points that are restricted to only one distance or viewing angle will be useless for AR. All of these effects are quite common, particularly in outdoor environments.
Scale-invariant feature choices. The scale of the captured object changes as the user moves around the scene, especially in outdoor environments, due to perspective projection. Thus, scale invariance must be retained throughout the entire AR application, namely feature points visible from one distance should not disappear when getting closer or farther, allowing for continuous tracking.
Visual markerless pose trackers mainly rely on natural features in the user's environment. Those features include blob-like image regions that have higher or lower intensities than their surroundings [30, 66, 72], edges [16], corners that are intersections of edges [14, 44, 68, 73, 87, 95, 101], and lines and circles [116]. Research has shown that corner detectors are more efficient than blob or edge detectors and allow for more precise position determination.

The majority of feature detection algorithms work by computing a corner response function C across the image. Pixels which exceed an intensity-related threshold are then classified as corners [87].
Moravec [73] computes the sum of squared differences (SSD) between a patch around a candidate corner and patches shifted a small distance from the candidate in a number of directions. C is then the smallest SSD so obtained, ensuring that extracted corners are those locations which change maximally under translations.
Harris [44] builds a Hessian matrix of an approximated second derivative of the SSD as:

H = [ Ixx  Ixy ]
    [ Ixy  Iyy ]        (2.1)
Harris then defines the corner response to be
C = |H| − k(trace(H))2 (2.2)
where the eigenvalues of H are (λ1, λ2) and
|H| = λ1λ2 (2.3)
trace(H) = λ1 + λ2 (2.4)
A window with a score C greater than a threshold is considered a "corner".
Following Harris's framework, Brown and Szeliski [14] define C as the harmonic mean of the eigenvalues (λ1, λ2) of H:

C = |H| / trace(H) = λ1 λ2 / (λ1 + λ2)        (2.5)
In a similar fashion, Shi and Tomasi [95] assume affine image deformation and con-
clude that it is better to use the smallest eigenvalue of H as the corner strength function:
Figure 5. Corner definition by Harris [44]. Figure obtained from OpenCV documentation [13].
C = min(λ1, λ2) (2.6)
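All three corner scores above can be computed from the same 2×2 matrix H. The sketch below assumes H has already been accumulated from image gradients over a window; the Harris constant k = 0.04 is a commonly used value, not one specified in the text.

```python
import numpy as np

def corner_responses(H, k=0.04):
    """Given the 2x2 matrix H = [[Ixx, Ixy], [Ixy, Iyy]] of Eq. (2.1),
    return the Harris (2.2), harmonic-mean (2.5), and Shi-Tomasi (2.6)
    corner response values."""
    det = H[0, 0] * H[1, 1] - H[0, 1] * H[1, 0]   # |H| = lambda1 * lambda2
    tr = H[0, 0] + H[1, 1]                        # trace = lambda1 + lambda2
    harris = det - k * tr ** 2                    # Eq. (2.2)
    harmonic = det / tr if tr != 0 else 0.0       # Eq. (2.5)
    shi_tomasi = np.linalg.eigvalsh(H).min()      # Eq. (2.6)
    return harris, harmonic, shi_tomasi
```

A detector would evaluate one of these scores at every pixel and keep local maxima above a threshold.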
Lowe [68] builds an image pyramid with different scales and resolutions. The scale invariance is obtained by convolving the image pyramid with a Difference of Gaussians (DoG) kernel, retaining locations which are optima in scale as well as in space.
Feature description. Once feature points are detected and extracted, how can a computer tell which feature point in one image matches a feature point in another image? Feature vectors, or descriptors, are designed to describe the attributes of a feature point by taking structural, neighborhood and region information into consideration. A feature descriptor must contain the necessary information to distinguish two different features. For a markerless AR system, feature descriptors should be scale and orientation invariant, not only allowing for fast calculation but also describing the interest point of concern as uniquely as possible in order to reduce ambiguity and mismatches.
Several different feature descriptors have been proposed in the last decade, such as SIFT [68], SURF [8] (an improved version of SIFT), and the binary descriptors BRIEF [15], ORB [88], BRISK [62], and FREAK [1].
The feature vector of SURF is almost identical to that of SIFT. SIFT describes the features using the local gradient values and orientations of pixels around the keypoint. To create a keypoint descriptor, the gradient magnitude and orientation at each image sample point in a region around the keypoint location are first calculated. These samples are then accumulated into orientation histograms summarizing the contents of 4×4 subregions. For each subregion, the length of each arrow is assigned the sum of the gradient magnitudes in or near that direction. A SIFT descriptor is thus a 128-element vector of summed gradient magnitudes. The high dimensionality makes it difficult to use the SIFT descriptor in real time, while SURF improves on SIFT by using a box filter approximation to the Gaussian derivative kernel. This convolution is sped up further using integral images to reduce the time spent on this step [91]. As described in [8], an orientation is first assigned to the keypoint when constructing a SURF descriptor. Then a square region is constructed around the keypoint and rotated according to the orientation. Similar to SIFT, the region is split up regularly into smaller 4×4 square subregions. For each subregion, a few simple features are computed at 5×5 regularly spaced sample points. The horizontal and vertical Haar wavelet responses dx and dy are computed and summed in each subregion, forming a first set of entries to the feature vector. The absolute values of the responses |dx| and |dy| are also summed, and together the four sums form a four-dimensional descriptor per subregion. Over all 4×4 subregions, this results in a vector of length 64.
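The per-subregion bookkeeping can be sketched as follows. This is a simplified illustration of the SURF descriptor layout only: it assumes the Haar responses dx and dy have already been computed on a 20×20 grid of sample points (4×4 subregions of 5×5 samples each), and the final unit-length normalization is common practice assumed here rather than a step quoted from the text.

```python
import numpy as np

def surf_like_descriptor(dx, dy):
    """Assemble a 64-D SURF-style vector from 20x20 grids of Haar
    responses: for each of the 4x4 subregions (5x5 samples each),
    stack (sum dx, sum dy, sum |dx|, sum |dy|)."""
    assert dx.shape == dy.shape == (20, 20)
    desc = []
    for i in range(4):
        for j in range(4):
            sx = dx[5 * i:5 * i + 5, 5 * j:5 * j + 5]
            sy = dy[5 * i:5 * i + 5, 5 * j:5 * j + 5]
            desc += [sx.sum(), sy.sum(),
                     np.abs(sx).sum(), np.abs(sy).sum()]
    v = np.array(desc)                 # 16 subregions x 4 entries = 64
    return v / np.linalg.norm(v)       # unit length (contrast invariance)
```

The 64-element result is what makes SURF roughly half the size of SIFT's 128-element histogram vector.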
Figure 6. Feature matching. Figure obtained from OpenCV documentation [13].
BRISK is a 512-bit binary descriptor that computes the weighted Gaussian average over a selected pattern of points near the keypoint. FREAK (Fast Retina Keypoint) is an improved version of BRISK and is also a binary descriptor. Inspired by the retinal pattern of the human eye, it samples points using weighted Gaussians in a region around the keypoint. Comparisons among these feature descriptors in terms of rotation, scaling, change of viewpoint, brightness shift and Gaussian blur are presented in [1].
Feature matching. Once a set of robust features have been detected and described,
they will be matched by comparing the distances between descriptors over the images
(Fig.6).
Feature matching is one of the most critical tasks in the entire tracking pipeline. A large number of features may describe the points better; however, the high number of potential feature correspondences results in large computational effort, slowing down the tracking process significantly. Additionally, wrong feature pairs (so-called outliers) would result in wrong pose estimates.
Feature matching by image patches (also called template or block matching) is the easiest way to find valid feature correspondences between images. Image patch matching first selects a patch of constant window size (e.g., 8×8 pixels) around a feature and then compares this patch with all patches in other images. The goal is to find the best matching patch, i.e., the one with the smallest distance under the sum of squared differences (SSD) or the sum of absolute differences (SAD). Image patches are very error-prone with respect to changing lighting conditions or viewing angles, and they are not scale or orientation invariant. Thus, image patches can only be used for recursive tracking, where error accumulation is the drawback if no error-checking mechanism is used.
To produce a reliable tracking result, scale- and orientation-invariant features such as SIFT, SURF, FAST, BRISK and FREAK are usually used. Those features can be matched by finding the two descriptors with, e.g., the smallest summed square distance, or the smallest summed Hamming distance (SHD) if the descriptor is binary, as with BRISK and FREAK. Further, the SSD or SHD should lie below a specified threshold to ensure a reliable matching result. The threshold has to be defined carefully to avoid rejecting valid feature correspondences on the one hand, and accepting too many wrong feature pairs on the other. Each type of feature detector needs its individual threshold to achieve an adequate ratio between accepted and rejected feature correspondences.
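A minimal brute-force matcher with a summed-Hamming-distance threshold might look like the sketch below. The descriptor layout (each descriptor as a row of uint8 bytes, the way OpenCV stores binary descriptors) and the threshold value are illustrative assumptions.

```python
import numpy as np

def hamming(a, b):
    """Summed Hamming distance between two binary descriptors
    stored as uint8 arrays (number of differing bits)."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

def match_binary(desc_a, desc_b, max_dist=40):
    """Brute-force nearest-neighbour matching with a distance
    threshold; returns (index_a, index_b, distance) triples."""
    matches = []
    for i, da in enumerate(desc_a):
        dists = [hamming(da, db) for db in desc_b]
        j = int(np.argmin(dists))
        if dists[j] <= max_dist:      # reject weak matches
            matches.append((i, j, dists[j]))
    return matches
```

The threshold check is exactly the per-detector tuning knob discussed above: lowering `max_dist` rejects more wrong pairs at the cost of losing some valid correspondences.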
If no information, e.g. from previous tracking iterations, can be used to reduce the number of potential feature candidates, an initial mapping can be extracted by a brute-force search. However, brute-force search is computationally intensive, with O(nm) complexity for n image features and m map features, and thus should be applied as rarely as possible. The Fast Library for Approximate Nearest Neighbors (FLANN) usually performs matching faster than brute force, even with a large database of images. This method, as detailed in [75], performs a parallel search of randomized hierarchical trees to achieve approximate matching of binary features. As discussed in [75], the hierarchical clustering tree gives a significant speedup over linear search for both binary and float descriptors.
To speed up the search for the best matching descriptors, some research uses a kd-tree [9] or applies a feature culling mechanism [39]. For example, SURF or SIFT features can be uniquely categorized into two different kinds of blob-like features [39]: dark features on a brighter background or bright features on a darker background. Thus, the matching process can be sped up by skipping the vector distance calculation whenever two features belong to different categories.
Pose estimation. Camera pose refers to the camera's location (X, Y, Z) and orientation (θx, θy, θz) in the world coordinate system, where (θx, θy, θz) are rotations around the X, Y and Z axes, respectively. Given a set of an object's 3D points O0 = (P1, P2, P3, ..., Pi) in the world coordinate system, pose estimation means finding the object's 3D pose [R|t] from a set of 2D point projections oj = (p1, p2, p3, ..., pi) at the jth frame. Note that the projection relationship defines the correspondences C = Oj ⟺ oj, that is, matches of the object's 3D points Oj detected on the camera frame oj. With a sufficient number of correspondences, the camera pose can be determined uniquely by:

oj = M Oj, (Oj = O0)        (2.7)

M = A[Rj|tj]        (2.8)
where A is the camera intrinsic matrix, and Rj and tj are the rotation matrix and translation vector that bring the camera to its current pose at the jth frame. The definitions of R and t are given in Eqs. (2.9)-(2.13).
R = RZ RY RX =

    [ r11  r12  r13 ]
    [ r21  r22  r23 ]        (2.9)
    [ r31  r32  r33 ]

RX =

    [ 1      0          0      ]
    [ 0   cos(θx)   −sin(θx)   ]        (2.10)
    [ 0   sin(θx)    cos(θx)   ]

RY =

    [  cos(θy)   0   sin(θy) ]
    [    0       1     0     ]        (2.11)
    [ −sin(θy)   0   cos(θy) ]

RZ =

    [ cos(θz)   −sin(θz)   0 ]
    [ sin(θz)    cos(θz)   0 ]        (2.12)
    [   0          0       1 ]

t =

    [ tx ]
    [ ty ]        (2.13)
    [ tz ]
The most general version of the pose estimation problem requires estimating the six degrees of freedom of the extrinsic parameters [R|t] and five intrinsic parameters in A: focal length, principal point, aspect ratio and skew. It can be solved with a minimum of 6 correspondences using the Direct Linear Transform (DLT) algorithm [97]. There are, though, several simplifications of the problem, which lead to an extensive list of different algorithms improving on the accuracy of the DLT.
In practical AR systems, several different markerless pose estimation algorithms are implemented [40, 61, 84, 97]; a comparison among them is given by [77]:
Direct Linear Transformation (DLT). These algorithms estimate the camera pose [R|t] directly, ignoring certain restrictions on the solution space.
Perspective-n-Point (PnP). The most common simplification of the DLT is the Perspective-n-Point (PnP) problem [79, 98]: given a set of correspondences C = Oj ⟺ oj between 3D points and their 2D projections, we seek to retrieve the pose (R and t) of the camera with respect to the world and the focal length F. At least n points are needed, and the camera intrinsic matrix A is usually known a priori.
A priori information estimators. Before pose estimation, some information regarding the camera pose is often available. The algorithms belonging to this class use the prior information as an initial estimate to provide more reliable and faster results.
Regardless of restrictions or assumptions, almost all algorithms share a similar routine
of the direct linear transform (DLT), which recovers a 3×4 camera matrix M; recall (2.7):

    [ M11  M12  M13  M14 ]
M = [ M21  M22  M23  M24 ] = A[R|t] =
    [ M31  M32  M33  M34 ]

    [ fx   0   cx ] [ r11  r12  r13  t1 ]
    [ 0   fy   cy ] [ r21  r22  r23  t2 ]        (2.14)
    [ 0    0    1 ] [ r31  r32  r33  t3 ]
From the perspective projection equation (2.7):

ui = (M11 Xi + M12 Yi + M13 Zi + M14) / (M31 Xi + M32 Yi + M33 Zi + M34)        (2.15)

vi = (M21 Xi + M22 Yi + M23 Zi + M24) / (M31 Xi + M32 Yi + M33 Zi + M34)        (2.16)
where pi = (ui, vi) is the projected 2D point of the object point Pi = (Xi, Yi, Zi). This system of equations can be solved linearly for the unknowns in the camera matrix M by multiplying both sides of each equation by the denominator. The resulting algorithm is called the direct linear transform (DLT) and is commonly attributed to [97]. In order to compute the 12 (or 11) unknowns in M, at least six correspondences between 3D and 2D locations must be known.
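A minimal DLT sketch follows: each correspondence contributes the two linear equations obtained by multiplying out the denominators of (2.15)-(2.16), and the stacked homogeneous system is solved for M up to scale via SVD. The function name and test values are illustrative, not taken from the thesis implementation.

```python
import numpy as np

def dlt_camera_matrix(points3d, points2d):
    """Direct linear transform: recover the 3x4 camera matrix M from
    >= 6 3D-2D correspondences. Each pair (X,Y,Z) <-> (u,v) yields two
    rows of a homogeneous system A m = 0; the least-squares solution
    (up to scale) is the right singular vector of the smallest
    singular value."""
    rows = []
    for (X, Y, Z), (u, v) in zip(points3d, points2d):
        rows.append([X, Y, Z, 1, 0, 0, 0, 0, -u*X, -u*Y, -u*Z, -u])
        rows.append([0, 0, 0, 0, X, Y, Z, 1, -v*X, -v*Y, -v*Z, -v])
    _, _, vt = np.linalg.svd(np.array(rows, float))
    return vt[-1].reshape(3, 4)   # null-space vector = entries of M
```

With noise-free correspondences the recovered M reprojects the input points exactly (up to the arbitrary overall scale of a homogeneous matrix).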
More accurate results for the entries in M can be obtained by directly minimizing Equations (2.15) and (2.16) using non-linear least squares with a small number of iterations. Once the entries in M have been recovered, it is possible to recover both the intrinsic calibration matrix A and the rigid transformation (R, t) using RQ factorization.
In the case where the camera is already calibrated, i.e., the matrix A is known, pose estimation can be performed with as few as 3 points; this method, known as P3P [40], gives noisy results. P3P provides up to four different poses due to geometric ambiguities, so further feature correspondences are needed to determine the unique pose. Several approaches have been proposed to solve the P4P, P5P or general PnP problems. The PnP problem can also be solved by an iterative method [102] or by the non-iterative efficient algorithm EPnP [61] with complexity O(n) (where n ≥ 4). The key to achieving such linear complexity is to represent each Pi as a weighted sum of m ≤ 4 control points (ω1 ... ωm) and to perform all further computation only on these points. Although more feature correspondences yield a more accurate pose, a set of detected feature correspondences usually contains invalid feature pairs due to mismatching, leading to inaccurate or even invalid pose calculations. Thus, invalid feature correspondences must be explicitly rejected. The well-known RANSAC [38] algorithm can be used to determine the subset of valid correspondences from a large set of feature pairs with outliers. RANSAC randomly chooses a few feature pairs to create a rough pose using an algorithm such as P3P, and tests this pose geometrically against all of the remaining pairs. Several iterations are then applied using different permutations of feature pairs, resulting in the pose for which most feature correspondences can be confirmed (Fig. 7). However, RANSAC does not optimize the final pose over all valid feature correspondences. As [77] explains, the quality of RANSAC degrades with an increasing ratio of outliers; when the outlier ratio reaches 60%, it is not possible to estimate a robust camera pose within a reasonable time frame. Therefore, the pose should be refined to reduce the overall error of all applied feature correspondences. One refinement method is to optimize the pose with respect to the total reprojection error. Commonly, several iterations of a nonlinear least squares algorithm, e.g. the Levenberg-Marquardt (LM) algorithm [83], are sufficient to converge to the final pose. Further, different feature correspondences may be weighted, e.g. according to their expected position accuracy, to increase the pose quality. For example, an M-estimator [83] can be applied to automatically suppress the impact of outliers or inaccurate feature positions.
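The hypothesize-and-verify loop of RANSAC can be sketched generically. To keep the example self-contained, a 2D line fit stands in for the P3P pose model; the structure (minimal random sample, geometric inlier test, final least-squares refinement in place of LM) is the same, and all names here are illustrative.

```python
import numpy as np

def ransac_line(points, iters=200, thresh=0.05, rng=None):
    """RANSAC skeleton: repeatedly fit a model to a minimal random
    sample and keep the hypothesis with the most inliers. A line
    y = a*x + b (minimal sample: 2 points) stands in for the pose
    model for brevity."""
    if rng is None:
        rng = np.random.default_rng(0)
    pts = np.asarray(points, float)
    best_inliers = np.zeros(len(pts), bool)
    for _ in range(iters):
        i, j = rng.choice(len(pts), size=2, replace=False)
        if pts[i, 0] == pts[j, 0]:
            continue                        # degenerate minimal sample
        a = (pts[j, 1] - pts[i, 1]) / (pts[j, 0] - pts[i, 0])
        b = pts[i, 1] - a * pts[i, 0]
        resid = np.abs(pts[:, 1] - (a * pts[:, 0] + b))
        inliers = resid < thresh            # geometric verification
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # final least-squares refinement over all inliers (the role LM
    # plays for a camera pose)
    a, b = np.polyfit(pts[best_inliers, 0], pts[best_inliers, 1], 1)
    return a, b, best_inliers
```

In the pose case the minimal sample would be the 3 pairs P3P needs, and the residual would be the reprojection error of Eq. (2.15)-(2.16).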
Figure 7. RANSAC used to filter matching outliers. The colored correspondences are computed using the SIFT feature detector, providing many good matches along with a small number of outliers. Left: all colored correspondences are used, including the outliers. Right: RANSAC is used to find the set of inliers (green correspondences). Figure obtained from OpenCV documentation [13].
Chapter 3
Fast 3D Reconstruction Using One-Shot Spatial Structured Light
Introduction
This chapter presents a fast 3D reconstruction method based on one-shot spatially multiplexed structured light. In particular, a De Bruijn sequence [105] is used to encode a color grid. To improve the resolution and field of view (FOV) under larger dynamic scenes, we extend our prior work [100] to use 8 colors instead of 6 [81]. After adding the two colors, the resolution increases from 841 points to 4356 points, but this also complicates color detection. To compensate, we propose a Hamming distance based method to improve robustness. With only a single shot of a color grid, the depth information of a target object is extracted in real time. Given our focus on mobile applications with dynamic scenes, some detail, such as sub-pixel accuracy, is no longer our concern. Instead, aiming at a balance of efficiency and accuracy, we strive to improve the speed of the proposed method while a decrease in accuracy can be tolerated by the system. In our experiments, the entire process from projecting a pattern to triangulating for mesh reconstruction takes less than 1 second.
The rest of the chapter is organized as follows: the next section overviews our proposed approach, followed by a section presenting its details.
Overview of the Approach
The proposed structured-light method consists of a color camera and a projector, both of
which are calibrated with respect to a world coordinate system and have their intrinsic and
extrinsic parameters known. When the projector projects a color-coded light pattern onto a 3D object, the camera captures its image immediately. The 3D information of the object is then obtained by analyzing the deformation of the observed light pattern with respect to the projected one and identifying their correspondence.

Figure 8. System architecture
Matching pairs of light points and their images presents several challenges. First, with the aim of projecting only one light pattern for real-time applications, there is a tradeoff between reliability and accuracy in selecting proper color stripes in terms of the number of colors, length of codeword, and dimensions. In our method, a two-dimensional 8-color De Bruijn spatial grid is used to offer a confident solution, as explained in Section III.A. The more colors used, the higher the density and resolution, but also the greater the chance of color misdetection. To overcome this drawback, our method uses a two-level color label correction method. The grid pattern preserves spatial neighborhood information for all captured grid points. For instance, if two points are connected vertically/horizontally on the captured image, their correspondences must follow certain topological constraints (e.g., the same vertical/horizontal color). Those constraints are then used to correct color detection errors, as detailed in Section III.B. When dealing with more complex scenes that involve shadows, occlusion, and discontinuities, observed geometric constraints can be misleading [104]. Thus, a De Bruijn-based Hamming distance minimization method is established for further correction.
3D Reconstruction Process
As illustrated in Fig. 8, the reconstruction process starts with projecting, capturing, and image processing in order to retrieve the grid overlain on the object. The recovered stripe substrate forms a 2D intersection network used to solve the correspondence problem. Without loss of generality, the structured light codification is presented first, followed by a brief explanation of the image processing used to obtain a 2D colored intersection network. The detailed methods to remedy color detection errors and incomplete neighborhood information due to occlusion are then described.
Color-coded structured light. Our projected grid pattern is composed of vertical and horizontal colored slits following a De Bruijn sequence. A De Bruijn sequence of order n over an alphabet of k color symbols is a cyclic sequence of length k^n with a so-called window property: each subsequence of length n appears exactly once [105]. When such a sequence is applied to both vertical and horizontal stripes, it forms an m × m grid, where m = k^n + 2, and a unique k-color horizontal sequence overlain atop a unique k-color vertical sequence occurs only once in the grid. Let C be a set of color primitives, C = {c_l | l = 1, 2, 3, ..., 8}, where the numbers represent different colors. Each crosspoint in the grid matrix, v_i, is the intersection of a vertical stripe and a horizontal stripe, with a unique coordinate (x_i, y_i) ∈ Ω, where Ω is the camera's image plane, and colors (c_i^h, c_i^v), where c_i^h, c_i^v ∈ C represent the horizontal and vertical colors of v_i, respectively.

Therefore, the color-coded grid network can be interpreted as an undirected graph G = (V, E, P, ξ, ϑ, χ, τ), where
• V = {v_1, v_2, ..., v_{m×m}} represents the grid intersections.

• e_ij ∈ E represents a link between two nodes i and j in the grid.

• P is a vector of the 2D coordinates of the nodes in the camera's image plane, where p_j = (x_j, y_j), (x_j, y_j) ∈ Ω. ℓ(P) denotes the length of P.

• ξ: V → C assigns the color labels associated with a node, where ξ(v_i) = (c_i^h, c_i^v); ξ^h(v_i) = c_i^h, ξ^v(v_i) = c_i^v.

• ϑ: V → {ϑ(v_1), ϑ(v_2), ..., ϑ(v_{m×m})}, where ϑ(v_i) = ∪_j Nb_j(v_i); j ∈ {north, south, west, east}; Nb_j(v_i) ∈ V represents the neighbors of v_i. When the jth neighbor of v_i does not exist, Nb_j(v_i) is assigned the value NULL.

• χ: ϑ → {0, 1} is a decision function with a binary output:
∀j ∈ {north, south, west, east}, ∃Nb_j(v_i) ∈ ϑ(v_i) = NULL ⇒ χ(ϑ(v_i)) = 0; else χ(ϑ(v_i)) = 1.

• τ: V → P is a function for getting the 2D coordinates of v_i, where τ(v_i) = p_i.
In our implementation, C = 1, 2, ..., 8, representing Red, Yellow, Lime, Green, Cyan,
Blue, Purple and Magenta. The colors used for the horizontal stripes are Red, Lime, Cyan
and Purple, while Yellow, Green, Blue and Magenta are used for vertical stripes. They are
evenly distributed in HSV space for better color detection. Fig. 9 shows the resulting grid.
The use of a total of eight colors significantly improves the density and resolution in comparison to earlier works [113], [20].
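As an aside, the window property is easy to verify computationally. The sketch below uses the standard recursive Lyndon-word construction of a De Bruijn sequence (a textbook algorithm, not code from this thesis); with k = 4 colors per direction and order n = 3, every length-3 window of colors occurs exactly once in the cyclic sequence:

```python
def de_bruijn(k, n):
    """Standard recursive Lyndon-word construction of the De Bruijn
    sequence B(k, n) over the symbols 0..k-1 (length k**n)."""
    sequence = []
    a = [0] * (k * n)

    def db(t, p):
        if t > n:
            if n % p == 0:
                sequence.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return sequence

seq = de_bruijn(4, 3)                      # 4 colors, order 3
print(len(seq))                            # -> 64, i.e. k**n
# every length-3 window (taken cyclically) appears exactly once
windows = {tuple(seq[(i + j) % 64] for j in range(3)) for i in range(64)}
print(len(windows))                        # -> 64
```

In the grid pattern, one such sequence indexes the horizontal stripes and another the vertical stripes, so each crosspoint inherits a locally unique codeword.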
Figure 9. Color Grid Pattern
Image processing for grid extraction. Several image processing methods are used in our implementation to retrieve the stripe pattern overlain on the real-world scene. First, adaptive thresholding with a Gaussian kernel is used to correctly segment the projected and deformed grid. The idea is to check whether the RGB value of every pixel is greater than a threshold computed as the Gaussian-weighted average of the local intensity. Given that eight colors are used for the vertical and horizontal stripes in our implementation, such a threshold is not fixed but rather adapts to the local intensity of a 9 × 9 neighborhood of the pixel. The Hilditch thinning algorithm [45] is then applied until the skeleton of the grid is obtained. Based on the skeleton, intersections can then be located. The idea is to count the total number of rises (i.e., changes in pixel value from 0 to 1) and the total number of neighbors with a value of one within a 3 × 3 window centered at a pixel position of interest. The position is considered a grid intersection if both the total rises and the number of one-valued neighbors are at least 3, or if the total rises are greater than 1 and the number of one-valued neighbors is greater than 4.
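The rise-counting rule can be sketched as follows; the thresholds follow our reading of the rule above, while the function name and the toy skeleton (a plain 0/1 grid) are illustrative:

```python
def is_intersection(skel, r, c):
    """Classify pixel (r, c) of a binary skeleton as a grid intersection
    by counting 0->1 transitions ("rises") and one-valued neighbors in
    the 3x3 ring around it."""
    # the 8 neighbors in clockwise order, starting at the top-left
    ring = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
            (1, 1), (1, 0), (1, -1), (0, -1)]
    vals = [skel[r + dr][c + dc] for dr, dc in ring]
    rises = sum(1 for i in range(8) if vals[i - 1] == 0 and vals[i] == 1)
    ones = sum(vals)
    return (rises >= 3 and ones >= 3) or (rises > 1 and ones > 4)

# a '+'-shaped skeleton: the center is an intersection, arm pixels are not
skel = [[0, 0, 1, 0, 0],
        [0, 0, 1, 0, 0],
        [1, 1, 1, 1, 1],
        [0, 0, 1, 0, 0],
        [0, 0, 1, 0, 0]]
print(is_intersection(skel, 2, 2))  # -> True
print(is_intersection(skel, 2, 3))  # -> False
```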
Due to the complexity of the scene (e.g., shadows, distortion, depth discontinuity, and occlusion), there exist spurious connections and holes, especially at the border between the background and the object. To address this, the watershed algorithm [108] is first used to segment the target object from the background, resulting in two grid networks. The traversal algorithm from our prior work [100] is then applied to each network individually to determine the northern (Nb_north(v_i)), southern (Nb_south(v_i)), eastern (Nb_east(v_i)), and western (Nb_west(v_i)) neighbors of each intersection v_i. The algorithm moves a 3 × 1 (for the north-south traversal) or 1 × 3 (for the west-east traversal) window along the traversal direction toward the most recently chosen path pixel, until either an intersection is found or no intersection is found within an allotted maximum distance from the original intersection. An intersection v_i with all four neighbors determined is considered a perfect intersection and will be used for triangulation. Due to the aforementioned challenges, not all intersections have complete neighborhood information. To avoid losing those intersections when decoding the codewords for triangulation, further processing is needed, as elaborated in the section on neighbor completion below.
Color labeling and correction. Each intersection in our color-coded grid is represented by its unique neighborhood information, particularly the colors of its neighbors. To determine the color of each intersection, the captured color grid is blurred with a 3 × 3 Gaussian filter under a mask of the skeleton grid obtained in Section III.B. The purified color grid is then converted from RGB to HSV for color labeling of each pixel, where eight pairs of thresholds corresponding to the Hue ranges of the eight colors are applied. To compensate for color crosstalk, two morphological openings (with 5 × 1 and 1 × 5 structuring elements) are used to separate the horizontal and vertical stripes.
To determine the color labels of each intersection (i.e., c_i^h, c_i^v), a special vote majority rule is applied to a 9 × 9 neighborhood window of the intersection, where a 360° angular search is used to populate the histograms of the eight color labels. Each histogram of a particular color label has 18 bins, each corresponding to a non-overlapping 20° range. As stated in Section III.A, the color labels 1, 3, 5, and 7 are used for the horizontal stripes, while 2, 4, 6, and 8 are used for the vertical stripes. Let ρ_κ count the number of observations that fall into bin κ of a histogram, and let ρ_κ^{c_l} denote this count in the histogram of color label c_l. The color labels for the intersection can then be determined as follows:

c_i^h = g( max_{c_l = 1, 3, 5, 7; κ = 1, 2, ..., 18} (ρ_κ^{c_l}) )    (3.1)

c_i^v = g( max_{c_l = 2, 4, 6, 8; κ = 1, 2, ..., 18} (ρ_κ^{c_l}) )    (3.2)

where g(x): x → c_l, c_l ∈ C, maps x to its corresponding color label.
Although the special vote majority rule substantially minimizes the impact of poor color detection on the labeling of an intersection, errors remain. In view of those errors, a topological constraint is used for further correction. If intersections are linked horizontally in the camera image, their correspondences must be collinear. Moreover, those intersections must share the same horizontal color label, since they lie on the same horizontal stripe; the same idea applies to intersections on the same vertical stripe. Thus, when plotting the histogram of the color labels of the intersections on the same row or column, further correction is performed, where the row/column histogram has 4 bins, each corresponding to one of the four colors used for horizontal/vertical stripes:
c_i^h = g(max(ρ_κ^h)), κ = 1, 2, 3, 4    (3.3)

c_i^v = g(max(ρ_κ^v)), κ = 1, 2, 3, 4    (3.4)

where ρ_κ^h and ρ_κ^v are the bin counts in the histograms of color labels for the horizontal and vertical stripes, respectively.
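A minimal sketch of the row-wise part of this correction (Eq. 3.3): all intersections on one horizontal stripe vote, and every label on that row is replaced by the winner. The data and function name are illustrative:

```python
from collections import Counter

def correct_row_labels(row_labels, horizontal_colors=(1, 3, 5, 7)):
    """Replace every horizontal color label on a row with the row's
    majority label, since one horizontal stripe has a single color."""
    votes = Counter(l for l in row_labels if l in horizontal_colors)
    majority = votes.most_common(1)[0][0]
    return [majority] * len(row_labels)

# one mislabeled intersection (5) on a stripe that is mostly color 3
print(correct_row_labels([3, 3, 5, 3, 3]))  # -> [3, 3, 3, 3, 3]
```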
Neighbor completion using Hamming distance. Each intersection v_i together with its four neighbors forms two unique De Bruijn subsequences, one for the horizontal string (i.e., (ξ^h(Nb_west(v_i)), c_i^h, ξ^h(Nb_east(v_i)))) and the other for the vertical string (i.e., (ξ^v(Nb_north(v_i)), c_i^v, ξ^v(Nb_south(v_i)))). One difficulty in finding the correct correspondences is that not all intersections have complete neighborhood information, even after the traversal elaborated in Section III.C. The codewords resulting from those so-called "imperfect" intersections are literally broken, making it impossible to map them to the projected De Bruijn color grid. To address this, our method proposes a Hamming distance-based amendment procedure. The procedure first defines a function f(v_i) that returns the color labels of the intersection v_i and its neighbors,

f(v_i) = [(r_w^v, r_w^h); (c_i^v, c_i^h); (r_n^v, r_n^h); (r_e^v, r_e^h); (r_s^v, r_s^h)]    (3.5)

where (r_w^v, r_w^h), (r_n^v, r_n^h), (r_e^v, r_e^h), and (r_s^v, r_s^h) represent the western, northern, eastern, and southern neighbors' vertical and horizontal color labels of v_i. If v_i has complete neighborhood information, the values of r can be retrieved from ξ(·) in the grid graph; otherwise, the neighbors
are virtual, with colors to be determined through the procedure. The algorithm then traverses the retrieved intersections using a 3 × 4 or 4 × 3 moving window, depending on whether the vertical or horizontal colors of the missing neighbors are being labeled. For each intersection with missing neighbors, the procedure assigns the color labels of its missing neighbors the value 0. For instance, if the northern neighbor of v_i is missing, r_n^v = r_n^h = 0. Doing so makes the originally broken codewords complete but invalid. Such an amended but faulty word should have the minimal Hamming distance to its corresponding correct one. Based on this idea, the procedure then creates virtual neighbors of each "imperfect" intersection with the correct color labels. Compared to real intersections, virtual intersections only have color labels, without coordinates. The detailed amendment procedure is given in Algorithm 1 in Appendix A. For clarity, the algorithm only includes the procedures for creating virtual neighbors and assigning their vertical color labels; the same procedures, except for the moving window size, apply to their horizontal colors.
In Algorithm 1 (Appendix A), the inputs are:

• G: the original retrieved intersection grid.

• ~v = {s_0, s_1, ..., s_{m×m−1}}: a matrix set where each element s_j is a 3 × 4 matrix whose rows each form a vertical De Bruijn subsequence (a_{0j}, a_{1j}, a_{2j}, a_{3j}) of length 4:

          [ a_{0j}  a_{1j}  a_{2j}  a_{3j} ]
    s_j = [ a_{0j}  a_{1j}  a_{2j}  a_{3j} ]
          [ a_{0j}  a_{1j}  a_{2j}  a_{3j} ]

• H: the function for calculating the per-element Hamming distance between two matrices.

The outputs are:

• r_s^v, r_w^v, r_e^v, and r_n^v.

• G′: the amended intersection grid with virtual neighbors.

The two matrices π^v and π^h are of size 3 × 4 and 4 × 3 for the horizontal and vertical cases, respectively.
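The core of the amendment, stripped of the traversal bookkeeping, is a nearest-codeword search: missing labels are encoded as 0, and the candidate De Bruijn subsequence at minimal per-element Hamming distance supplies the virtual neighbors' colors. A minimal sketch (the candidate list and function names here are hypothetical):

```python
def hamming(a, b):
    """Per-element Hamming distance between equal-length codewords."""
    return sum(x != y for x, y in zip(a, b))

def complete_codeword(broken, candidates):
    """Return the candidate subsequence closest to the broken codeword,
    in which missing color labels are encoded as 0."""
    return min(candidates, key=lambda c: hamming(broken, c))

# hypothetical vertical subsequences taken from the projected grid
candidates = [(2, 4, 6, 8), (4, 6, 8, 2), (6, 8, 2, 4)]
broken = (4, 6, 0, 2)                         # third label missing
print(complete_codeword(broken, candidates))  # -> (4, 6, 8, 2)
```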
Chapter 4
Markerless Tracking and Camera Pose Estimation
Introduction
The proposed markerless AR system consists of two main stages: range finding using SL (Chapter 3), and markerless tracking by matching SURF descriptors and camera pose estimation using PnP. The system architecture is shown in Fig. 10. The initialization, image processing, and reconstruction (range finding) stages are described in detail in Fig. 8. In this chapter, the tracking and pose estimation process (the loop in Fig. 10) is discussed.
Figure 10. The proposed markerless AR system
System Calibration
A binocular 3D reconstruction system consists of an optical camera and a projector, where
the projector is used for pattern projection and the camera for image acquisition. The
precision and accuracy of reconstruction and tracking is highly dependent on the intrinsic
and extrinsic parameters of the camera and projector. Thus a proper system calibration is a
prerequisite for correct computation of any scaled or metric results from their images.
The intrinsic camera parameters consist of the horizontal and vertical focal lengths fx and fy, the principal point (cx, cy) of the camera, and the skew. In our system the skew is assumed to be zero, fx and fy can be calculated from the camera/projector focal length F and field of view (FOV), and the principal point is taken as the image center. To compensate for distortion of the camera, the radial and tangential distortion coefficients up to the second order need to be computed. To calibrate these, a sufficient number of 2D-3D correspondences has to be created, which is typically done using a known target
such as a chessboard. For the calibration of a low-resolution camera, we use the calibration method proposed by Zhang [118] through the OpenCV [13] library. In this method, the size of the chessboard and of a single cell is defined by the user. Several shots of the chessboard are then captured and its inner corners are detected. The reprojection error between the detected corners and the chessboard corners is reduced by iteratively updating the intrinsic and extrinsic parameters of the camera.
The extrinsic parameters are the translation vector t and rotation matrix R that bring the camera to the projector in the 3D world coordinate system.
For each image the camera takes of a particular object, we can describe the pose of the object relative to the camera coordinate system in terms of a rotation and a translation; see Fig. 11.
A pinhole camera model describes a perspective transformation from 3D to 2D, where a scene view is formed by projecting 3D points onto the image plane using Eq. 4.1:

s p' = A [R|t] P'    (4.1)
Figure 11. Converting from object to camera coordinate systems: the point P on the object is seen as point p on the image plane; p is related to P by applying a rotation matrix R and a translation vector t to P [49]
or, in expanded form,

      [u]   [fx  0  cx] [r11 r12 r13 t1] [X]
    s [v] = [ 0  fy cy] [r21 r22 r23 t2] [Y]
      [1]   [ 0  0   1] [r31 r32 r33 t3] [Z]
                                         [1]    (4.2)
where:
• (X, Y, Z) are the coordinates of a 3D point P in the world coordinate space
• (u, v) are the coordinates of the projection point p in pixels
• A is a camera matrix, or a matrix of intrinsic parameters
• (cx, cy) is a principal point that is usually at the image center
• fx, fy are the focal lengths expressed in pixel units.
The intrinsic parameters are independent of the scene viewed. Once calibrated, they can be re-used as long as the focal length is fixed. The joint rotation-translation matrix [R|t] is called the matrix of extrinsic parameters. It describes the camera motion around a static scene or, vice versa, the rigid motion of an object in front of a still camera. The rigid transformation above can be written as (when z ≠ 0):

[x, y, z]^T = R [X, Y, Z]^T + t
x' = x/z
y' = y/z
u = fx x' + cx
v = fy y' + cy
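The pipeline above (rigid transform, perspective division, intrinsic mapping) condenses to a few lines; the function name and the numbers in the example are illustrative:

```python
def project_point(P, R, t, fx, fy, cx, cy):
    """Project a 3D world point P = (X, Y, Z) to pixel coordinates via
    the pinhole model: [x, y, z]^T = R P + t, then x' = x/z, y' = y/z,
    then u = fx*x' + cx, v = fy*y' + cy (assumes z != 0)."""
    x, y, z = (sum(R[r][c] * P[c] for c in range(3)) + t[r] for r in range(3))
    xp, yp = x / z, y / z
    return fx * xp + cx, fy * yp + cy

# identity pose: a point on the optical axis lands on the principal point
I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(project_point((0.0, 0.0, 2.0), I3, [0, 0, 0], 800, 800, 320, 240))
# -> (320.0, 240.0)
```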
Real lenses usually have some distortion, mostly radial distortion and slight tangential
distortion. So, the above model is extended as:
[x, y, z]^T = R [X, Y, Z]^T + t
x' = x/z
y' = y/z
x'' = x' (1 + k1 r^2 + k2 r^4 + k3 r^6) / (1 + k4 r^2 + k5 r^4 + k6 r^6) + 2 p1 x' y' + p2 (r^2 + 2 x'^2)
y'' = y' (1 + k1 r^2 + k2 r^4 + k3 r^6) / (1 + k4 r^2 + k5 r^4 + k6 r^6) + p1 (r^2 + 2 y'^2) + 2 p2 x' y'
where r^2 = x'^2 + y'^2
u = fx x'' + cx
v = fy y'' + cy
where k1, k2, k3, k4, k5, and k6 are radial distortion coefficients, and p1 and p2 are tangential distortion coefficients. The distortion coefficients do not depend on the scene viewed; thus, they also belong to the intrinsic camera parameters. For the distortion we take into account the radial and tangential factors [13]. For the radial factor we use the following formulas:

x_corrected = x (1 + k1 r^2 + k2 r^4 + k3 r^6)    (4.3)

y_corrected = y (1 + k1 r^2 + k2 r^4 + k3 r^6)    (4.4)
So for a pixel at coordinates (x, y) in the original image, its position in the corrected image is (x_corrected, y_corrected). Tangential distortion occurs when the camera lens is not perfectly parallel to the imaging plane. It is corrected using the formulas below:

x_corrected = x + [2 p1 x y + p2 (r^2 + 2 x^2)]    (4.5)

y_corrected = y + [p1 (r^2 + 2 y^2) + 2 p2 x y]    (4.6)

So we have five distortion parameters, presented as a one-row matrix with five columns:

Distortion coefficients = (k1  k2  p1  p2  k3)    (4.7)
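The five-coefficient model of Eqs. 4.3-4.6 can be sketched directly (normalized coordinates in, corrected coordinates out; the function name is ours):

```python
def correct_point(x, y, k1, k2, k3, p1, p2):
    """Apply the radial (Eqs. 4.3-4.4) and tangential (Eqs. 4.5-4.6)
    correction terms to a normalized image point (x, y)."""
    r2 = x * x + y * y
    radial = 1 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    x_corr = x * radial + 2 * p1 * x * y + p2 * (r2 + 2 * x * x)
    y_corr = y * radial + p1 * (r2 + 2 * y * y) + 2 * p2 * x * y
    return x_corr, y_corr

# with all five coefficients zero, the point is unchanged
print(correct_point(0.5, -0.25, 0, 0, 0, 0, 0))  # -> (0.5, -0.25)
```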
The projector's intrinsics and the camera-projector pair extrinsics can be calibrated using the aforementioned method by treating the projector as an inverse camera; we use the software provided by [74] for system calibration.
Feature Based Tracking
As reviewed earlier, in a practical AR application markerless tracking must meet the following requirements:

• Robustness: the tracking must be robust against noise and be able to recover from failure.

• Performance: the tracking must be fast enough to ensure that AR performs in real time.

• Set-up time: the initialization time must be as short as possible for a practical AR application.
In our markerless system, the SURF detector and descriptor are used due to their robustness under weak lighting conditions, invariance to scale and orientation, and fast computation. All of these features enable real-time tracking.
SURF is recognized as a more efficient substitute for SIFT. It has a Hessian-based detector and a distribution-based descriptor generator.
The SURF detector is based on the Hessian matrix, defined as

    H(x, σ) = [ Lxx(x, σ)  Lxy(x, σ) ]
              [ Lxy(x, σ)  Lyy(x, σ) ]    (4.8)
where L is the convolution of the Gaussian second-order derivative with the image I at point x. In the implementation of [68], simpler box filters are used instead of Gaussian filters. An image pyramid is built and convolved with the box filters to obtain scale-invariance properties.
To describe a SURF keypoint, an orientation is first assigned to it. A square region aligned with this orientation is then constructed around the keypoint, from which the SURF descriptor is extracted. The region is split regularly into 4 × 4 square subregions. The Haar wavelet responses in the horizontal (dx) and vertical (dy) directions are computed and summed over each subregion, forming the first entries of the feature vector. The absolute values of the responses, |dx| and |dy|, are also summed, so that each subregion contributes a four-dimensional descriptor. Over all 4 × 4 subregions, this results in a vector of length 64.
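The 64-dimensional layout can be made concrete with a small sketch: for each of the 4 × 4 subregions, the four entries (Σdx, Σdy, Σ|dx|, Σ|dy|) are concatenated. The dummy response values below stand in for real Haar wavelet outputs:

```python
def surf_descriptor(dx, dy):
    """Assemble the 64-D SURF descriptor layout from per-subregion Haar
    responses: 4 x 4 subregions x (sum dx, sum dy, sum |dx|, sum |dy|)."""
    vec = []
    for i in range(4):
        for j in range(4):
            rx, ry = dx[i][j], dy[i][j]
            vec += [sum(rx), sum(ry),
                    sum(abs(v) for v in rx), sum(abs(v) for v in ry)]
    return vec

# 4 x 4 subregions, each holding a few dummy wavelet responses
dx = [[[0.1, -0.2] for _ in range(4)] for _ in range(4)]
dy = [[[0.3, 0.4] for _ in range(4)] for _ in range(4)]
print(len(surf_descriptor(dx, dy)))  # -> 64
```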
In our system, the camera captures an image of the scene at the initialization stage and computes the SURF descriptor of each detected keypoint. Note that the 3D coordinates of these keypoints are already calculated in the previous 3D reconstruction stage, as described in Chapter 3. Each keypoint is assigned the 3D coordinates of its nearest color grid intersection. In the end, these keypoints, their descriptors, and their 2D locations are stored as a reference.
As the camera/object moves around the scene, SURF descriptors are extracted from each incoming frame. Once SURF generates its floating-point descriptors, a brute-force k-nearest-neighbors (k-NN) matching algorithm is applied to find the 2 nearest matches between the descriptors, using the L2 norm as the distance metric. To discard mismatched keypoint pairs and guarantee strong correspondences, a distance-ratio threshold is applied: only when the ratio between the best match distance and the second-best match distance is below a specific threshold (e.g., 0.7) is the best match accepted as a "correct" one.
Then, the detected 2D/3D feature correspondences are used to determine the exact 6-DoF camera pose by applying several Perspective-n-Point (PnP) iterations, followed by a pose refinement step with a non-linear least-squares algorithm. If enough feature correspondences have been detected and validated by the RANSAC iterations, the number of used features n can be decreased, thereby speeding up pose estimation; the 6-DoF camera pose is then computed using these n features.
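The 2-NN distance-ratio test can be sketched as follows (plain tuples stand in for real SURF descriptors; the 0.7 threshold matches the text):

```python
def ratio_test_match(query, train_descs, ratio=0.7):
    """Brute-force 2-NN matching with a distance-ratio test: accept the
    nearest neighbor only if best / second_best < ratio (L2 metric)."""
    def l2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    dists = sorted((l2(query, d), idx) for idx, d in enumerate(train_descs))
    (best, best_idx), (second, _) = dists[0], dists[1]
    if second > 0 and best / second < ratio:
        return best_idx          # strong, unambiguous correspondence
    return None                  # ambiguous match -> rejected

train = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
print(ratio_test_match((0.1, 0.0), train))  # -> 0
print(ratio_test_match((0.5, 0.5), train))  # -> None
```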
Camera Tracking and Pose Estimation Using Perspective-n-Points
In our case, to allow for an accurate pose estimation, the camera intrinsics are first calibrated using the method in the System Calibration section. With the known camera intrinsics, iterative PnP [83], based on Levenberg-Marquardt optimization, is used for pose estimation. It finds the pose that minimizes the reprojection error, i.e., the sum of squared distances (SSD) between the observed 2D projections in the camera frames and the object points projected using Eq. (4.1). Although iterative PnP is slower than P3P [40] or EPnP [61], it returns the most robust estimation. The pose estimation procedure is as follows:
1. A set of the object's points is selected and their 3D world coordinates are reconstructed using the SL range-finding method described in Chapter 3: (P_1, P_2, P_3, ..., P_i), i ≥ 6. Their initial 2D coordinates in the camera image plane are (p_1^0, p_2^0, p_3^0, ..., p_i^0) (Fig. 15a).

2. The SURF descriptors (d_1^0, d_2^0, d_3^0, ..., d_i^0) of the initial 2D keypoints (p_1^0, p_2^0, p_3^0, ..., p_i^0) are extracted in the initial frame (frame 0).

3. As the camera/object moves in the scene, a set of keypoints (p_1^j, p_2^j, p_3^j, ..., p_k^j) in the current incoming frame (the jth frame) is detected and described by SURF, yielding their descriptors (d_1^j, d_2^j, d_3^j, ..., d_k^j).

4. A brute-force 2-NN (k-NN with k = 2) matching algorithm (Algorithm 1) is applied to find correspondences between the initial 2D keypoints' descriptors (d_1^0, ..., d_i^0) and the current frame's SURF descriptors (d_1^j, ..., d_k^j) using the L2 norm. In our practical application, a stable and fast matching result is obtained using the parameters τ = 3 and i, k > 6 in Algorithm 1.

5. An iterative PnP algorithm is used to solve for the rotation matrix R and translation vector t using the correspondences between the 3D coordinates (P_1, P_2, P_3, ..., P_i), i ≥ 6, and the new 2D coordinates (p_1^j, p_2^j, p_3^j, ..., p_k^j).

6. Repeat steps 3-5 to update the camera pose as the camera moves around the scene (Fig. 15).
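The quantity minimized in step 5 is the reprojection error; a minimal sketch of its computation (pure Python, names ours) makes the objective explicit:

```python
def reprojection_error(points3d, points2d, R, t, fx, fy, cx, cy):
    """Sum of squared distances between observed 2D keypoints and the
    3D points projected under the pose hypothesis (R, t); iterative PnP
    searches for the (R, t) minimizing this value."""
    err = 0.0
    for P, (u_obs, v_obs) in zip(points3d, points2d):
        x, y, z = (sum(R[r][c] * P[c] for c in range(3)) + t[r]
                   for r in range(3))
        u, v = fx * x / z + cx, fy * y / z + cy
        err += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return err

# the exact pose reprojects the point perfectly, so the error is zero
I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(reprojection_error([(0.0, 0.0, 2.0)], [(320.0, 240.0)],
                         I3, [0, 0, 0], 800, 800, 320, 240))  # -> 0.0
```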
Algorithm 1. Brute-force k-NN based matching algorithm

 1: create a matching score matrix Δ
 2: for m ← 1 to k do
 3:   for n ← 1 to i do
 4:     Δ[m, n] = δ_mn = L2(d_m^0, d_n^j)
 5:   end for
 6: end for
 7: for m ← 1 to k do
 8:   q_mn = min(δ_m1, δ_m2, δ_m3, ..., δ_mi)
      q'_ml = min((δ_m1, δ_m2, δ_m3, ..., δ_mi) \ q_mn)
 9:   if q_mn / q'_ml > τ then
10:     a strong match pair of descriptors (d_m^0, d_n^j) is defined by (m, n)
11:   end if
12: end for
13: the created matching score matrix Δ:

         [ δ_11  δ_12  δ_13  ...  δ_1i ]
     Δ = [ δ_21  δ_22  δ_23  ...  δ_2i ]
         [  ...   ...   ...  ...   ... ]
         [ δ_k1  δ_k2  δ_k3  ...  δ_ki ]
Chapter 5
Experiment and Results
Our implementation setup is composed of (a) a desktop with an Intel quad-core (3.40 GHz) i7-2600K CPU, 8 GB of DDR3-1600 memory, and a 64-bit Windows 7 operating system; (b) a Point Grey Flea3 color camera running at a resolution of 640 × 480; and (c) an Optima 66HD DLP projector set to a resolution of 1280 × 800, as shown in Fig. 12. When the projector projects a color-coded light pattern onto a 3D object, the camera, located 1 meter away from the object, captures the image. Both camera and projector are calibrated with their intrinsic and extrinsic parameters (Tables 1, 2, and 3). To validate the accuracy and speed of our proposed method, a plaster bust with a complex surface is used in our experiments. The proposed markerless AR system performs at 30 frames per second (fps), which is considered real-time in our context.
Table 1
Camera and projector intrinsic parameters
Intrinsics   fx       fy       cx      cy      k1     k2     p1     p2     k3    ε
Camera       1796.10  1792.37  259.97  288.02  -0.23  -5.46  0.00   0.00   0.00  0.09
Projector    1404.45  1348.78  477.97  276.15  -0.48  2.57   -0.03  -0.01  0.00  0.09
Table 2
Camera-projector pair rotation matrix
Rotation matrix (R)   r11   r12   r13    r21    r22   r23   r31   r32    r33
Cam-proj pair         0.95  0.06  -0.32  -0.03  0.99  0.11  0.33  -0.09  0.94
Figure 12. System setup: a projector on the bottom-left, a camera on the right, and a plaster bust in the scene.
Table 3
Camera-projector pair translation vector
Translation vector (t)   tx       ty        tz
Cam-proj pair            377.29   -291.63   243.83
3D Reconstruction
Figure 13. SL reconstruction process: (a) A plaster bust in our experiment. (b) Color grid pattern projected onto the plaster bust. (c) Extracted color grid from (b)
Fig. 13 shows the object (Fig. 13a) and the captured intersection network with 62 × 60 color stripes (Fig. 13b). Due to the highly deformed surface, especially in the hair, eye, and nose areas, there are several holes, as depicted in Fig. 13c. Most of them are due to occlusion distortion. Note that with the help of the watershed algorithm, the impact of the sharp depth discontinuity between the background and the object is minimal. The reconstructed point cloud and mesh are shown in Fig. 14. In the experiment, the total number of intersections in the projected De Bruijn grid (66 × 66) is 4356. However, four horizontal and six vertical stripes fall outside the camera's FOV, resulting in a captured grid of size 62 × 60 with 3720 intersections. Due to occlusion distortion, 393 intersections are missing,
especially in the areas of the hair, the eyes, and the borders between the bust and the background. Against this ground truth, our method recovers 3327 intersections, with 267 labeled incorrectly and 110 containing incomplete neighborhood information. This corresponds to 88.6% correct assignment.
Figure 14. Reconstructed (a) point cloud and (b) mesh
With the focus on local optimization, such as the vote majority rule for color detection and correction and the De Bruijn-based Hamming distance for improving the neighborhood information of intersections, our method achieves the best speed in comparison to existing one-shot methods: in [104] the running time for 1000 grid intersections was about 3 minutes, and in [21] the running time for 1110 grid intersections was about 313 ms. In unoptimized C++ code, our running time for 3327 grid intersections was about 1 second.
Markerless Tracking and Pose Estimation
In our AR system the world origin is at the camera optical center. As shown in Fig. 11, the direction pointing from the camera optical center to the camera front is defined as positive Z. Similarly, positive Y is defined as the direction from the camera optical center to the camera bottom, and positive X as the direction from the camera optical center to the camera right. The clockwise rotations around the X, Y, and Z axes (i.e., rx, ry, rz) are termed pitch, yaw, and roll, respectively. In a similar fashion, the translations along the X, Y, and Z axes are denoted tx, ty, and tz, respectively.
A tripod with a camera ball mount is used to enable camera rotation and translation. For
the ground truth, an iPhone 6 is mounted on the top of the camera. Thus, while the iPhone
rotates with the camera, its built-in sensor (i.e., InvenSense MP67B) can precisely measure
the rotation. On the other hand, a tape ruler is used to physically measure the translation.
Eventually, by moving the tripod around the scene, the performance of the developed system is tested in terms of camera rotation and translation. In our system setup (Chapter 5), we test the proposed markerless AR system in a scene of 100 inches × 60 inches × 60 inches, while the bust (or a part of it) stays in the camera's FOV.
In the first set of experiments, the rotation and translation are computed and measured
between the camera initial pose and the current frame. As exemplified in Fig. 15, the first
set of SURF keypoints (red dots) are extracted in the initial frame (Fig. 15a), where an
initial pose is estimated using the 3D-2D correspondences and visualized by the X, Y, and
Z axes. The X axis (red) points to the right, the Y axis (green) down, and the Z axis (blue) into the image. While the camera moves in the scene, a new set of SURF keypoints is extracted for new pose estimation, as shown in Figs. 15b, 15c, and 15d. Note that the number of extracted keypoints varies from frame to frame. Even when occlusion occurs, as shown in Fig. 15d, our AR system is still robust enough to correctly compute the pose.
Figure 15. Camera tracking and pose estimation using PnP. The camera moves around the scene and its poses are estimated in real time. The rotations and translations along the X (red), Y (green), and Z (blue) axes are rx, ry, rz and tx, ty, tz. (a) SURF keypoints (red dots) extracted at the initial pose. (b) The camera moves closer to the bust. (c) The camera moves away from the bust. (d) A correct estimation under partial occlusion.

In the second set of experiments, the camera's motion data (i.e., rotation and translation) are obtained and compared with the ground truth at 27 different displacements of the camera. Due to sensor noise, we acquire 10 groups of motion data at each displacement, including the pose estimation from our AR system (i.e., Cam(rx), Cam(ry), Cam(rz),
Cam(tx), Cam(ty), Cam(tz)) and the physical measurement (True(rx), True(ry), True(rz),
True(tx), True(ty), True(tz)). The mean of the 10 data points is plotted and the corresponding errors in comparison to the ground truth are presented in Fig. 16. Figs. 16a and 16c show the rotation (in degrees), and Figs. 16b and 16d the translation (in inches). As can be seen in Figs. 16a and 16b, the plots are well superimposed. With the tracked object 60 inches from the camera, the average absolute error of our system is 1.98 inches in translation and 1.07 degrees in rotation (with standard deviations of 1.91 inches and 2.42 degrees). More detail can be found in Table 4. Note that the maximum error is 8.89 inches in translation and 7.25 degrees in rotation, due to the following reasons:
• A rough calibration of the camera-projector pair causes 3D reconstruction and tracking errors.

• The physical measurement using the tape ruler and the iPhone 6 gyroscope has a low precision of 0.5 inch in translation and 1 degree in rotation.

• To allow for a real-time AR application, we set the camera shutter to 15 ms and the gain to 20 dB, which increases noise and sometimes causes a flash effect in which lightness and contrast change suddenly in the incoming frame.
Table 4
Accuracy of the markerless pose estimation: Mean Error (inches and degrees), Standard Deviation, and Maximum Error
        error(rx)  error(ry)  error(rz)  error(tx)  error(ty)  error(tz)
mean      -0.03      -1.22       1.95       3.69      -0.90       1.35
std        2.88       2.45       1.95       2.17       1.61       1.95
max        6.98       1.97       7.25       8.89       1.17       8.65
Figure 16. Accuracy of the proposed markerless pose estimation: (a) and (c) camera translation measured using the tape ruler and our AR system. (b) and (d) corresponding errors (inches, degrees).

Processing Speed Analysis

One of the development goals is to achieve real-time AR performance that supports user interactivity. When moving and rotating the camera along the X, Y, and Z axes, our AR system consistently performs at an average of 30 fps, as shown in Fig. 17. Note that the FPS is measured while moving the camera in the scene described in Section 5.2. The FPS is calculated as 1/processing time, where the processing time includes tracking and pose estimation for a single frame.
To understand how each process contributes to the overall system speed, for future improvement, we look into the time cost of each process carefully. First, we measure the time spent on the three processes (i.e., initialization, image processing, and reconstruction) that occur once and only once in the AR pipeline. The initialization process includes all the stages that create and assign global variables, such as the De Bruijn color grid pattern, camera-projector calibration, and the first frame. All the necessary procedures
that find intersection correspondences belong to image processing, for example, adaptive thresholding to extract the color grid, Gaussian blur to remove image noise, hue thresholding, morphological operations, and skeletonization. The reconstruction involves intersection-neighbor traversal, intersection color labeling, Hamming distance-based color label correction, intersection 2D coordinate computation by decoding the De Bruijn sequence, and intersection triangulation. As can be seen in Fig. 18, the initialization costs 3641 ms, image processing 693 ms, and reconstruction 502 ms. Skeletonization (366 ms), morphological operations (201 ms), and thresholding (126 ms) are relatively expensive processes.

Figure 17. Frame rate (fps) of the proposed method
Second, we evaluate the tracking and pose estimation in a similar fashion. As shown in Fig. 18b, the total running time for this process is 34 ms (29.4 fps). Although iterative PnP and brute-force matching algorithms are used, the proportion of time spent on matching (5 ms) and pose estimation (6 ms) in the entire iterative tracking and pose estimation process is smaller than that of SURF feature detection and description (25 ms). According to Fig. 17, the three procedures are executed at an average of 30 Hz. For each tracked frame, there are approximately 10 SURF feature points to be tracked. As also shown in Fig. 17, there are some drops in the FPS curve due to an increase in the number of detected SURF feature points. For instance, at frame 193 the system detects 15 feature points; the extra time spent on feature description, matching, and iterative PnP pose estimation for those points reduces the rate to 10.5 fps.

Figure 18. Overall performance: (a) Initialization. (b) Iterative tracking and pose estimation.
Chapter 6
Discussion and Conclusion
This thesis proposes a markerless augmented reality system that uses one-shot spa-
tially multiplexed structured light. In our AR system, the 3D points are obtained by SL
reconstruction, the 2D points are obtained by SURF, and the pose is estimated from the
resulting 3D-2D correspondences. The application of SL overcomes a deficiency of exist-
ing marker-based and markerless AR methods, which rely on fiducial markers, prior
information about the real-world scene, or environment-to-environment comparison, thereby
making the proposed methodology more suitable for AR in an unknown and uncertain
environment. The fast one-shot real-time 3D reconstruction (range-finding) method uses a
De Bruijn encoded color pattern. Unlike previous one-shot approaches, our formulation
focuses on local optimization, where a Hamming distance minimization method is used
to recover missing topological information and a special majority vote rule is applied for
color detection and correction. The proposed range-finding method proves to be effi-
cient and achieves a desirable accuracy, even when applied to a very complicated scene.
Note that with the focus on mobile applications, such as mobile interactive augmented re-
ality [100], we emphasize speed over accuracy. Thus we use only about 10 3D points in
our application, which are sufficient to estimate the camera pose, as a larger number of 3D
points may significantly slow down tracking and pose estimation. The proposed feature-
point-based tracking and pose estimation method is tested in an indoor application; the
average absolute error is 1.98 inches in translation and 1.07 degrees in rotation, which is
considered reliable in our scenario. The entire system was tested on a desktop and was
shown to operate at approximately 30 Hz with tracking and pose estimation in each frame.
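The 3D-2D pose estimation step summarized above can be illustrated with a minimal sketch. The thesis uses an iterative PnP solver; the linear DLT below is a simplified stand-in, not the thesis implementation, and assumes noise-free correspondences in normalized image coordinates with at least six points:

```python
import numpy as np

def dlt_pnp(X, x):
    """Linear PnP via the Direct Linear Transform: recover [R|t] from 3D
    points X (n x 3) and their normalized image projections x (n x 2)."""
    n = len(X)
    Xh = np.hstack([X, np.ones((n, 1))])          # homogeneous 3D points
    A = np.zeros((2 * n, 12))
    for i, (u, v) in enumerate(x):
        A[2 * i, 4:8] = -Xh[i]                    # from  v*(p3.X) - (p2.X) = 0
        A[2 * i, 8:12] = v * Xh[i]
        A[2 * i + 1, 0:4] = Xh[i]                 # from  (p1.X) - u*(p3.X) = 0
        A[2 * i + 1, 8:12] = -u * Xh[i]
    P = np.linalg.svd(A)[2][-1].reshape(3, 4)     # null vector, up to scale
    U, S, Vt = np.linalg.svd(P[:, :3])
    R, scale = U @ Vt, S.mean()                   # nearest rotation + scale
    if np.linalg.det(R) < 0:                      # enforce a proper rotation
        R, scale = -R, -scale
    return R, P[:, 3] / scale
```

With synthetic noise-free data the recovered rotation and translation match the ground truth to numerical precision; a real system would wrap such an estimate in RANSAC and iterative refinement, as described in this thesis.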
Discussion and Future Work
The proposed algorithm is in its infancy and has many areas of potential improvement,
particularly in the following aspects:
• This work is only a proof of concept that our AR system can be implemented on
mobile devices. One of our next steps is to port the code to mobile devices for
verification.
• Our AR system is based on SL and visual tracking, so color and feature point de-
tection is sensitive to lighting conditions. An adaptive color and feature detection
algorithm with an automatic outlier rejection mechanism is therefore worth future
investigation to enhance the robustness of 3D reconstruction and tracking.
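The majority-vote color correction used in our reconstruction can be sketched as follows. The hue ranges and the three-label color set are assumptions for illustration, not the system's calibrated values:

```python
from collections import Counter

# Assumed hue ranges in degrees for the pattern's primary colors (red, green,
# blue); an adaptive system would re-estimate these under changing lighting.
HUE_RANGES = {0: (330, 30), 1: (90, 150), 2: (210, 270)}

def classify_hue(hue):
    """Map a hue in [0, 360) to a color label, or -1 if it fits no range."""
    for label, (lo, hi) in HUE_RANGES.items():
        # ranges like (330, 30) wrap around 0 degrees
        inside = (lo <= hue < hi) if lo < hi else (hue >= lo or hue < hi)
        if inside:
            return label
    return -1

def majority_vote(hues):
    """Correct noisy per-pixel decisions: the label most pixels agree on wins."""
    votes = Counter(classify_hue(h) for h in hues)
    votes.pop(-1, None)                    # discard unclassified pixels
    return votes.most_common(1)[0][0] if votes else -1
```

A single mislit pixel then no longer flips the decoded color of a stripe, since its vote is outweighed by its neighbors.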
• Another future step, currently work in progress, is an extrinsic calibration of the
camera-projector pair using SL. This method uses correspondences between the
projected color grid pattern and the camera frame to estimate an essential matrix using
RANSAC. The rotation matrix and translation vector are then obtained by decom-
posing the essential matrix. This calibration method simplifies the process described
in Section , since no extra software is needed.
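The decomposition step of this planned calibration can be sketched with NumPy. RANSAC and the correspondence search are omitted; the snippet only shows how a known essential matrix E = [t]x R factors into its rotation candidates and translation direction (the four-fold ambiguity would normally be resolved by a cheirality check):

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]x, so that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0, -t[2], t[1]],
                     [t[2], 0, -t[0]],
                     [-t[1], t[0], 0]])

def decompose_essential(E):
    """Return the two rotation candidates and the translation direction."""
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0:               # force proper orthogonal factors;
        U = -U                             # flipping only changes the sign of
    if np.linalg.det(Vt) < 0:              # E, which is unobservable anyway
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    return U @ W @ Vt, U @ W.T @ Vt, U[:, 2]   # R1, R2, t (up to sign/scale)
```

The true rotation appears among the two candidates, and the recovered translation direction is parallel to the true baseline, which is exactly what the camera-projector calibration needs (the scale must come from elsewhere, e.g. a known pattern dimension).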
• To improve the accuracy of the reconstruction method, the study of global optimiza-
tion algorithms, such as Markov Random Fields (Section ), is another future direction;
such an algorithm would find an optimal solution for the correspondence problem in
Section , where topological and epipolar constraints serve as score functions.
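As a toy illustration of what such a global formulation looks like (an assumed simplification, far from a full 2D MRF over the grid), the snippet below labels a 1D chain of grid points by minimizing unary mismatch costs plus a pairwise consistency term with dynamic programming, which is the exact solver for chain-structured MRFs:

```python
def chain_mrf(unary, pairwise):
    """Exact MAP labeling of a chain MRF by dynamic programming (Viterbi).
    unary[i][k]   : cost of giving node i the label k
    pairwise[k][l]: cost of adjacent labels (k, l)"""
    n, K = len(unary), len(unary[0])
    cost = [list(unary[0])] + [[0.0] * K for _ in range(n - 1)]
    back = [[0] * K for _ in range(n)]
    for i in range(1, n):
        for k in range(K):
            best = min(range(K), key=lambda l: cost[i - 1][l] + pairwise[l][k])
            back[i][k] = best
            cost[i][k] = cost[i - 1][best] + pairwise[best][k] + unary[i][k]
    labels = [min(range(K), key=lambda k: cost[-1][k])]
    for i in range(n - 1, 0, -1):          # trace the optimal path back
        labels.append(back[i][labels[-1]])
    return labels[::-1]
```

With a pairwise term that heavily penalizes equal adjacent labels (mimicking the De Bruijn pattern's no-repeat property), a locally ambiguous color observation is overridden by the globally cheaper labeling.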
• The performance of the proposed pose estimation method is highly dependent on the
feature detection, description and matching algorithms, as illustrated in Fig. 18b. Al-
though in our implementation the SURF detector, descriptor and brute-force 2-NN
matching algorithm are sufficient to achieve 30 Hz AR, this performance will de-
crease on mobile devices due to hardware limitations. Future work is to find a
balance between the computational cost and the reliability of the tracking method.
We will test other, faster detectors, descriptors and matching algorithms, such as the
FAST detector, the BRISK and FREAK descriptors, and a Hamming-distance-based
matcher (if binary descriptors are used) or FLANN's hierarchical-clustering approach [75].
Another route is to treat feature matching as a classification problem: instead of creat-
ing feature descriptors and matching algorithms, feature correspondences can also be
determined by classifiers [80] specifying the characteristics of feature points [39, 60].
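A Hamming-distance brute-force matcher with a 2-NN ratio test, as would be paired with binary descriptors such as BRISK or FREAK, can be sketched in a few lines. This is a pure-NumPy stand-in for illustration; a real system would use an optimized matcher:

```python
import numpy as np

def hamming_match(desc1, desc2, ratio=0.8):
    """Brute-force 2-NN matching of binary descriptors (uint8 rows) with a
    nearest/second-nearest Hamming distance ratio test."""
    # popcount of XOR = Hamming distance between bit strings
    d = np.unpackbits(desc1[:, None, :] ^ desc2[None, :, :], axis=2).sum(2)
    matches = []
    for i, row in enumerate(d):
        nn = np.argsort(row)[:2]                  # two nearest neighbors
        if len(nn) > 1 and row[nn[0]] < ratio * row[nn[1]]:
            matches.append((i, int(nn[0])))       # keep unambiguous matches
    return matches
```

Because the distance is a XOR plus popcount, this matcher is far cheaper per comparison than the Euclidean distances needed for SURF descriptors, which is why binary pipelines are attractive on mobile hardware.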
• As the hardware of mobile devices evolves, acceleration methods are becoming popu-
lar in mobile applications. For example, modern mobile devices are equipped with
multi-core CPUs and GPUs, so acceleration methods used on desktops, such as
parallelization through the CPU and GPU, can also be applied to mobile devices.
• The tracking and pose estimation method in our AR system requires the object of
interest to stay in the camera's FOV through the entire process. However, this is not
always the case, particularly in outdoor applications where occlusions and loss of
view angle happen. A feature map that stores detected features, such as the ones
in PTAM [56] and MonoSLAM [31], can be created and combined with optical flow
algorithms [59] to address this issue. As the system moves around the scene, the
feature map dynamically updates itself; even when occlusions occur, a lost feature
can easily be retrieved from a different angle in the map.
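The core of the optical flow algorithms mentioned above can be sketched as a single Lucas-Kanade step. This is a minimal one-level, one-iteration version on a synthetic image pair; pyramids, per-feature windows and the feature map itself are omitted:

```python
import numpy as np

def lucas_kanade(I1, I2):
    """One Lucas-Kanade step: least-squares solve of  grad(I1) . d = I1 - I2
    over the whole window, giving the (row, col) displacement d."""
    Iy, Ix = np.gradient(I1)                       # spatial image gradients
    It = I1 - I2                                   # temporal difference
    A = np.stack([Iy.ravel(), Ix.ravel()], axis=1)
    d, *_ = np.linalg.lstsq(A, It.ravel(), rcond=None)
    return d                                       # (dy, dx)

# synthetic pair: a Gaussian blob shifted by (0.4, 0.4) pixels between frames
y, x = np.mgrid[0:41, 0:41].astype(float)
blob = lambda dy, dx: np.exp(-((y - 20 - dy) ** 2 + (x - 20 - dx) ** 2) / 8.0)
dy, dx = lucas_kanade(blob(0, 0), blob(0.4, 0.4))
```

The linearization is only valid for small displacements, which is why practical trackers iterate this step over an image pyramid.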
• Finally, we plan to compare our AR system with other markerless AR applications,
such as SfM-based [76, 90], SLAM-based [31], model-based [4, 27, 65] and hybrid
[59, 103] techniques, in terms of speed, flexibility, robustness and portability.
References
[1] A. Alahi, R. Ortiz, and P. Vandergheynst. FREAK: Fast retina keypoint. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 510–517, 2012.
[2] J. Arrasvuori, J. A. Holm, and A. J. Eronen. Personal augmented reality advertising, Feb. 4 2014. US Patent 8,644,842.
[3] B. Babenko, M.-H. Yang, and S. Belongie. Robust Object Tracking with Online Multiple Instance Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8):1619–1632, 2011.
[4] H. Bae, M. Golparvar-Fard, and J. White. High-precision vision-based mobile augmented reality system for context-aware architectural, engineering, construction and facility management (AEC/FM) applications. Visualization in Engineering, 1:3, 2013.
[5] I. Barandiaran, C. Paloc, and M. Grana. Real-time optical markerless tracking for augmented reality applications. Journal of Real-Time Image Processing, 5:129–138, 2010.
[6] A. Barry, J. Trout, P. Debenham, and G. Thomas. Augmented reality in a public space: The Natural History Museum, London. Computer, (7):42–47, 2012.
[7] M. Baudelet. Laser Spectroscopy for Sensing: Fundamentals, Techniques and Applications. Elsevier, 2014.
[8] H. Bay, A. Ess, T. Tuytelaars, and L. V. Gool. Speeded-up robust features (SURF). Computer vision and image . . . , 2008.
[9] J. L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975.
[10] S. T. Birchfield, B. Natarajan, and C. Tomasi. Correspondence as energy-based segmentation. Image and Vision Computing, 25(8):1329–1340, 2007.
[11] C. Botella, J. Breton-Lopez, S. Quero, R. M. Banos, A. Garcia-Palacios, I. Zaragoza, and M. Alcaniz. Treating cockroach phobia using a serious game on a mobile phone and augmented reality exposure: A single case study. Computers in Human Behavior, 27(1):217–227, 2011.
[12] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 23(11):1222–1239, 2001.
[13] G. Bradski. The OpenCV library (2000). Dr. Dobb's Journal of Software Tools, 2000.
[14] M. Brown and R. Szeliski. Multi-image feature matching using multi-scale oriented patches. (December 2004), 2008.
[15] M. Calonder, V. Lepetit, M. Ozuysal, T. Trzcinski, C. Strecha, and P. Fua. BRIEF: Computing a Local Binary Descriptor very Fast. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7):1281–1298, Nov. 2012.
[16] J. Canny. A Computational Approach to Edge Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-8(6), 1986.
[17] J. Carmigniani, B. Furht, M. Anisetti, P. Ceravolo, E. Damiani, and M. Ivkovic. Augmented reality technologies, systems and applications. Multimedia Tools and Applications, 51:341–377, 2011.
[18] K.-E. Chang, C.-T. Chang, H.-T. Hou, Y.-T. Sung, H.-L. Chao, and C.-M. Lee. Development and behavioral pattern analysis of a mobile guide system with augmented reality for painting appreciation instruction in an art museum. Computers & Education, 71:185–197, 2014.
[19] C.-Y. Chen, B. R. Chang, and P.-S. Huang. Multimedia augmented reality information system for museum guidance. Personal and Ubiquitous Computing, 18(2):315–322, 2014.
[20] S. Chen, Y. Li, and J. Zhang. Realtime structured light vision with the principle of unique color codes. In Robotics and Automation, 2007 IEEE International Conference on, pages 429–434. IEEE, 2007.
[21] S. Chen, Y. Li, and J. Zhang. Vision processing for realtime 3-D data acquisition based on coded structured light. Image Processing, IEEE Transactions on, 17(2):167–176, Feb 2008.
[22] W.-C. Chen, Y. Xiong, J. Gao, N. Gelfand, and R. Grzeszczuk. Efficient extraction of robust image features on mobile devices. In Proceedings of the 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, pages 1–2. IEEE Computer Society, 2007.
[23] K.-H. Cheng and C.-C. Tsai. Children and parents' reading of an augmented reality picture book: Analyses of behavioral patterns and cognitive attainment. Computers & Education, 72:302–312, 2014.
[24] R. Clouth. Mobile Augmented Reality as a Control Mode for Real-time Music Systems. mtg.upf.edu, 2013.
[25] W. Commons. 3D scanning. https://commons.wikimedia.org/wiki/Category:3D_scanning#/media/File:3-proj2cam.svg.
[26] A. Comport, E. Marchand, and F. Chaumette. A real-time tracker for markerless augmented reality. In Mixed and Augmented Reality, 2003. Proceedings. The Second IEEE and ACM International Symposium on, pages 36–45, Oct 2003.
[27] A. I. Comport, E. Marchand, M. Pressigout, and F. Chaumette. Real-time markerless tracking for augmented reality: the virtual visual servoing framework. Visualization and . . . , 12(4):615–628, 2006.
[28] Y. Cui, S. Schuon, D. Chan, S. Thrun, and C. Theobalt. 3D shape scanning with a time-of-flight camera. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 1173–1180. IEEE, 2010.
[29] B. Curless and M. Levoy. A volumetric method for building complex models from range images. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pages 303–312. ACM, 1996.
[30] A. J. Danker and A. Rosenfeld. Blob detection by relaxation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 3(1):79–92, 1981.
[31] A. Davison and I. Reid. MonoSLAM: Real-time single camera SLAM. Pattern Analysis and . . . , 2007.
[32] F. De Crescenzio, M. Fantini, F. Persiani, L. Di Stefano, P. Azzari, and S. Salti. Augmented reality for aircraft maintenance training and operations support. Computer Graphics and Applications, IEEE, 31(1):96–101, 2011.
[33] B. J. Dixon, M. J. Daly, H. Chan, A. D. Vescan, I. J. Witterick, and J. C. Irish. Surgeons blinded by enhanced navigation: the effect of augmented reality on attention. Surgical Endoscopy, 27(2):454–461, 2013.
[34] S. O. Elberink and G. Vosselman. Quality analysis on 3D building models reconstructed from airborne laser scanning data. ISPRS Journal of Photogrammetry and Remote Sensing, 66(2):157–165, 2011.
[35] G. M. Farinella, S. Battiato, and R. Cipolla. Advanced Topics in Computer Vision. Springer London, London, 2013.
[36] P. Fechteler and P. Eisert. Adaptive color classification for structured light systems. In Computer Vision and Pattern Recognition Workshops, 2008. CVPRW '08. IEEE Computer Society Conference on, pages 1–7, June 2008.
[37] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient belief propagation for early vision. International Journal of Computer Vision, 70:41–54, 2006.
[38] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
[39] B. Furht. Handbook of Augmented Reality. Springer Science & Business Media, 2011.
[40] X.-S. Gao, X.-R. Hou, J. Tang, and H.-F. Cheng. Complete solution classification for the perspective-three-point problem. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 25(8):930–943, 2003.
[41] J. Geng. Structured-light 3D surface imaging: a tutorial, Mar. 2011.
[42] M. Goesele, N. Snavely, B. Curless, H. Hoppe, and S. M. Seitz. Multi-view stereo for community photo collections. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8. IEEE, 2007.
[43] N. Haouchine, J. Dequidt, I. Peterlik, E. Kerrien, M.-O. Berger, and S. Cotin. Image-guided simulation of heterogeneous tissue deformation for augmented reality during hepatic surgery. In Mixed and Augmented Reality (ISMAR), 2013 IEEE International Symposium on, pages 199–208. IEEE, 2013.
[44] C. Harris and M. Stephens. A Combined Corner and Edge Detector. Proceedings of the Alvey Vision Conference 1988, pages 147–151, 1988.
[45] C. Hilditch. Comparison of thinning algorithms on a parallel processor. Image and Vision Computing, 1(3):115–132, 1983.
[46] B. Huang and Y. Tang. Fast 3D reconstruction using one-shot spatial structured light. In Systems, Man and Cybernetics (SMC), 2014 IEEE International Conference on, pages 531–536. IEEE, 2014.
[47] S. Ieiri, M. Uemura, K. Konishi, R. Souzaki, Y. Nagao, N. Tsutsumi, T. Akahoshi, K. Ohuchida, T. Ohdaira, M. Tomikawa, et al. Augmented reality navigation system for laparoscopic splenectomy in children based on preoperative CT image using optical tracking device. Pediatric Surgery International, 28(4):341–346, 2012.
[48] B. Jamali, A. Sadeghi-Niaraki, and R. Arasteh. Application of geospatial analysis and augmented reality visualization in indoor advertising. 2015.
[49] A. Kaehler and G. Bradski. Learning OpenCV. 2013.
[50] Z. Kalal, K. Mikolajczyk, and J. Matas. Tracking-Learning-Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7):1409–1422, July 2012.
[51] H. Kato and M. Billinghurst. Marker tracking and HMD calibration for a video-based augmented reality conferencing system. In Augmented Reality, 1999. (IWAR '99) Proceedings. 2nd IEEE and ACM International Workshop on, pages 85–94. IEEE, 1999.
[52] H. Kawasaki, R. Furukawa, R. Sagawa, and Y. Yagi. Dynamic scene shape reconstruction using a single structured light pattern. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8, June 2008.
[53] M. Kazhdan, M. Bolitho, and H. Hoppe. Poisson surface reconstruction. In Proceedings of the fourth Eurographics symposium on Geometry processing, 2006.
[54] J. Keil, M. Zollner, M. Becker, F. Wientapper, T. Engelke, and H. Wuest. The House of Olbrich, an augmented reality tour through architectural history. In Mixed and Augmented Reality-Arts, Media, and Humanities (ISMAR-AMH), 2011 IEEE International Symposium on, pages 15–18. IEEE, 2011.
[55] T. G. Kirner, F. M. V. Reis, and C. Kirner. Development of an interactive book with augmented reality for teaching and learning geometric shapes. In Information Systems and Technologies (CISTI), 2012 7th Iberian Conference on, pages 1–6. IEEE, 2012.
[56] G. Klein and D. Murray. Parallel tracking and mapping for small AR workspaces. In Mixed and Augmented Reality, 2007. ISMAR 2007. 6th IEEE and ACM International Symposium on, pages 225–234. IEEE, 2007.
[57] G. Klein and D. Murray. Parallel tracking and mapping on a camera phone. In Mixed and Augmented Reality, 2009. ISMAR 2009. 8th IEEE International Symposium on, pages 83–86. IEEE, 2009.
[58] R. J. Lapeer, S. J. Jeffrey, J. T. Dao, G. G. Garcia, M. Chen, S. M. Shickell, R. S. Rowland, and C. M. Philpott. Using a passive coordinate measurement arm for motion tracking of a rigid endoscope for augmented-reality image-guided surgery. The International Journal of Medical Robotics and Computer Assisted Surgery, 10(1):65–77, 2014.
[59] T. Lee and T. Hollerer. Multithreaded hybrid feature tracking for markerless augmented reality. IEEE Transactions on Visualization and Computer Graphics, 15(3):355–368, 2009.
[60] V. Lepetit, P. Lagger, and P. Fua. Randomized trees for real-time keypoint recognition. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2:775–781, 2005.
[61] V. Lepetit, F. Moreno-Noguer, and P. Fua. EPnP: An accurate O(n) solution to the PnP problem. International Journal of Computer Vision, 81(2):155–166, July 2009.
[62] S. Leutenegger, M. Chli, and R. Y. Siegwart. BRISK: Binary Robust Invariant Scalable Keypoints. Proceedings of the IEEE International Conference on Computer Vision, pages 2548–2555, 2011.
[63] M. Lhuillier and L. Quan. Match propagation for image-based modeling and rendering. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(8):1140–1146, 2002.
[64] M. Lhuillier and L. Quan. A quasi-dense approach to surface reconstruction from uncalibrated images. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(3):418–433, 2005.
[65] J. Lima, F. Simoes, L. Figueiredo, V. Teichrieb, J. Kelner, and I. Santos. Model based 3D tracking techniques for markerless augmented reality. In XI Symposium on Virtual and Augmented Reality, pages 37–47, 2009.
[66] J. Liu, J. M. White, and R. M. Summers. Automated detection of blob structures by Hessian analysis and object scale. Proceedings - International Conference on Image Processing, ICIP, pages 841–844, 2010.
[67] W. E. Lorensen and H. E. Cline. Marching cubes: A high resolution 3D surface construction algorithm. In ACM SIGGRAPH Computer Graphics, volume 21, pages 163–169. ACM, 1987.
[68] D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[69] S. P. McPherron, T. Gernat, and J. J. Hublin. Structured light scanning for high-resolution documentation of in situ archaeological finds. Journal of Archaeological Science, 36(1):19–24, 2009.
[70] A. Menozzi, B. Clipp, E. Wenger, J. Heinly, E. Dunn, H. Towles, J.-M. Frahm, and G. Welch. Development of vision-aided navigation for a wearable outdoor augmented reality system. In Position, Location and Navigation Symposium-PLANS 2014, 2014 IEEE/ION, pages 460–472. IEEE, 2014.
[71] P. Milgram, H. Takemura, A. Utsumi, and F. Kishino. Augmented reality: A class of displays on the reality-virtuality continuum. In Photonics for Industrial Applications, pages 282–292. International Society for Optics and Photonics, 1995.
[72] A. Ming and M. Huadong. A blob detector in color images. ACM International Conference on Image and Video Retrieval (CIVR), pages 364–370, 2007.
[73] H. P. Moravec. Obstacle avoidance and navigation in the real world by a seeing robot rover. Technical report, DTIC Document, 1980.
[74] D. Moreno and G. Taubin. Simple, accurate, and robust projector-camera calibration. In Proceedings - 2nd Joint 3DIM/3DPVT Conference: 3D Imaging, Modeling, Processing, Visualization and Transmission, 3DIMPVT 2012, pages 464–471, 2012.
[75] M. Muja and D. G. Lowe. Fast matching of binary features. Proceedings of the 2012 9th Conference on Computer and Robot Vision, CRV 2012, pages 404–410, 2012.
[76] R. Newcombe and A. Davison. Live dense reconstruction with a single moving camera. Computer Vision and Pattern . . . , 2010.
[77] T. Noll, A. Pagani, D. Stricker, T. Noell, and D. S. D. De. Markerless Camera Pose Estimation - An Overview. OASIcs-OpenAccess Series in Informatics, 2006:45–54, 2010.
[78] O. V. Olesen, R. R. Paulsen, L. Hojgaard, B. Roed, and R. Larsen. Motion tracking for medical imaging: a nonvisible structured light tracking approach. Medical Imaging, IEEE Transactions on, 31(1):79–87, 2012.
[79] H. Opower. Multiple view geometry in computer vision, volume 37. 2002.
[80] M. Ozuysal, P. Fua, and V. Lepetit. Fast keypoint recognition in ten lines of code. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2007.
[81] J. Pages, J. Salvi, and C. Matabosch. Robust segmentation and decoding of a grid pattern for structured light. In Pattern Recognition and Image Analysis, pages 689–696. Springer, 2003.
[82] S. C. Park and M. Chang. Reverse engineering with a structured light system. Computers & Industrial Engineering, 57(4):1377–1384, 2009.
[83] W. H. Press. Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press, 2007.
[84] L. Quan and Z. Lan. Linear n-point camera pose determination. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 21(8):774–780, 1999.
[85] G. Reitmayr and T. W. Drummond. Going out: Robust model-based tracking for outdoor augmented reality. Proceedings - ISMAR 2006: Fifth IEEE and ACM International Symposium on Mixed and Augmented Reality, pages 109–118, 2007.
[86] R. Roberto, D. Freitas, J. P. Lima, V. Teichrieb, and J. Kelner. ARBlocks: A concept for a dynamic blocks platform for educational activities. In Proceedings - 2011 13th Symposium on Virtual Reality, SVR 2011, pages 28–37, 2011.
[87] E. Rosten and T. Drummond. Machine learning for high-speed corner detection. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 3951 LNCS:430–443, 2006.
[88] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. ORB: An efficient alternative to SIFT or SURF. Proceedings of the IEEE International Conference on Computer Vision, pages 2564–2571, 2011.
[89] J. Salvi, S. Fernandez, T. Pribanic, and X. Llado. A state of the art in structured light patterns for surface profilometry. Pattern Recognition, 43(8):2666–2680, 2010.
[90] M. Salzmann and P. Fua. Deformable Surface 3D Reconstruction from Monocular Images, Sept. 2010.
[91] C. Schaeffer. A Comparison of Keypoint Descriptors in the Context of Pedestrian Detection: FREAK vs. SURF vs. BRISK. pages 1–5, 2012.
[92] D. Scharstein and R. Szeliski. Stereo matching with nonlinear diffusion. International Journal of Computer Vision, 28(2):155–174, 1998.
[93] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1-3):7–42, 2002.
[94] D. Scharstein and R. Szeliski. High-accuracy stereo depth maps using structured light. In Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, volume 1, pages I–195. IEEE, 2003.
[95] J. Shi and C. Tomasi. Good features to track. Computer Vision and Pattern Recognition, 1994. Proceedings CVPR '94., 1994 IEEE Computer Society Conference on, pages 593–600, 1994.
[96] G. Sun, C. Qiu, Q. Qi, K. Mu, and B. Wang. The implementation of a live interactive augmented reality game creative system. In Proceedings of the 2013 Fifth International Conference on Multimedia Information Networking and Security, pages 440–443. IEEE Computer Society, 2013.
[97] I. E. Sutherland. Three-dimensional data input by tablet. Proceedings of the IEEE, 62(4):453–461, 1974.
[98] R. Szeliski. Computer Vision: Algorithms and Applications. Computer, 5:832, 2010.
[99] V. Teichrieb, M. Lima, E. Lourenc, S. Bueno, J. Kelner, and I. H. F. Santos. A Survey of Online Monocular Markerless Augmented Reality. International Journal of Modeling and Simulation for the Petroleum Industry, 1(1):1–7, 2007.
[100] M. Torres, R. Jassel, and Y. Tang. Augmented reality using spatially multiplexed structured light. In Mechatronics and Machine Vision in Practice (M2VIP), 2012 19th International Conference, pages 385–390, Nov 2012.
[101] B. Triggs. Detecting Keypoints with Stable Position, Orientation and Scale under Illumination Changes. Framework, pages 100–113, 2004.
[102] R. Y. Tsai. A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. Robotics and Automation, IEEE Journal of, 3(4):323–344, 1987.
[103] A. Ufkes and M. Fiala. A markerless augmented reality system for mobile devices. In Proceedings - 2013 International Conference on Computer and Robot Vision, CRV 2013, pages 226–233, 2013.
[104] A. Ulusoy, F. Calakli, and G. Taubin. Robust one-shot 3D scanning using loopy belief propagation. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, pages 15–22, June 2010.
[105] T. van Aardenne-Ehrenfest and N. de Bruijn. Circuits and trees in oriented linear graphs. In Classic Papers in Combinatorics, pages 149–163. Springer, 1987.
[106] A. Varol, A. Shaji, M. Salzmann, and P. Fua. Monocular 3D reconstruction of locally textured surfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(6):1118–30, June 2012.
[107] A. Velten, T. Willwacher, O. Gupta, A. Veeraraghavan, M. G. Bawendi, and R. Raskar. Recovering three-dimensional shape around a corner using ultrafast time-of-flight imaging. Nature Communications, 3:745, 2012.
[108] L. Vincent and P. Soille. Watersheds in digital spaces: an efficient algorithm based on immersion simulations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(6):583–598, 1991.
[109] D. Wagner, G. Reitmayr, A. Mulloni, T. Drummond, and D. Schmalstieg. Real-time detection and tracking for augmented reality on mobile phones. Visualization and Computer Graphics, IEEE Transactions on, 16(3):355–368, 2010.
[110] C. Wang, N. Komodakis, and N. Paragios. Markov Random Field modeling, inference & learning in computer vision & image understanding: A survey. Computer Vision and Image Understanding, 117(11):1610–1627, Nov. 2013.
[111] Wikipedia. Retina Display. http://en.wikipedia.org/wiki/RetinaDisplay/.
[112] B. Williams, G. Klein, and I. Reid. Real-time SLAM relocalisation. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8. IEEE, 2007.
[113] Y. Xu and D. G. Aliaga. Modeling repetitive motions using structured light. Visualization and Computer Graphics, IEEE Transactions on, 16(4):676–689, 2010.
[114] N. Yabuki, K. Miyashita, and T. Fukuda. An invisible height evaluation system for building height regulation to preserve good landscapes using augmented reality. Automation in Construction, 20(3):228–235, 2011.
[115] H. Yang, G. Welch, and M. Pollefeys. Illumination insensitive model-based 3D object tracking and texture refinement. Proceedings - Third International Symposium on 3D Data Processing, Visualization, and Transmission, 3DPVT 2006, pages 869–876, 2007.
[116] H. Yuen, J. Princen, J. Illingworth, and J. Kittler. Comparative study of Hough Transform methods for circle finding. Image and Vision Computing, 8(1):71–77, 1990.
[117] L. Zhang, B. Curless, and S. M. Seitz. Rapid shape acquisition using color structured light and multi-pass dynamic programming. In 3D Data Processing Visualization and Transmission, 2002. Proceedings. First International Symposium on, pages 24–36. IEEE, 2002.
[118] Z. Zhang. A flexible new technique for camera calibration. Pattern Analysis and Machine Intelligence, IEEE . . . , 1998, 2000.
[119] F. Zheng, R. Schubert, and G. Weich. A general approach for closed-loop registration in AR. Proceedings - IEEE Virtual Reality, pages 47–50, 2013.
[120] F. Zhou, H. B.-L. Duh, and M. Billinghurst. Trends in augmented reality tracking, interaction and display: A review of ten years of ISMAR. Proceedings - 7th IEEE International Symposium on Mixed and Augmented Reality 2008, ISMAR 2008, pages 193–202, 2008.
Appendix A
Hamming-Distance-Based Amendment
1: for i = 0 → ℓ(P) − 1 do
2:   for d = 0 → 2 do
3:     Initialize
4:     for q = 0 → 3 do
5:       πv_dq = 0
6:     end for
7:   end for
8:   λ = i
9:   Assign a value to each element of Πv
10:   while χ(ϑ(vλ)) = 0 and Nbeast(vλ) ≠ NULL do
11:     πv_11 = ξv(vλ)
12:     if Nbwest(vλ) ≠ NULL then
13:       πv_10 = ξv(Nbwest(vλ))
14:     end if
15:     if Nbnorth(vλ) ≠ NULL then
16:       πv_01 = ξv(Nbnorth(vλ))
17:     end if
18:     if Nbeast(vλ) ≠ NULL then
19:       πv_12 = ξv(Nbeast(vλ))
20:     end if
21:     if Nbsouth(vλ) ≠ NULL then
22:       πv_21 = ξv(Nbsouth(vλ))
23:     end if
24:     if Nbnorth(Nbwest(vλ)) ≠ NULL then
25:       πv_00 = ξv(Nbnorth(Nbwest(vλ)))
26:     else if Nbwest(Nbnorth(vλ)) ≠ NULL then
27:       πv_00 = ξv(Nbwest(Nbnorth(vλ)))
28:     end if
29:     if Nbnorth(Nbeast(vλ)) ≠ NULL then
30:       πv_02 = ξv(Nbnorth(Nbeast(vλ)))
31:     else if Nbeast(Nbnorth(vλ)) ≠ NULL then
32:       πv_02 = ξv(Nbeast(Nbnorth(vλ)))
33:     end if
34:     if Nbsouth(Nbeast(vλ)) ≠ NULL then
35:       πv_22 = ξv(Nbsouth(Nbeast(vλ)))
36:     else if Nbeast(Nbsouth(vλ)) ≠ NULL then
37:       πv_22 = ξv(Nbeast(Nbsouth(vλ)))
38:     end if
39:     if Nbsouth(Nbwest(vλ)) ≠ NULL then
40:       πv_20 = ξv(Nbsouth(Nbwest(vλ)))
41:     else if Nbwest(Nbsouth(vλ)) ≠ NULL then
42:       πv_20 = ξv(Nbwest(Nbsouth(vλ)))
43:     end if
44:     if Nbeast(Nbeast(vλ)) ≠ NULL then
45:       πv_13 = ξv(Nbeast(Nbeast(vλ)))
46:     end if
47:     if Nbnorth(Nbeast(Nbeast(vλ))) ≠ NULL then
48:       πv_03 = ξv(Nbnorth(Nbeast(Nbeast(vλ))))
49:     else if Nbeast(Nbnorth(Nbeast(vλ))) ≠ NULL then
50:       πv_03 = ξv(Nbeast(Nbnorth(Nbeast(vλ))))
51:     end if
52:     if Nbsouth(Nbeast(Nbeast(vλ))) ≠ NULL then
53:       πv_23 = ξv(Nbsouth(Nbeast(Nbeast(vλ))))
54:     else if Nbeast(Nbsouth(Nbeast(vλ))) ≠ NULL then
55:       πv_23 = ξv(Nbeast(Nbsouth(Nbeast(vλ))))
56:     end if
57:     Πv = | πv_00  πv_01  πv_02  πv_03 |
             | πv_10  πv_11  πv_12  πv_13 |
             | πv_20  πv_21  πv_22  πv_23 |
58:     s*_j = arg min_{sj} H(Πv, sj)
59:     vλ = Nbeast(vλ)
60:     Create virtual neighbors with color labels if needed
61:     if Nbnorth(vλ) = NULL then
62:       rv_n = a*_1j
63:     end if
64:     if Nbsouth(vλ) = NULL then
65:       rv_s = a*_1j
66:     end if
67:     if Nbwest(vλ) = NULL then
68:       rv_w = a*_0j
69:     end if
70:     if Nbeast(vλ) = NULL then
71:       rv_e = a*_2j
72:     end if
73:   end while
74:   vλ = Nbeast(vλ)
75: end for
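The core of this amendment can be condensed into a loose sketch (an assumed simplification of the algorithm, not a transcription of it): the 3-by-4 window of observed color labels, with -1 marking missing neighbors, is compared against candidate codeword windows of the projected pattern, and the closest one under a wildcard-tolerant Hamming distance supplies the missing labels.

```python
def hamming(window, codeword):
    """Hamming distance between two 3x4 label grids, skipping missing (-1)
    observations so that absent neighbors never count as mismatches."""
    return sum(
        1
        for wrow, crow in zip(window, codeword)
        for w, c in zip(wrow, crow)
        if w != -1 and w != c
    )

def best_codeword(window, codewords):
    """Return the candidate codeword window minimizing the Hamming distance."""
    return min(codewords, key=lambda s: hamming(window, s))
```

Once the best codeword is found, its entries at the missing positions play the role of the "virtual neighbors with color labels" created in lines 60 to 72 above.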