A markerless augmented reality system using one-shot structured light
Rowan University
Rowan Digital Works
Theses and Dissertations
12-3-2015
A markerless augmented reality system using one-shot structured light
Bingyao Huang
Follow this and additional works at: https://rdw.rowan.edu/etd
Part of the Electrical and Computer Engineering Commons
Recommended Citation
Huang, Bingyao, "A markerless augmented reality system using one-shot structured light" (2015). Theses and Dissertations. 4. https://rdw.rowan.edu/etd/4
This Thesis is brought to you for free and open access by Rowan Digital Works. It has been accepted for inclusion in Theses and Dissertations by an authorized administrator of Rowan Digital Works. For more information, please contact [email protected].
A MARKERLESS AUGMENTED REALITY SYSTEM USING ONE-SHOT STRUCTURED LIGHT
by
Bingyao Huang
A Thesis
Submitted to the
Department of Electrical & Computer Engineering
College of Engineering
In partial fulfillment of the requirement
For the degree of
Master of Science in Electrical & Computer Engineering
at
Rowan University
June 30, 2015
Thesis Chair: Ying Tang
© 2015 Bingyao Huang
Acknowledgments
I would like to express my sincere gratitude to my advisor Dr. Ying Tang for the
continuous support of my Masters study and research, for her patience, motivation, enthu-
siasm, and immense knowledge. She taught me how to think critically, question thoughts
and express ideas. Her guidance helped me in all the time of research and writing of this
thesis. Besides the care in study and research, her generous support helped me pass through
the hard times when I was trying to adapt to a new life in this country. I could not have
imagined having a better advisor and mentor for my Masters study. I hope that one day I
would become as good an advisor to my students as Dr. Tang has been to me.
Besides my advisor, I would like to thank the rest of my thesis committee: Dr.
Haibin Ling and Dr. Ravi Ramachandran for their encouragement, insightful comments,
and practical advice.
My sincere thanks also goes to Mr. Karl Dyer for the generous help on the 3D-
printed camera mount and slider, without his help I would have not finished the system test
and data collection.
I am also grateful to my friend Christopher Franzwa, whose support and care helped me
overcome setbacks when I first came to this country. I greatly value his friendship.
I would also like to thank my family for the support they provided me throughout;
without your love, encouragement and editing assistance, I would not have finished
this thesis.
Abstract
Bingyao Huang
A MARKERLESS AUGMENTED REALITY SYSTEM USING ONE-SHOT STRUCTURED LIGHT
2014-2015
Ying Tang, Ph.D.
Master of Science in Electrical & Computer Engineering
Augmented reality (AR) is a technology that superimposes computer-generated 3D
and/or 2D information on the user's view of the surrounding environment in real time,
enhancing the user's perception of the real world. Regardless of the field in which an
application is deployed, or its primary purpose in the scene, many AR pipelines share
what can be distilled into two specific goals: first, range-finding the environment
(whether by knowing a depth, precise 3D coordinates, or a camera pose estimate), and
second, registration and tracking of the 3D environment, so that an environment moving
with respect to the camera can be followed. Both range-finding and tracking can be done
using a black-and-white fiducial marker (i.e., marker-based AR) or some known parameters
about the scene (i.e., markerless AR) in order to triangulate corresponding points. To
meet users' needs, range-finding, registration and tracking must satisfy certain standards
of speed, flexibility, robustness and portability. In the past few decades, AR has been
studied extensively and developed to be robust and fast enough for real-time applications.
However, most existing systems are limited to certain environments or require complicated
offline training. With the advancement of mobile technology, users expect AR to be more
flexible and portable, applicable in any uncertain environment. Based on these remarks,
this study focuses on markerless AR in mobile applications and proposes an AR system
using one-shot structured light (SL). The markerless AR system is validated in terms of
its real-time performance and ease of use in unknown scenes.
Table of Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Research Background and Objectives . . . . . . . . . . . . . . . . . . . . . . . . 1
Research Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Chapter 2: Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Augmented Reality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Display devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Camera pose estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Virtual contents registration and rendering . . . . . . . . . . . . . . . . . . . 8
Marker-based AR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Markerless-based AR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3D Object Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
System calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Acquiring 2D images from different views . . . . . . . . . . . . . . . . . . . 12
Finding correspondence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Registration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Surface construction and optimization . . . . . . . . . . . . . . . . . . . . . 12
Stereo Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Active range sensing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Structured Light . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Markerless Tracking and Pose Estimation . . . . . . . . . . . . . . . . . . . . . 18
Feature detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Feature description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Feature matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Pose estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Chapter 3: Fast 3D Reconstruction using One-Shot Spatial Structured Light . . . . . 32
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Overview of the Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3D Reconstruction Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Color-coded structured light . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Image processing for grid extraction . . . . . . . . . . . . . . . . . . . . . . 36
Color labeling and correction . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Neighbor completion using hamming distance . . . . . . . . . . . . . . . . . 39
Chapter 4: Markerless Tracking and Camera Pose Estimation . . . . . . . . . . . . . 42
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
System Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Feature Based Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Camera Tracking and Pose Estimation Using Perspective-n-Points . . . . . . . . 49
Chapter 5: Experiment and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3D Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Markerless Tracking and Pose Estimation . . . . . . . . . . . . . . . . . . . . . 55
Processing Speed Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Chapter 6: Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . 62
Discussion and Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Appendix A: Hamming-Distance-Based Amendment . . . . . . . . . . . . . . . . . . . 77
List of Figures
Figure Page
Figure 1. HMD displays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Figure 2. Marker-based AR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Figure 3. Block Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Figure 4. Range data scanning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Figure 5. Corner definition by Harris [44] . . . . . . . . . . . . . . . . . . . . . . . 22
Figure 6. Feature matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Figure 7. RANSAC used to filter matching outliers . . . . . . . . . . . . . . . . . . 31
Figure 8. System architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Figure 9. Color Grid Pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Figure 10. The proposed markerless AR system . . . . . . . . . . . . . . . . . . . . 42
Figure 11. Converting from object to camera coordinate systems . . . . . . . . . . . 44
Figure 12. System setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Figure 13. SL reconstruction process . . . . . . . . . . . . . . . . . . . . . . . . 54
Figure 14. Reconstructed point cloud and mesh . . . . . . . . . . . . . . . . . . . . 55
Figure 15. Camera tracking and pose estimation using PnP . . . . . . . . . . . . . . 57
Figure 16. Accuracy of the proposed markerless pose estimation . . . . . . . . . . . 59
Figure 17. Frame rate (fps) of the proposed method presented . . . . . . . . . . . . 60
Figure 18. Overall performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
List of Tables
Table Page
Table 1. Camera and projector intrinsic parameters . . . . . . . . . . . . . . . . . . 52
Table 2. Camera-projector pair rotation matrix . . . . . . . . . . . . . . . . . . . . 52
Table 3. Camera-projector pair translation vector . . . . . . . . . . . . . . . . . . . 54
Table 4. Accuracy of the markerless pose estimation . . . . . . . . . . . . . . . . . 58
Chapter 1
Introduction
Research Background and Objectives
We live in a three-dimensional spatial world that is perceived by our vision system, where
the objects we see are tangible and real. With the development of technology, ever more
digital information is brought to humans; however, this kind of intangible virtual
information is less intuitive to understand due to the limitations of the human vision
system and brain. Interfaces and tools have been invented and developed in an attempt to
bring us closer to this exponentially increasing digital information, one of which is
Augmented Reality (AR), which brings the real and the virtual (i.e., computer-generated
graphics/video) together. AR
has many applications in our life. Regardless of the field in which an application is
deployed, or its primary purpose in the scene, many AR pipelines share what can be
distilled into two specific goals: first, range-finding the environment (whether by
knowing a depth, precise 3D coordinates, or a camera pose estimate), and second,
registration and tracking of the 3D environment, so that an environment moving with
respect to the camera can be followed. To meet users' needs, the design of range-finding,
registration and tracking must consider speed, flexibility, robustness and portability.
In the past few decades, AR has been studied extensively and developed to be robust and
fast enough for real-time applications. But its flexibility and portability face a big
challenge, as most AR development is either limited to a certain environment or requires
complicated offline training. As mobile devices become popular, users expect a more
flexible and portable AR
that can be applied in any uncertain environment. Based on these remarks, this thesis
focuses on AR in mobile applications. In particular, we propose a 3D reconstruction
(range-finding) method using structured light (SL). We then apply this method to our
proposed AR system, which allows fast one-time range-finding, accurate real-time
tracking, and ease of use in arbitrary unknown scenes without any offline training or
add-on sensors.
Research Contribution
The goal of this research is to provide a proof of concept of a markerless AR system
that can run in real time (30 Hz) on mobile devices. In particular, the system attempts
to accomplish the range-finding component without using an a priori or comparative data
set of the scene, but rather a known projection of structured light (SL). By doing this,
range-finding can be accomplished for any unknown environment, requiring only one
instance of data regarding the scene. In addition, the proposed method can be easily
embedded on camera-equipped mobile devices (with a flash light). Our specific
contributions are as follows:
• We propose a flexible and portable AR system that can be used in any unknown
scene with no offline training.
• We propose an SL-based range-finding method to reconstruct exact 3D information
of the real world.
• We demonstrate that a mobile device with a camera and a flash light shining through
a film suffices to run our real-time markerless AR system.
Organization
This thesis is organized as follows. Chapter 2 describes the related work and techniques
in AR with a focus on markerless (mobile) AR. The general 3D reconstruction process
of our proposed AR system is presented in Chapter 3, followed by details of tracking,
camera pose estimation and graphics rendering in Chapter 4. Chapter 5 gives the testing
and evaluation of the proposed system in comparison with the ground truth. Finally, the
conclusions and future research directions are presented in Chapter 6.
Chapter 2
Literature Review
Augmented Reality
Augmented reality (AR) superimposes computer-generated 3-D graphics or 2-D informa-
tion on the user’s view of a surrounding environment in real-time, enhancing the user’s
perception of the real world. AR is both interactive and registered in 3D, and combines
real and virtual objects [17]. Milgram's Reality-Virtuality Continuum [71], defined by
Paul Milgram and Fumio Kishino, spans between the real environment and the virtual
environment, with Augmented Reality and Augmented Virtuality (AV) in between, where AR
is closer to the real world and AV is closer to a pure virtual environment.
Another definition, given by Azuma et al. [17], states that AR is not restricted to a
particular display technology such as a head-mounted display (HMD), nor limited to the
sense of sight; AR can potentially apply to all senses, augmenting smell, touch, hearing,
etc. Within this context, this thesis focuses only on computer vision-based AR, which
estimates the camera pose (location and orientation) using visual tracking and
view-geometry constraints without any add-on sensors other than a camera. 3D virtual
objects are then rendered from the same viewpoint from which the images of the real scene
are being taken by the tracking camera. In the rest of this thesis, the term AR refers to
computer vision-based Augmented Reality.
The goal of an AR application could be to complement existing information in the
scene, such as overlaying augmented text on an historic building [54, 114] or giving more
Figure 1. HMD display.
educational information in a museum [6, 18, 19] or a book [23, 55]. AR also finds
applications in image-guided surgery [33, 43, 58], reverse engineering [32], navigation
[47], advertising and commerce [2, 48]. Another purpose could be simply to draw attention
to the virtual data, as in entertainment games [11, 96]. Among all these applications,
three essential key factors account for a reliable and flexible AR system: display
devices on which virtual contents are shown to users; a tracking and camera pose
estimation mechanism such that an environment moving with respect to the camera can be
followed; and virtual content registration and rendering for proper display of the
augmented information on devices.
Display devices. The display technologies can be grouped into three categories:
head-mounted display (HMD), projective and handheld.
HMDs are display devices worn on the head or as part of a helmet that place images of a
virtual environment over the user's view of the real world (Fig. 1). An HMD can be
either video see-through or optical see-through, with a monocular or binocular display.
Video see-through HMDs provide users with a video view of both the real world and
virtual graphics. This technology requires at least one digital camera that captures the
real scene and passes it through the video display to users. Given that the real and virtual
contents are both processed as digital images, there are no synchronization issues.
Optical see-through HMDs employ an optical mirror that allows a view of the physical
world to pass through the lens while graphical overlay information is reflected into the
user's eyes, e.g., Google Glass. Since optical see-through displays combine the
physically viewed real scene with digitally displayed virtual contents, synchronizing
the two is a technical challenge.
Projective displays project virtual information directly onto a real object using video
projectors (for example, ARBlocks [86]), optical elements, holograms, radio-frequency
tags, or other tracking technologies. One potential advantage of projective displays is
that they can be used without a wearable display device. The downside is that the
necessary setup makes them inconvenient for outdoor use. Another shortcoming of this
type of display is that cameras and projectors are difficult to synchronize under
varying lighting conditions.
Handheld displays are typically mobile devices such as smart phones, tablets and watches.
These devices usually have an attached camera or are paired with other image acquisition
and processing devices to stream video to the display. Owing to recent developments in
mobile computing technology, handheld displays are highly suitable for mobile AR
applications due to their portability, mobility, ubiquity and low cost [24]. They use
video see-through techniques to superimpose virtual contents onto the real environment
and employ built-in sensors, such as digital compasses, accelerometers, gyroscopes,
barometers and/or GPS units, to enhance their six degree-of-freedom (DoF) tracking. With
the development of digital cameras, high-definition screens such as the Retina display
[111], and powerful CPUs, GPUs and RAM, smart phones and tablets are capable of real-time
AR applications with high-definition displays. However, their small displays limit
interactive 3D user experiences. A promising solution is to combine HMDs with handheld
devices, so that the HMD serves for image acquisition and display while the handheld
device serves as the computing and processing server.
Camera pose estimation. Precise geometric registration of virtual contents convinces
observers to accept the virtual content as part of the real environment rather than as a
separate or overlaid illusion of information. In most AR applications, this registration
requires a correct, real-time estimate of the user's viewpoint with respect to the
coordinate system of the virtual contents [39]. Namely, the camera pose with respect to
the real scene must be known for each frame to be rendered. The camera pose can be
estimated using an appropriate 6-DoF tracking system, where a 3D rigid transformation of
the camera localization in the scene is calculated, including 3D rotation (3-DoF) and 3D
translation (3-DoF) in the world coordinate system. This process usually involves 3D
reconstruction (or range-finding) and visual tracking. Given that the camera pose with
respect to the real scene changes over time, camera pose estimation must be performed
fast enough to support real-time tracking.
In sensor-based AR, optical (laser, IR), accelerometer, gyroscope, GPS, magnetic and
ultrasonic techniques are applied or combined to enable a reliable tracking mechanism.
In computer vision-based AR, however, the relative localization between the camera and
the real scene changes in real time, so the camera pose estimation algorithm must run
fast enough to meet real-time tracking requirements.
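The rigid transformation described above can be made concrete with a short sketch. This is not the thesis's code; the intrinsic matrix and pose values below are hypothetical, chosen only to illustrate how a 6-DoF pose [R|t] plus pinhole intrinsics K maps a 3D world point to pixel coordinates.

```python
# Minimal pose-projection sketch (illustrative values, not the thesis's system).
import numpy as np

def project_point(K, R, t, X_world):
    """Project a 3D world point into the image: x ~ K (R X + t)."""
    X_cam = R @ X_world + t          # world frame -> camera frame (rigid transform)
    x = K @ X_cam                    # apply pinhole intrinsics
    return x[:2] / x[2]              # perspective divide -> pixel coordinates

# Hypothetical intrinsics: focal length 800 px, principal point (320, 240).
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)                        # identity rotation (camera aligned with world)
t = np.zeros(3)                      # no translation
uv = project_point(K, R, t, np.array([0.0, 0.0, 2.0]))
print(uv)                            # a point on the optical axis lands at (320, 240)
```

Estimating the pose is the inverse problem: given several such 3D-2D correspondences, solve for the R and t that best explain the observed pixels.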
Virtual contents registration and rendering. This process consists of generating and
superimposing virtual contents using the estimated camera pose in real time. The virtual
contents may be modeled offline, for example characters in an AR game, or generated
online, as in the wearable outdoor navigation application [70].
Of the aforementioned three key factors, this thesis primarily focuses on tracking and
camera pose estimation in unknown circumstances. Range-finding and tracking can be done
using a fiducial marker, or using natural features or known parameters of the scene, in
order to triangulate corresponding points. The former method is referred to as
marker-based AR while the latter is known as markerless AR.
Marker-based AR. The use of fiducial markers, although important for providing a robust
tracking mechanism, presents several drawbacks. The major restriction is that markers
often cannot be attached to the objects that are to be augmented. While this problem
might be solved using two or more cameras (at least one for marker tracking and another
for image capture), doing so significantly reduces the overall tracking and registration
quality, and such an approach is not suitable for most mobile devices such as mobile
phones. Moreover, markers may place unwanted objects in the output video frames, which
can reduce visual aesthetics or get in the way of a desired view, particularly on mobile
devices where resolution and/or field of view is limited. Finally, applying markers at
all target locations to be augmented can be time-consuming in a large-scale scene, or
even impossible, for example in outdoor intangible scenarios. Thus, in a mobile scenario,
markerless tracking approaches that use natural features of the environment to be
augmented are preferable.
Figure 2. Marker-based AR example generated using ARtoolkit [51].
Markerless-based AR. Existing markerless AR applications extract pre-stored spatial
information from objects and then map invariant feature points to calculate the pose
[26, 65, 99]. Strong feature detectors are usually used to identify target points, and
the random sample consensus (RANSAC) [38] algorithm is then employed to validate the
matching of feature points. Once the features are matched, a perspective transformation
is used to estimate the camera's translation and rotation with respect to the object
from frame to frame. Markerless pose trackers rely mainly on natural feature points
(sometimes called keypoints or interest points) visible in the user's environment, which
makes real-time camera pose estimation challenging.
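The RANSAC validation step mentioned above can be sketched compactly. This is not the thesis's implementation: real systems hypothesize a homography or essential matrix from minimal samples, but the loop below uses a toy 2D-translation model (with synthetic matches) to show the hypothesize-and-verify structure.

```python
# Illustrative RANSAC sketch: separating inlier feature matches from outliers
# under a simple 2D translation model (a stand-in for a homography fit).
import numpy as np

def ransac_translation(src, dst, iters=200, thresh=2.0, seed=0):
    """src, dst: (N, 2) arrays of matched keypoint locations."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        i = rng.integers(len(src))                # minimal sample: one match
        shift = dst[i] - src[i]                   # hypothesize a translation
        err = np.linalg.norm(dst - (src + shift), axis=1)
        inliers = err < thresh                    # consensus set
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # refine on the consensus set
    shift = (dst[best_inliers] - src[best_inliers]).mean(axis=0)
    return shift, best_inliers

src = np.array([[0, 0], [10, 0], [0, 10], [5, 5], [3, 7]], float)
dst = src + np.array([4.0, -2.0])                 # true motion of the scene
dst[3] = [99.0, 99.0]                             # one gross mismatch
shift, inliers = ransac_translation(src, dst)
print(shift, inliers.sum())                       # recovers (4, -2); 4 inliers
```

The same loop scales to perspective models: only the minimal sample size and the model-fitting step change.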
A perusal of current literature provides a number of prominent markerless AR tech-
niques, which can be categorized into three types: structure from motion (SfM), Parallel
Tracking and Mapping (PTAM) and model-based systems.
SfM attempts to work on a completely unknown scene, using motion analysis and some known
camera parameters to calculate the camera pose [4, 31]. MonoSLAM [31], for example,
applies SfM to perform simultaneous localization and mapping (SLAM), where a
probabilistic 3D map is initialized using a planar target containing four known features
and then dynamically updated using an Extended Kalman Filter (EKF).
PTAM system [56] uses a stereo pair of images of a planar scene for initialization,
and bundle adjustment is then applied to determine and update the 3D locations of interest
points.
Model-based methods, considered robust against noise and efficient, make use of models
of the tracked objects, such as CAD models or 2D templates of items with distinguishable
features [120] (e.g., lines, circles, cylinders and spheres) [27, 119]. Once a
connection is made between the 2D image and the 3D world frame, the camera pose can be
calculated by projecting the feature coordinates in 3D to 2D image coordinates. An
optimal pose estimate is found by minimizing the distance between the projected 2D
coordinates and the corresponding observed 2D feature coordinates. [27] describes a
real-time 3D model-based tracking algorithm for a "video see-through" monocular vision
system; it introduces a tracking algorithm that uses implicit 3D information to predict
hidden movement of the object, improving robustness and performance. [115] proposes an
illumination-insensitive model-based 3D object tracking system, where a Kalman filter is
used to recursively refine the position and orientation. [85] applies a hybrid
(edge-based and texture-based) tracking system for outdoor augmented reality, where the
system operates at about 15-17 frames per second on a handheld device. Recently, [106]
showed that deformable surfaces such as paper sheets, t-shirts and mugs can also serve
as potential augmentable surfaces. However, there is still room to improve robustness
and minimize computational costs.
These methods accomplish tracking in parallel with range-finding. They either require an
initialization stage to build the 3D model of the target object or require additional
hardware for image processing to achieve real-time AR. For instance, [27] acquires the
3D model of the target object a priori, making it impossible to work in uncertain
environments; [4] uses a server for online image processing and 3D point cloud
reconstruction, with Wi-Fi communication to the mobile device. In short, all of these
methods need at least some amount of spatial environment information to be known, either
a priori or over a sequence of video frames, to accomplish the actual range-finding.
Several attempts have been made to improve current markerless AR techniques for mobile
applications, but performance is still insufficient. A modified version of SURF is
proposed in [22], but it is far from real-time on mobile devices. Wagner et al. propose
a feature detection and matching system for mobile AR [109], where a modified version of
SIFT and outlier rejection algorithms speed up the computation to approximately 26 FPS.
An AR application has been adapted to the iPhone [57] using a reduced version of PTAM;
however, the method shows severely reduced accuracy and low execution speed. [103]
proposes a markerless augmented reality system for mobile devices with only a monocular
camera; to enable correct 3D-2D matching, the EPnP [61] technique is used inside a
RANSAC loop, and inlier tracking uses optical flow.
3D Object Reconstruction
A prerequisite for estimating the camera pose is knowledge of the 3D coordinates of a
reference object. This can be obtained through 3D reconstruction (range-finding), the
process of capturing the real shape and appearance of an object. 3D reconstruction has
wide applications in medical and industrial fields, and has also been used extensively
to generate models for virtual and augmented reality. The process usually involves five
major steps:
System calibration. A 3D reconstruction system consists of at least one digital cam-
era. To properly register the point cloud, the transformation between image frame, camera
frame and world frame must be known, which is expressed as camera intrinsic and extrin-
sic parameters. The process of obtaining those camera intrinsic and extrinsic parameters is
called calibration.
Acquiring 2D images from different views. To completely reconstruct the surface
of the entire object, often 8-10 images of the object from different views are needed.
Finding correspondence. A common method to get range data is to compare the 2D
images and find the correspondences between each feature point through feature matching
under camera poses constraints.
Registration. Given the perspective transformation from each view to the object, the 3D
coordinates of a point are uniquely determined by its 2D correspondences. This is often
accomplished by triangulation. The range data from each view are then transformed and
combined to form a single point cloud in the 3D world coordinate system. This whole
process is called registration.
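The triangulation step can be illustrated with a linear (DLT) solve. This is a hedged sketch with made-up camera parameters, not the thesis's calibration: given two projection matrices and one matched pixel pair, the 3D point is the null-space solution of a small linear system.

```python
# DLT triangulation sketch: recover a 3D point from one 2D-2D correspondence
# seen by two calibrated views (toy rig with assumed parameters).
import numpy as np

def triangulate(P1, P2, x1, x2):
    """P1, P2: (3, 4) projection matrices; x1, x2: (u, v) pixel correspondences."""
    A = np.vstack([x1[0] * P1[2] - P1[0],     # each view contributes two
                   x1[1] * P1[2] - P1[1],     # linear constraints on X
                   x2[0] * P2[2] - P2[0],
                   x2[1] * P2[2] - P2[1]])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                                 # null-space vector of A
    return X[:3] / X[3]                        # dehomogenize

# Toy rig: identical intrinsics, second camera shifted 1 unit along x.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.2, -0.1, 4.0])
x1 = P1 @ np.append(X_true, 1); x1 = x1[:2] / x1[2]
x2 = P2 @ np.append(X_true, 1); x2 = x2[:2] / x2[2]
print(triangulate(P1, P2, x1, x2))             # recovers the 3D point
```

With noisy correspondences the same system is solved in a least-squares sense, which is exactly what the SVD provides.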
Surface construction and optimization. Once the point cloud is created, a connected mesh
can be produced by connecting the 3D points to form edges. There are many
well-established algorithms to do so, such as the marching cubes algorithm [67], Poisson
Surface Reconstruction [53], or the method in [46] that uses the neighborhood
relationships of existing points.
In the rest of this section, three main 3D reconstruction methods are reviewed: stereo
vision, time of flight and structured light.
Stereo Vision. Stereo vision system consists of two or more calibrated cam- eras
capturing digital images of objects from different views. The captured images share some
common object information, whose sparse correspondences or dense correspondences are
used to match the prominent parts, namely corners, edges, or patch of the object, for tri-
angulation. The former one is called feature-based matching, which first extracts a set
of potentially matchable image locations using either interest operators or edge detectors,
and then searches for corresponding locations in other images using patch-based metrics.
More recent work in this area has focused on the use of extracted, highly reliable features as
seeds to grow additional matches [63]. Similar approaches have also been extended to wide
baseline multi-view stereo problems and combined with 3D surface reconstruction [42,64].
While sparse matching is still occasionally used, most stereo vision today focuses on dense correspondence, since it is required for many applications such as image-based rendering or modeling. Such dense correspondence, often called block matching, consists of the following four steps: matching cost/score computation for each block pair, cost minimization (score optimization), disparity assignment according to the minimal-cost match, and disparity refinement. Usually the matching cost is the sum of squared differences (SSD) or the correlation between two blocks from two images (Fig. 3). To obtain better results, global optimization algorithms have been studied, such as probabilistic (mean-field) diffusion [92], expectation maximization (EM) [10], graph cuts [12], and loopy belief propagation [35, 37, 104, 110]. These algorithms view a digital image as a graph model and make explicit smoothness assumptions, namely that the disparity varies smoothly between adjacent pixels. The matching is then solved as a global optimization problem.
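As a concrete illustration, the four block-matching steps can be sketched as a winner-take-all SSD matcher. This is a minimal sketch, not a production implementation: it assumes rectified grayscale images stored as NumPy arrays, and it omits the disparity refinement step and any global optimization.

```python
import numpy as np

def ssd_block_match(left, right, block=5, max_disp=16):
    """Dense disparity by winner-take-all SSD block matching.

    Assumes rectified grayscale images (corresponding rows aligned),
    with a point at column x in the left image appearing at column
    x - d in the right image. No sub-pixel refinement.
    """
    h, w = left.shape
    r = block // 2
    disp = np.zeros((h, w), dtype=np.int32)
    for y in range(r, h - r):
        for x in range(r, w - r):
            patch = left[y - r:y + r + 1, x - r:x + r + 1].astype(np.float64)
            best_cost, best_d = np.inf, 0
            for d in range(min(max_disp, x - r) + 1):
                cand = right[y - r:y + r + 1,
                             x - d - r:x - d + r + 1].astype(np.float64)
                cost = np.sum((patch - cand) ** 2)  # matching cost (SSD)
                if cost < best_cost:                # cost minimization
                    best_cost, best_d = cost, d
            disp[y, x] = best_d                     # disparity assignment
    return disp
```

A global method would replace the per-pixel winner-take-all step with an energy minimization over the whole disparity image.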
Figure 3. Block matching of Tsukuba dataset. Tsukuba dataset obtained from [93].
Active range sensing. Stereo vision uses passive video cameras, so it is also called passive 3D reconstruction. When applied to a scene with poor lighting conditions or with textureless objects, for example a plaster bust, this method usually has difficulty finding and matching features, even with the epipolar constraint. Active 3D reconstruction, on the other hand, tackles this problem well. It uses a light source to add an artificial pattern or texture to objects, thus greatly improving the precision of feature matching. This active illumination technique has been successfully applied to 3D laser scanning [34] and structured light scanning [104] with high reliability.
One of the active methods, called time-of-flight, measures the time a light signal travels between the light sensor and objects of interest. The theory behind the method is simple. The illumination source is switched on for a very short time and sends out one or more light pulses; the resulting light reaches the scene and is reflected by the objects in the field of view. The sensor then captures the reflected light. Depending on the distance between the illumination source and the object, there is a delay, T, before the camera captures the reflected light, which is a function of the distance, D, and the speed of light, c:

T = 2D/c

Thus, the distance depends only on the delay and the constant c:

D = Tc/2

where T is the delay and D is the distance from the illumination source to the object.
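The conversion is a one-liner; the only assumption in the sketch below is that the delay T is given in seconds.

```python
C_LIGHT = 299_792_458.0  # speed of light in vacuum, m/s

def tof_distance(delay_s: float) -> float:
    """One-way distance from a round-trip time-of-flight delay: D = c*T/2."""
    return C_LIGHT * delay_s / 2.0
```

For example, a 10 ns round-trip delay corresponds to roughly 1.5 m of range, which shows why time-of-flight sensors need picosecond-scale timing precision for millimeter accuracy.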
A good example of such a system is LIDAR (LIght Detection And Ranging) [7]. Similar to RADAR (RAdio Detection And Ranging), it is a remote sensing method that uses a spatially narrow laser beam to measure ranges. To account for sensor noise, [28] describes an integrated method that combines 3D super-resolution with a probabilistic scan alignment approach.
Velten et al. [107] propose a time-of-flight 3D reconstruction method that uses diffusely reflected light; the authors claim it achieves sub-millimeter depth precision and centimeter lateral precision over a 40 cm × 40 cm × 40 cm hidden volume.
Another popular method uses a laser or light stripe sensor to sweep a plane of light across the scene or objects while observing it from an offset viewpoint [29]. A stripe is observed as the intersection between the light plane and the object surface. As the stripe sweeps across the object, it deforms according to the object's surface shape. It is then a simple matter of using optical triangulation to estimate the 3D locations of all the points seen in a particular stripe (Fig. 4).
Figure 4. Range data scanning. Figure obtained from [25].

Structured Light. The structured light 3D reconstruction method is known for its robustness against noise and its unambiguous correspondences. Instead of using a single video camera, structured light uses an active illumination sensor (e.g., a projector) to project encoded stripe patterns onto objects, creating synthetic texture. Structured light follows the same routine as a stereo system if the active illumination sensor is considered an inverse camera. With respect to the correspondence problem in stereo-vision-based reconstruction, the synthetic texture helps match textureless surfaces, thus producing better correspondences between the captured features.
Projecting known encoded patterns makes the correspondence between pixels unam-
biguous and is robust against occlusions [94]. This technique has been used to produce
large numbers of highly accurate registered multi-image stereo pairs and depth maps for
the purpose of evaluating stereo correspondence algorithms and learning depth map priors
and parameters [98].
3D reconstruction using structured light has been well studied in the past few decades due to its wide applications in reverse engineering [82], augmented reality [100], medical imaging [78] and archaeological finds [69]. In terms of codification strategy, structured light can be classified into four general types: discrete spatial multiplexing, time multiplexing, continuous frequency multiplexing and continuous spatial multiplexing [89]. In terms of projection, structured light can be sequential (multiple-shot) or single-shot. More details can be found in [41].
Most of the early work focuses on a temporal approach that requires multiple patterns to be projected consecutively onto stationary objects. Obviously, such a requirement makes temporal approaches unsuitable for mobile and real-time applications.
Recently, researchers have devoted considerable effort to speeding up the data acquisition process by designing techniques that need only a handful of input images or even a single one, so-called "one-shot" 3D image acquisition. For example, Zhang et al. [117] propose a multi-pass dynamic programming algorithm to solve the multiple-hypothesis code matching problem and successfully apply it to both one-shot and spacetime methods. Ali et al. [104] model a spatial structured light system using a probabilistic graphical formulation with epipolar, coplanarity and topological constraints. They then solve the correspondence problem by finding a maximum a posteriori estimate using loopy belief propagation. Kawasaki et al. [52] use a bicolor grid and local connectivity information to achieve dense shape reconstruction. Given that the proposed technique does not involve encoding positional information into multiple pixels or color space, it provides good results even when discontinuities and/or occlusions are present. Similarly, Chen et al. [21] present a specially coded vision system in which a uniquely color-encoded pattern ensures reconstruction efficiency using local neighborhood information. To improve color detection, Fechteler and Eisert [36] propose a color classification method where the classification is made adaptive to the characteristics of the captured image, so that distortion due to environmental illumination, color cross-talk, and reflectance is well compensated.
In spite of the aforementioned developments, there is still room to further improve one-shot methods, particularly in terms of speed. For example, the method reported in [104] takes 10 iterations to converge, which costs about 3 minutes to recover a thousand intersections. Similarly, the approach in [36] takes a minute to reconstruct an object with 126,544 triangles and 63,644 vertices.
Markerless Tracking and Pose Estimation
Given that tracking and pose estimation are vital to our project, this section provides an overview of the current development in those areas and explains in detail the key steps involved in creating an appropriate computer-vision-based markerless tracking system. Unlike most marker-based tracking methods, markerless tracking is designed for uncontrolled and unknown environments, where the tracking process works without adjusting the object or the environment to be tracked, for example by placing fiducial landmarks or references. The natural features of the scene, such as textures, blobs, corners, edges, lines and circles, are used to uniquely identify the objects. In general, this markerless tracking process involves four main mechanisms: feature detection, description, matching and pose estimation. The first three steps find strong features of the object and produce description/feature vectors of the target object. The target is then tracked over the frames by matching descriptors/feature vectors. Provided the tracking information and the camera's initial pose in a 3D coordinate system, the camera's new pose with respect to the object is determined. To achieve robust pose estimation, several conditions must hold:
• Robustness: the tracking must be robust against noise and able to recover from failure.
• Performance: the tracking must be fast enough to ensure AR is performed in real time.
• Set-up time: the initialization time must be as short as possible for a practical AR
application.
To address the problems above, many markerless tracking and pose estimation methods have been proposed, all of which can be categorized into two classes: recursive tracking [103] and tracking by detection [5]. Recursive techniques start the tracking process with an initial guess or a rough estimate, and then refine or update it over time. They are called recursive because the next estimate relies on the previous one; for example, feature matching is performed on two consecutive frames. In contrast, tracking-by-detection techniques can perform frame-by-frame computation independently of previous estimates. In this case, a priori information about the environment or the objects to be tracked is needed, for example a computer-generated 3D model of the object. The feature matching is only between the current frame and the initial frame. The advantage of recursive tracking is smoothness and robustness against sudden changes, but it may result in error accumulation over the entire tracking process. While tracking by detection is free from error accumulation [112], it is much more computationally expensive, given that a classifier [3, 50] or object 3D model training [27] is needed.
Feature detection. To allow for accurate pose determination, feature detection must meet several requirements [39]:

Fast computation time. It must be possible to calculate a rather large number of feature points and associated descriptors in real time in order to allow for pose estimation at an acceptable frame rate.
Robustness against lighting, viewing angles and noise. The set of feature points to be selected and their calculated descriptors shall not vary significantly under different lighting, noise and viewing angles. Some feature detection algorithms are highly sensitive to changes in lighting and to sensor noise, since they use image intensity as features. In most AR applications, the user's position and orientation are usually not fixed. Thus, any feature points that are restricted to only one distance or viewing angle will be useless for AR. All of these effects are quite common, particularly in outdoor environments.
Scale-invariant feature choices. The scale of the captured object changes as the user moves around the scene, especially in outdoor environments, due to perspective projection. Thus, scale invariance must be retained throughout the entire AR application, namely feature points visible from one distance should not disappear when getting closer or farther, allowing for continuous tracking.
Visual markerless pose trackers mainly rely on natural features in the user's environment. Those features include blob-like image regions that have higher or lower intensities than their surroundings [30, 66, 72], edges [16], corners that are intersections of edges [14, 44, 68, 73, 87, 95, 101], and lines and circles [116]. Research has shown that corner detectors are more efficient than blob or edge detectors and allow for more precise position determination.

The majority of feature detection algorithms work by computing a corner response function C across the image. Pixels which exceed an intensity-related threshold are then classified as corners [87].
Moravec [73] computes the sum of squared differences (SSD) between a patch around a candidate corner and patches shifted a small distance from the candidate in a number of directions. C is then the smallest SSD so obtained, ensuring that extracted corners are those locations which change maximally under translations.
Harris [44] builds a Hessian matrix of an approximated second derivative of the SSD as:

H = [ Ixx  Ixy ]
    [ Ixy  Iyy ]        (2.1)
Harris then defines the corner response to be
C = |H| − k(trace(H))2 (2.2)
where the eigenvalues of H are (λ1, λ2) and
|H| = λ1λ2 (2.3)
trace(H) = λ1 + λ2 (2.4)
A window with a score C greater than a threshold is considered a "corner".
Following Harris's framework, Brown and Szeliski [14] define C as the harmonic mean of the eigenvalues (λ1, λ2) of H:

C = |H| / trace(H) = λ1 λ2 / (λ1 + λ2)        (2.5)
In a similar fashion, Shi and Tomasi [95] assume affine image deformation and con-
clude that it is better to use the smallest eigenvalue of H as the corner strength function:
Figure 5. Corner definition by Harris [44]. Figure obtained from OpenCV documentation [13].
C = min(λ1, λ2) (2.6)
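All three corner scores above can be computed from the same 2×2 matrix H. The sketch below assumes H has already been accumulated from image gradients over a window; the Harris constant k = 0.04 is a commonly used value, not one specified in the text.

```python
import numpy as np

def corner_responses(H, k=0.04):
    """Given the 2x2 matrix H = [[Ixx, Ixy], [Ixy, Iyy]] of Eq. (2.1),
    return the Harris (2.2), harmonic-mean (2.5), and Shi-Tomasi (2.6)
    corner response values."""
    det = H[0, 0] * H[1, 1] - H[0, 1] * H[1, 0]   # |H| = lambda1 * lambda2
    tr = H[0, 0] + H[1, 1]                        # trace = lambda1 + lambda2
    harris = det - k * tr ** 2                    # Eq. (2.2)
    harmonic = det / tr if tr != 0 else 0.0       # Eq. (2.5)
    shi_tomasi = np.linalg.eigvalsh(H).min()      # Eq. (2.6)
    return harris, harmonic, shi_tomasi
```

A detector would evaluate one of these scores at every pixel and keep local maxima above a threshold.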
Lowe [68] builds an image pyramid with different scales and resolutions. The scale invariance is obtained by convolving the image pyramid with a Difference of Gaussians (DoG) kernel, retaining locations which are optima in scale as well as in space.
Feature description. Once feature points are detected and extracted, how can a computer tell which feature point in one image matches a feature point in another image? Feature vectors, or descriptors, are designed to describe the attributes of a feature point by taking structural, neighborhood and region information into consideration. A feature descriptor must contain the necessary information to distinguish two different features. For a markerless AR system, feature descriptors should be scale and orientation invariant, not only allowing for fast calculation but also describing the interest point of concern as uniquely as possible in order to reduce ambiguity and mismatches.
Several different feature descriptors have been proposed in the last decade, such as SIFT [68], SURF [8] (an improved version of SIFT), and the binary descriptors BRIEF [15], ORB [88], BRISK [62], and FREAK [1].
The feature vector of SURF is almost identical to that of SIFT. SIFT describes the features using the local gradient values and orientations of pixels around the keypoint. To create a keypoint descriptor, the gradient magnitude and orientation at each image sample point in a region around the keypoint location are first calculated. These samples are then accumulated into orientation histograms summarizing the contents of 4×4 subregions. For each subregion, the length of each arrow is assigned the sum of the gradient magnitudes in or near that direction. A SIFT descriptor is thus a 128-element vector of summed gradient magnitudes. The high dimensionality makes it difficult to use the SIFT descriptor in real time, while SURF improves on SIFT by using a box filter approximation to the Gaussian derivative kernel. This convolution is sped up further using integral images to reduce the time spent on this step [91]. As described in [8], an orientation is first assigned to the keypoint when constructing a SURF descriptor. Then a square region is constructed around the keypoint and rotated according to the orientation. Similar to SIFT, the region is split up regularly into smaller 4×4 square subregions. For each subregion, a few simple features are computed at 5×5 regularly spaced sample points. The horizontal and vertical Haar wavelet responses dx and dy are computed and summed in each subregion, forming a first set of entries to the feature vector. The absolute values of the responses |dx| and |dy| are also summed, and together the four sums form a four-dimensional descriptor per subregion. Over all 4×4 subregions, this results in a vector of length 64.
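The per-subregion bookkeeping can be sketched as follows. This is a simplified illustration of the SURF descriptor layout only: it assumes the Haar responses dx and dy have already been computed on a 20×20 grid of sample points (4×4 subregions of 5×5 samples each), and the final unit-length normalization is common practice assumed here rather than a step quoted from the text.

```python
import numpy as np

def surf_like_descriptor(dx, dy):
    """Assemble a 64-D SURF-style vector from 20x20 grids of Haar
    responses: for each of the 4x4 subregions (5x5 samples each),
    stack (sum dx, sum dy, sum |dx|, sum |dy|)."""
    assert dx.shape == dy.shape == (20, 20)
    desc = []
    for i in range(4):
        for j in range(4):
            sx = dx[5 * i:5 * i + 5, 5 * j:5 * j + 5]
            sy = dy[5 * i:5 * i + 5, 5 * j:5 * j + 5]
            desc += [sx.sum(), sy.sum(),
                     np.abs(sx).sum(), np.abs(sy).sum()]
    v = np.array(desc)                 # 16 subregions x 4 entries = 64
    return v / np.linalg.norm(v)       # unit length (contrast invariance)
```

The 64-element result is what makes SURF roughly half the size of SIFT's 128-element histogram vector.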
Figure 6. Feature matching. Figure obtained from OpenCV documentation [13].
BRISK is a 512-bit binary descriptor that computes the weighted Gaussian average over a selected pattern of points near the keypoint. FREAK (Fast Retina Keypoint) is an improved version of BRISK and is also a binary descriptor. Inspired by the retinal pattern of the human eye, it samples points using weighted Gaussians in a region around the keypoint. Comparisons among these feature descriptors in terms of rotation, scaling, change of viewpoint, brightness shift and Gaussian blur are presented in [1].
Feature matching. Once a set of robust features have been detected and described,
they will be matched by comparing the distances between descriptors over the images
(Fig.6).
Feature matching is one of the most critical tasks in the entire tracking pipeline. A large number of features may describe the points better; however, the high number of potential feature correspondences results in large computational effort, slowing down the tracking process significantly. Additionally, wrong feature pairs (so-called outliers) would result in wrong pose estimates.
Feature matching by image patches (also called template or block matching) is the easiest way to find valid feature correspondences between images. Image patch matching first selects a patch of constant window size (e.g., 8×8 pixels) around a feature and then compares this patch with all patches in other images. The goal is to find the best matching patch, i.e., the one with the smallest distance under the sum of squared differences (SSD) or the sum of absolute differences (SAD). Image patches are very error-prone with respect to changing lighting conditions or viewing angles, and they are not scale or orientation invariant. Thus, image patches can only be used for recursive tracking, where error accumulation is the drawback if no error-checking mechanism is used.
To produce a reliable tracking result, scale- and orientation-invariant features such as SIFT, SURF, FAST, BRISK and FREAK are usually used. Those features can be matched by finding the two descriptors with, e.g., the smallest summed square distance, or the smallest summed Hamming distance (SHD) if the descriptor is binary, as with BRISK and FREAK. Further, the SSD or SHD should lie below a specified threshold to ensure a reliable matching result. The threshold has to be defined carefully to avoid rejecting valid feature correspondences on the one hand, and accepting too many wrong feature pairs on the other. Each type of feature detector needs its individual threshold to achieve an adequate ratio between accepted and rejected feature correspondences.
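A minimal brute-force matcher with a summed-Hamming-distance threshold might look like the sketch below. The descriptor layout (each descriptor as a row of uint8 bytes, the way OpenCV stores binary descriptors) and the threshold value are illustrative assumptions.

```python
import numpy as np

def hamming(a, b):
    """Summed Hamming distance between two binary descriptors
    stored as uint8 arrays (number of differing bits)."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

def match_binary(desc_a, desc_b, max_dist=40):
    """Brute-force nearest-neighbour matching with a distance
    threshold; returns (index_a, index_b, distance) triples."""
    matches = []
    for i, da in enumerate(desc_a):
        dists = [hamming(da, db) for db in desc_b]
        j = int(np.argmin(dists))
        if dists[j] <= max_dist:      # reject weak matches
            matches.append((i, j, dists[j]))
    return matches
```

The threshold check is exactly the per-detector tuning knob discussed above: lowering `max_dist` rejects more wrong pairs at the cost of losing some valid correspondences.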
If no information, e.g. from previous tracking iterations, can be used to reduce the number of potential feature candidates, an initial mapping can be extracted by a brute-force search. However, brute-force search is computationally intensive, with O(nm) complexity for n image features and m map features, and thus should be applied as rarely as possible. The Fast Library for Approximate Nearest Neighbors (FLANN) usually performs matching faster than brute force, even with a large database of images. This method, as detailed in [75], performs a parallel search of randomized hierarchical trees to achieve approximate matching of binary features. As discussed in [75], the hierarchical clustering tree gives a significant speedup over linear search for both binary and float descriptors.
To speed up the search for the best matching descriptors, some research uses a kd-tree [9] or applies a feature culling mechanism [39]. For example, SURF or SIFT features can be uniquely categorized into two different kinds of blob-like features [39]: dark features on a brighter background or bright features on a darker background. Thus, the matching process can be sped up by skipping the vector distance calculation whenever two features belong to different categories.
Pose estimation. Camera pose refers to the camera's location (X, Y, Z) and orientation (θx, θy, θz) in the world coordinate system, where (θx, θy, θz) are rotations around the X, Y and Z axes, respectively. Given a set of an object's 3D points O0 = (P1, P2, P3, ..., Pi) in the world coordinate system, pose estimation means finding the object's 3D pose [R|t] from a set of 2D point projections oj = (p1, p2, p3, ..., pi) at the jth frame. Note that the projection relationship defines the correspondences C = Oj ⟺ oj, that is, matches of the object's 3D points Oj detected on the camera frame oj. With a sufficient number of correspondences, the camera pose can be determined uniquely by:

oj = M Oj, (Oj = O0)        (2.7)

M = A[Rj|tj]        (2.8)
where A is the camera intrinsic matrix, and Rj and tj are the rotation matrix and translation vector that bring the camera to its current pose at the jth frame. The definitions of R and t are given in Eqs. (2.9)-(2.13).
R = RZ RY RX =

    [ r11  r12  r13 ]
    [ r21  r22  r23 ]        (2.9)
    [ r31  r32  r33 ]

RX =

    [ 1      0          0      ]
    [ 0   cos(θx)   −sin(θx)   ]        (2.10)
    [ 0   sin(θx)    cos(θx)   ]

RY =

    [  cos(θy)   0   sin(θy) ]
    [    0       1     0     ]        (2.11)
    [ −sin(θy)   0   cos(θy) ]

RZ =

    [ cos(θz)   −sin(θz)   0 ]
    [ sin(θz)    cos(θz)   0 ]        (2.12)
    [   0          0       1 ]

t =

    [ tx ]
    [ ty ]        (2.13)
    [ tz ]
The most general version of the pose estimation problem requires estimating the six degrees of freedom of the extrinsic parameters [R|t] and five intrinsic parameters in A: focal length, principal point, aspect ratio and skew. It can be solved with a minimum of 6 correspondences using the Direct Linear Transform (DLT) algorithm [97]. There are, though, several simplifications of the problem, which lead to an extensive list of different algorithms improving on the accuracy of the DLT.
In practical AR systems, several different markerless pose estimation algorithms are implemented [40, 61, 84, 97]; a comparison among them is given by [77]:
Direct Linear Transformation (DLT). These algorithms estimate the camera pose [R|t] directly, ignoring certain restrictions on the solution space.
Perspective-n-Point (PnP). The most common simplification of the DLT is the Perspective-n-Point (PnP) problem [79, 98]: given a set of correspondences C = Oj ⟺ oj between 3D points and their 2D projections, we seek to retrieve the pose (R and t) of the camera with respect to the world and the focal length F. At least n points are needed, and the camera intrinsic matrix A is usually known a priori.
A priori information estimators. Before pose estimation, some information regarding the camera pose is often available. The algorithms belonging to this class use the prior information as an initial estimate to provide more reliable and faster results.
Regardless of restrictions or assumptions, almost all algorithms share a similar routine
of the direct linear transform (DLT), which recovers a 3×4 camera matrix M; recall (2.7):

    [ M11  M12  M13  M14 ]
M = [ M21  M22  M23  M24 ] = A[R|t] =
    [ M31  M32  M33  M34 ]

    [ fx   0   cx ] [ r11  r12  r13  t1 ]
    [ 0   fy   cy ] [ r21  r22  r23  t2 ]        (2.14)
    [ 0    0    1 ] [ r31  r32  r33  t3 ]
From the perspective projection equation (2.7):

ui = (M11 Xi + M12 Yi + M13 Zi + M14) / (M31 Xi + M32 Yi + M33 Zi + M34)        (2.15)

vi = (M21 Xi + M22 Yi + M23 Zi + M24) / (M31 Xi + M32 Yi + M33 Zi + M34)        (2.16)
where pi = (ui, vi) is the projected 2D point of the object point Pi = (Xi, Yi, Zi). This system of equations can be solved linearly for the unknowns in the camera matrix M by multiplying both sides of each equation by the denominator. The resulting algorithm is called the direct linear transform (DLT) and is commonly attributed to [97]. In order to compute the 12 (or 11) unknowns in M, at least six correspondences between 3D and 2D locations must be known.
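A minimal DLT sketch follows: each correspondence contributes the two linear equations obtained by multiplying out the denominators of (2.15)-(2.16), and the stacked homogeneous system is solved for M up to scale via SVD. The function name and test values are illustrative, not taken from the thesis implementation.

```python
import numpy as np

def dlt_camera_matrix(points3d, points2d):
    """Direct linear transform: recover the 3x4 camera matrix M from
    >= 6 3D-2D correspondences. Each pair (X,Y,Z) <-> (u,v) yields two
    rows of a homogeneous system A m = 0; the least-squares solution
    (up to scale) is the right singular vector of the smallest
    singular value."""
    rows = []
    for (X, Y, Z), (u, v) in zip(points3d, points2d):
        rows.append([X, Y, Z, 1, 0, 0, 0, 0, -u*X, -u*Y, -u*Z, -u])
        rows.append([0, 0, 0, 0, X, Y, Z, 1, -v*X, -v*Y, -v*Z, -v])
    _, _, vt = np.linalg.svd(np.array(rows, float))
    return vt[-1].reshape(3, 4)   # null-space vector = entries of M
```

With noise-free correspondences the recovered M reprojects the input points exactly (up to the arbitrary overall scale of a homogeneous matrix).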
More accurate results for the entries in M can be obtained by directly minimizing Equations (2.15) and (2.16) using non-linear least squares with a small number of iterations. Once the entries in M have been recovered, it is possible to recover both the intrinsic calibration matrix A and the rigid transformation (R, t) using RQ factorization.
In the case where the camera is already calibrated, i.e., the matrix A is known, pose estimation can be performed with as few as 3 points; this method, known as P3P [40], gives noisy results. P3P provides up to four different poses due to geometric ambiguities, so further feature correspondences are needed to determine the unique pose. Several approaches have been proposed to solve the P4P, P5P or general PnP problems. The PnP problem can also be solved by an iterative method [102] or by the non-iterative efficient algorithm EPnP [61] with complexity O(n) (where n ≥ 4). The key to achieving such linear complexity is to represent each Pi as a weighted sum of m ≤ 4 control points (ω1 ... ωm) and to perform all further computation only on these points. Although more feature correspondences yield a more accurate pose, a set of detected feature correspondences usually contains invalid feature pairs due to mismatching, leading to inaccurate or even invalid pose calculations. Thus, invalid feature correspondences must be explicitly rejected. The well-known RANSAC [38] algorithm can be used to determine the subset of valid correspondences from a large set of feature pairs with outliers. RANSAC randomly chooses a few feature pairs to create a rough pose using an algorithm such as P3P, and tests this pose geometrically against all of the remaining pairs. Several iterations are then applied using different permutations of feature pairs, resulting in the pose for which most feature correspondences can be confirmed (Fig. 7). However, RANSAC does not optimize the final pose over all valid feature correspondences. As [77] explains, the quality of RANSAC degrades with an increasing ratio of outliers; when the outlier ratio reaches 60%, it is not possible to estimate a robust camera pose within a reasonable time frame. Therefore, the pose should be refined to reduce the overall error of all applied feature correspondences. One refinement method is to optimize the pose with respect to the total reprojection error. Commonly, several iterations of a nonlinear least squares algorithm, e.g. the Levenberg-Marquardt (LM) algorithm [83], are sufficient to converge to the final pose. Further, different feature correspondences may be weighted, e.g. according to their expected position accuracy, to increase the pose quality. For example, an M-estimator [83] can be applied to automatically suppress the impact of outliers or inaccurate feature positions.
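The hypothesize-and-verify loop of RANSAC can be sketched generically. To keep the example self-contained, a 2D line fit stands in for the P3P pose model; the structure (minimal random sample, geometric inlier test, final least-squares refinement in place of LM) is the same, and all names here are illustrative.

```python
import numpy as np

def ransac_line(points, iters=200, thresh=0.05, rng=None):
    """RANSAC skeleton: repeatedly fit a model to a minimal random
    sample and keep the hypothesis with the most inliers. A line
    y = a*x + b (minimal sample: 2 points) stands in for the pose
    model for brevity."""
    if rng is None:
        rng = np.random.default_rng(0)
    pts = np.asarray(points, float)
    best_inliers = np.zeros(len(pts), bool)
    for _ in range(iters):
        i, j = rng.choice(len(pts), size=2, replace=False)
        if pts[i, 0] == pts[j, 0]:
            continue                        # degenerate minimal sample
        a = (pts[j, 1] - pts[i, 1]) / (pts[j, 0] - pts[i, 0])
        b = pts[i, 1] - a * pts[i, 0]
        resid = np.abs(pts[:, 1] - (a * pts[:, 0] + b))
        inliers = resid < thresh            # geometric verification
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # final least-squares refinement over all inliers (the role LM
    # plays for a camera pose)
    a, b = np.polyfit(pts[best_inliers, 0], pts[best_inliers, 1], 1)
    return a, b, best_inliers
```

In the pose case the minimal sample would be the 3 pairs P3P needs, and the residual would be the reprojection error of Eq. (2.15)-(2.16).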
Figure 7. RANSAC used to filter matching outliers. The colored correspondences are computed using the SIFT feature detector, providing many good matches along with a small number of outliers. Left: all colored correspondences are used, including the outliers. Right: RANSAC is used to find the set of inliers (green correspondences). Figure obtained from OpenCV documentation [13].
Chapter 3
Fast 3D Reconstruction Using One-Shot Spatial Structured Light
Introduction
This chapter presents a fast 3D reconstruction method based on one-shot spatially multiplexed structured light. In particular, a De Bruijn sequence [105] is used to encode a color grid. To improve the resolution and field of view (FOV) under larger dynamic scenes, we extend our prior work [100] to use 8 colors instead of 6 [81]. After adding the two colors, the resolution increases from 841 points to 4356 points, but this also complicates color detection. To compensate, we propose a Hamming distance based method to improve robustness. With only a single shot of a color grid, the depth information of a target object is extracted in real time. Given our focus on mobile applications with dynamic scenes, some detail, such as sub-pixel accuracy, is no longer our concern. Instead, aiming at a balance of efficiency and accuracy, we strive to improve the speed of the proposed method while a decrease in accuracy can be tolerated by the system. In our experiments, the entire process from projecting a pattern to triangulating for mesh reconstruction takes less than 1 second.
The rest of the chapter is organized as follows: the next section overviews our proposed approach, followed by a section presenting its details.
Overview of the Approach
The proposed structured-light method consists of a color camera and a projector, both of
which are calibrated with respect to a world coordinate system and have their intrinsic and
extrinsic parameters known. When the projector projects a color-coded light pattern onto a 3D object, the camera captures its image immediately. The 3D information of the object is then obtained by analyzing the deformation of the observed light pattern with respect to the projected one and identifying their correspondence.

Figure 8. System architecture
Matching pairs of light points and their images presents several challenges. First, with the aim of projecting only one light pattern for real-time applications, there is a tradeoff between reliability and accuracy in selecting proper color stripes in terms of the number of colors, length of codeword, and dimensions. In our method, a two-dimensional 8-color De Bruijn spatial grid is used to offer a confident solution, as explained in Section III.A. The more colors used, the higher the density and resolution, but also the greater the chance of color misdetection. To overcome this drawback, our method uses a two-level color label correction method. The grid pattern preserves spatial neighborhood information for all captured grid points. For instance, if two points are connected vertically/horizontally on the captured image, their correspondences must follow certain topological constraints (e.g., the same vertical/horizontal color). Those constraints are then used to correct color detection errors, as detailed in Section III.B. When dealing with more complex scenes that involve shadows, occlusion, and discontinuities, observed geometric constraints can be misleading [104]. Thus, a De Bruijn-based Hamming distance minimization method is established for further correction.
3D Reconstruction Process
As illustrated in Fig. 8, the reconstruction process starts with projecting, capturing, and image processing in order to retrieve the grid overlain on the object. The recovered stripe substrate forms a 2D intersection network used to solve the correspondence problem. Without loss of generality, the structured light codification is presented first, followed by a brief explanation of the image processing used to obtain a 2D colored intersection network. The detailed methods to remedy color detection errors and incomplete neighborhood information due to occlusion are then described.
Color-coded structured light. Our projected grid pattern is composed of vertical and horizontal colored slits following a De Bruijn sequence. A De Bruijn sequence of order n over an alphabet of k color symbols is a cyclic sequence of length k^n with a so-called window property: each subsequence of length n appears exactly once [105]. When such a sequence is applied to both vertical and horizontal stripes, it forms an m × m grid, where m = k^n + 2, and a unique k-color horizontal sequence overlain atop a unique k-color vertical sequence occurs only once in the grid. Let C be a set of color primitives, C = {c_l | l = 1, 2, 3, ..., 8}, where the numbers represent different colors. Each crosspoint in the grid matrix, v_i, is the intersection of a vertical stripe and a horizontal stripe, with a unique coordinate (x_i, y_i) ∈ Ω, where Ω is the camera's image plane, and colors (c_i^h, c_i^v), where c_i^h, c_i^v ∈ C represent the horizontal and vertical colors of v_i, respectively.

Therefore, the color-coded grid network can be interpreted as an undirected graph G = (V, E, P, ξ, ϑ, χ, τ), where
• V = {v_1, v_2, ..., v_{m×m}} represents the grid intersections.

• e_ij ∈ E represents a link between two nodes i and j in the grid.

• P is a vector of the 2D coordinates of the nodes in the camera's image plane, where p_j = (x_j, y_j), (x_j, y_j) ∈ Ω. ℓ(P) denotes the length of P.

• ξ: V → C assigns the color labels associated with a node, where ξ(v_i) = (c_i^h, c_i^v); ξ^h(v_i) = c_i^h, ξ^v(v_i) = c_i^v.

• ϑ: V → {ϑ(v_1), ϑ(v_2), ..., ϑ(v_{m×m})}, where ϑ(v_i) = ∪_j Nb_j(v_i); j ∈ {north, south, west, east}; Nb_j(v_i) ∈ V represents the neighbors of v_i. When the jth neighbor of v_i does not exist, Nb_j(v_i) is assigned the value NULL.

• χ: ϑ → {0, 1} is a decision function with a binary output:
∀j ∈ {north, south, west, east}, ∃Nb_j(v_i) ∈ ϑ(v_i) = NULL ⇒ χ(ϑ(v_i)) = 0; else χ(ϑ(v_i)) = 1.

• τ: V → P is a function for getting the 2D coordinates of v_i, where τ(v_i) = p_i.
In our implementation, C = 1, 2, ..., 8, representing Red, Yellow, Lime, Green, Cyan,
Blue, Purple and Magenta. The colors used for the horizontal stripes are Red, Lime, Cyan
and Purple, while Yellow, Green, Blue and Magenta are used for vertical stripes. They are
evenly distributed in HSV space for better color detection. Fig. 9 shows the resulting grid.
The use of a total of eight colors significantly improves the density and resolution in comparison to earlier works [113], [20].
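As an aside, the window property is easy to verify computationally. The sketch below uses the standard recursive Lyndon-word construction of a De Bruijn sequence (a textbook algorithm, not code from this thesis); with k = 4 colors per direction and order n = 3, every length-3 window of colors occurs exactly once in the cyclic sequence:

```python
def de_bruijn(k, n):
    """Standard recursive Lyndon-word construction of the De Bruijn
    sequence B(k, n) over the symbols 0..k-1 (length k**n)."""
    sequence = []
    a = [0] * (k * n)

    def db(t, p):
        if t > n:
            if n % p == 0:
                sequence.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return sequence

seq = de_bruijn(4, 3)                      # 4 colors, order 3
print(len(seq))                            # -> 64, i.e. k**n
# every length-3 window (taken cyclically) appears exactly once
windows = {tuple(seq[(i + j) % 64] for j in range(3)) for i in range(64)}
print(len(windows))                        # -> 64
```

In the grid pattern, one such sequence indexes the horizontal stripes and another the vertical stripes, so each crosspoint inherits a locally unique codeword.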
Figure 9. Color Grid Pattern
Image processing for grid extraction. Several image processing methods are used in our implementation to retrieve the stripe pattern overlain on the real-world scene. First, adaptive thresholding with a Gaussian kernel is used to correctly segment the projected and deformed grid. The idea is to check whether the RGB value of every pixel is greater than a threshold computed as the Gaussian-weighted average of the local intensity. Given that eight colors are used for the vertical and horizontal stripes in our implementation, such a threshold is not fixed but rather adapts to the local intensity of a 9 × 9 neighborhood of the pixel. The Hilditch thinning algorithm [45] is then applied until the skeleton of the grid is obtained. Based on the skeleton, intersections can then be located. The idea is to count the total number of rises (i.e., changes in pixel value from 0 to 1) and the total number of neighbors with a value of one within a 3 × 3 window centered at a pixel position of interest. The position is considered a grid intersection if both the total rises and the number of one-valued neighbors are at least 3, or if the total rises are greater than 1 and the number of one-valued neighbors is greater than 4.
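The rise-counting rule can be sketched as follows; the thresholds follow our reading of the rule above, while the function name and the toy skeleton (a plain 0/1 grid) are illustrative:

```python
def is_intersection(skel, r, c):
    """Classify pixel (r, c) of a binary skeleton as a grid intersection
    by counting 0->1 transitions ("rises") and one-valued neighbors in
    the 3x3 ring around it."""
    # the 8 neighbors in clockwise order, starting at the top-left
    ring = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
            (1, 1), (1, 0), (1, -1), (0, -1)]
    vals = [skel[r + dr][c + dc] for dr, dc in ring]
    rises = sum(1 for i in range(8) if vals[i - 1] == 0 and vals[i] == 1)
    ones = sum(vals)
    return (rises >= 3 and ones >= 3) or (rises > 1 and ones > 4)

# a '+'-shaped skeleton: the center is an intersection, arm pixels are not
skel = [[0, 0, 1, 0, 0],
        [0, 0, 1, 0, 0],
        [1, 1, 1, 1, 1],
        [0, 0, 1, 0, 0],
        [0, 0, 1, 0, 0]]
print(is_intersection(skel, 2, 2))  # -> True
print(is_intersection(skel, 2, 3))  # -> False
```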
Due to the complexity of the scene (e.g., shadows, distortion, depth discontinuity, and occlusion), there exist spurious connections and holes, especially at the border between the background and the object. To address this, the watershed algorithm [108] is first used to segment the target object from the background, resulting in two grid networks. The traversal algorithm from our prior work [100] is then applied to each network individually to determine the northern (Nb_north(v_i)), southern (Nb_south(v_i)), eastern (Nb_east(v_i)), and western (Nb_west(v_i)) neighbors of each intersection v_i. The algorithm moves a 3 × 1 (for the north-south traversal) or 1 × 3 (for the west-east traversal) window along the traversal direction toward the most recently chosen path pixel, until either an intersection is found or no intersection is found within an allotted maximum distance from the original intersection. An intersection v_i with all four neighbors determined is considered a perfect intersection and will be used for triangulation. Due to the aforementioned challenges, not all intersections have complete neighborhood information. To avoid losing those intersections when decoding the codewords for triangulation, further processing is needed, as elaborated in the section on neighbor completion below.
Color labeling and correction. Each intersection in our color-coded grid is represented by its unique neighborhood information, particularly the colors of its neighbors. To determine the color of each intersection, the captured color grid is blurred with a 3 × 3 Gaussian filter under a mask of the skeleton grid obtained in Section III.B. The purified color grid is then converted from RGB to HSV for color labeling of each pixel, where eight pairs of thresholds corresponding to the Hue ranges of the eight colors are applied. To compensate for color crosstalk, two morphological openings (with 5 × 1 and 1 × 5 structuring elements) are used to separate the horizontal and vertical stripes.
To determine the color labels of each intersection (i.e., c_i^h, c_i^v), a special vote majority rule is applied to a 9 × 9 neighborhood window of the intersection, where a 360° angular search is used to populate the histograms of the eight color labels. Each histogram of a particular color label has 18 bins, each corresponding to a non-overlapping 20° range. As stated in Section III.A, the color labels 1, 3, 5, and 7 are used for the horizontal stripes, while 2, 4, 6, and 8 are used for the vertical stripes. Let ρ_κ count the number of observations that fall into bin κ of a histogram, and let ρ_κ^{c_l} denote this count in the histogram of color label c_l. The color labels for the intersection can then be determined as follows:

c_i^h = g( max_{c_l = 1, 3, 5, 7; κ = 1, 2, ..., 18} (ρ_κ^{c_l}) )    (3.1)

c_i^v = g( max_{c_l = 2, 4, 6, 8; κ = 1, 2, ..., 18} (ρ_κ^{c_l}) )    (3.2)

where g(x): x → c_l, c_l ∈ C, maps x to its corresponding color label.
Although the special vote majority rule substantially minimizes the impact of poor color detection on the labeling of an intersection, errors remain. In view of those errors, a topological constraint is used for further correction. If intersections are linked horizontally in the camera image, their correspondences must be collinear. Moreover, those intersections must share the same horizontal color label, since they lie on the same horizontal stripe; the same idea applies to intersections on the same vertical stripe. Thus, when plotting the histogram of the color labels of the intersections on the same row or column, further correction is performed, where the row/column histogram has 4 bins, each corresponding to one of the four colors used for horizontal/vertical stripes:
c_i^h = g(max(ρ_κ^h)), κ = 1, 2, 3, 4    (3.3)

c_i^v = g(max(ρ_κ^v)), κ = 1, 2, 3, 4    (3.4)

where ρ_κ^h and ρ_κ^v are the bin counts in the histograms of color labels for the horizontal and vertical stripes, respectively.
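A minimal sketch of the row-wise part of this correction (Eq. 3.3): all intersections on one horizontal stripe vote, and every label on that row is replaced by the winner. The data and function name are illustrative:

```python
from collections import Counter

def correct_row_labels(row_labels, horizontal_colors=(1, 3, 5, 7)):
    """Replace every horizontal color label on a row with the row's
    majority label, since one horizontal stripe has a single color."""
    votes = Counter(l for l in row_labels if l in horizontal_colors)
    majority = votes.most_common(1)[0][0]
    return [majority] * len(row_labels)

# one mislabeled intersection (5) on a stripe that is mostly color 3
print(correct_row_labels([3, 3, 5, 3, 3]))  # -> [3, 3, 3, 3, 3]
```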
Neighbor completion using Hamming distance. Each intersection v_i together with its four neighbors forms two unique De Bruijn subsequences, one for the horizontal string (i.e., (ξ^h(Nb_west(v_i)), c_i^h, ξ^h(Nb_east(v_i)))) and the other for the vertical string (i.e., (ξ^v(Nb_north(v_i)), c_i^v, ξ^v(Nb_south(v_i)))). One difficulty in finding the correct correspondences is that not all intersections have complete neighborhood information, even after the traversal elaborated in Section III.C. The codewords resulting from those so-called "imperfect" intersections are literally broken, making it impossible to map them to the projected De Bruijn color grid. To address this, our method proposes a Hamming distance-based amendment procedure. The procedure first defines a function f(v_i) that returns the color labels of the intersection v_i and its neighbors,

f(v_i) = [(r_w^v, r_w^h); (c_i^v, c_i^h); (r_n^v, r_n^h); (r_e^v, r_e^h); (r_s^v, r_s^h)]    (3.5)

where (r_w^v, r_w^h), (r_n^v, r_n^h), (r_e^v, r_e^h), and (r_s^v, r_s^h) represent the western, northern, eastern, and southern neighbors' vertical and horizontal color labels of v_i. If v_i has complete neighborhood information, the values of r can be retrieved from ξ(·) in the grid graph; otherwise, the neighbors
are virtual, with colors to be determined through the procedure. The algorithm then traverses the retrieved intersections using a 3 × 4 or 4 × 3 moving window, depending on whether the vertical or horizontal colors of the missing neighbors are being labeled. For each intersection with missing neighbors, the procedure assigns the color labels of its missing neighbors the value 0. For instance, if the northern neighbor of v_i is missing, r_n^v = r_n^h = 0. Doing so makes the originally broken codewords complete but invalid. Such an amended but faulty word should have the minimal Hamming distance to its corresponding correct one. Based on this idea, the procedure then creates virtual neighbors of each "imperfect" intersection with the correct color labels. Compared to real intersections, virtual intersections only have color labels, without coordinates. The detailed amendment procedure is given in Algorithm 1 in Appendix A. For clarity, the algorithm only includes the procedures for creating virtual neighbors and assigning their vertical color labels; the same procedures, except for the moving window size, apply to their horizontal colors.
In Algorithm 1 (Appendix A), the inputs are:

• G: the original retrieved intersection grid.

• ~v = {s_0, s_1, ..., s_{m×m−1}}: a matrix set where each element s_j is a 3 × 4 matrix whose rows each form a vertical De Bruijn subsequence (a_{0j}, a_{1j}, a_{2j}, a_{3j}) of length 4:

          [ a_{0j}  a_{1j}  a_{2j}  a_{3j} ]
    s_j = [ a_{0j}  a_{1j}  a_{2j}  a_{3j} ]
          [ a_{0j}  a_{1j}  a_{2j}  a_{3j} ]

• H: the function for calculating the per-element Hamming distance between two matrices.

The outputs are:

• r_s^v, r_w^v, r_e^v, and r_n^v.

• G′: the amended intersection grid with virtual neighbors.

The two matrices π^v and π^h are of size 3 × 4 and 4 × 3 for the horizontal and vertical cases, respectively.
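The core of the amendment, stripped of the traversal bookkeeping, is a nearest-codeword search: missing labels are encoded as 0, and the candidate De Bruijn subsequence at minimal per-element Hamming distance supplies the virtual neighbors' colors. A minimal sketch (the candidate list and function names here are hypothetical):

```python
def hamming(a, b):
    """Per-element Hamming distance between equal-length codewords."""
    return sum(x != y for x, y in zip(a, b))

def complete_codeword(broken, candidates):
    """Return the candidate subsequence closest to the broken codeword,
    in which missing color labels are encoded as 0."""
    return min(candidates, key=lambda c: hamming(broken, c))

# hypothetical vertical subsequences taken from the projected grid
candidates = [(2, 4, 6, 8), (4, 6, 8, 2), (6, 8, 2, 4)]
broken = (4, 6, 0, 2)                         # third label missing
print(complete_codeword(broken, candidates))  # -> (4, 6, 8, 2)
```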
Chapter 4
Markerless Tracking and Camera Pose Estimation
Introduction
The proposed markerless AR system consists of two main stages: range finding using SL (Chapter 3), and markerless tracking by matching SURF descriptors and camera pose estimation using PnP. The system architecture is shown in Fig. 10. The initialization, image processing, and reconstruction (range finding) stages are described in detail in Fig. 8. In this chapter, the tracking and pose estimation process (the loop in Fig. 10) is discussed.
Figure 10. The proposed markerless AR system
System Calibration
A binocular 3D reconstruction system consists of an optical camera and a projector, where
the projector is used for pattern projection and the camera for image acquisition. The
precision and accuracy of reconstruction and tracking is highly dependent on the intrinsic
and extrinsic parameters of the camera and projector. Thus a proper system calibration is a
prerequisite for correct computation of any scaled or metric results from their images.
The intrinsic camera parameters consist of the horizontal and vertical focal lengths fx and fy, the principal point (cx, cy) of the camera, and the skew. In our system the skew is assumed to be zero, fx and fy can be calculated from the camera/projector focal length F and field of view (FOV), and the principal point is taken as the image center. To compensate for distortion of the camera, the radial and tangential distortion coefficients up to the second order need to be computed. To calibrate these, a sufficient number of 2D-3D correspondences has to be created, which is typically done using a known target
such as a chessboard. For the calibration of a low-resolution camera, we use the calibration method proposed by Zhang [118] through the OpenCV [13] library. In this method, the size of the chessboard and of a single cell is defined by the user. Several shots of the chessboard are then captured and its inner corners are detected. The reprojection error between the detected corners and the chessboard corners is reduced by iteratively updating the intrinsic and extrinsic parameters of the camera.
The extrinsic parameters are the translation vector t and rotation matrix R that bring the camera to the projector in the 3D world coordinate system.
For each image the camera takes of a particular object, we can describe the pose of the object relative to the camera coordinate system in terms of a rotation and a translation; see Fig. 11.
A pinhole camera model describes a perspective transformation from 3D to 2D, where a scene view is formed by projecting 3D points onto the image plane using Eq. 4.1:

s p' = A [R|t] P'    (4.1)
Figure 11. Converting from object to camera coordinate systems: the point P on the object is seen as point p on the image plane; p is related to P by applying a rotation matrix R and a translation vector t to P [49]
or, in expanded form,

      [u]   [fx  0  cx] [r11 r12 r13 t1] [X]
    s [v] = [ 0  fy cy] [r21 r22 r23 t2] [Y]
      [1]   [ 0  0   1] [r31 r32 r33 t3] [Z]
                                         [1]    (4.2)
where:
• (X, Y, Z) are the coordinates of a 3D point P in the world coordinate space
• (u, v) are the coordinates of the projection point p in pixels
• A is a camera matrix, or a matrix of intrinsic parameters
• (cx, cy) is a principal point that is usually at the image center
• fx, fy are the focal lengths expressed in pixel units.
The intrinsic parameters are independent of the scene viewed. Once calibrated, they can be re-used as long as the focal length is fixed. The joint rotation-translation matrix [R|t] is called the matrix of extrinsic parameters. It describes the camera motion around a static scene or, vice versa, the rigid motion of an object in front of a still camera. The rigid transformation above can be written as (when z ≠ 0):

[x, y, z]^T = R [X, Y, Z]^T + t
x' = x/z
y' = y/z
u = fx x' + cx
v = fy y' + cy
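The pipeline above (rigid transform, perspective division, intrinsic mapping) condenses to a few lines; the function name and the numbers in the example are illustrative:

```python
def project_point(P, R, t, fx, fy, cx, cy):
    """Project a 3D world point P = (X, Y, Z) to pixel coordinates via
    the pinhole model: [x, y, z]^T = R P + t, then x' = x/z, y' = y/z,
    then u = fx*x' + cx, v = fy*y' + cy (assumes z != 0)."""
    x, y, z = (sum(R[r][c] * P[c] for c in range(3)) + t[r] for r in range(3))
    xp, yp = x / z, y / z
    return fx * xp + cx, fy * yp + cy

# identity pose: a point on the optical axis lands on the principal point
I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(project_point((0.0, 0.0, 2.0), I3, [0, 0, 0], 800, 800, 320, 240))
# -> (320.0, 240.0)
```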
Real lenses usually have some distortion, mostly radial distortion and slight tangential
distortion. So, the above model is extended as:
[x, y, z]^T = R [X, Y, Z]^T + t
x' = x/z
y' = y/z
x'' = x' (1 + k1 r^2 + k2 r^4 + k3 r^6) / (1 + k4 r^2 + k5 r^4 + k6 r^6) + 2 p1 x' y' + p2 (r^2 + 2 x'^2)
y'' = y' (1 + k1 r^2 + k2 r^4 + k3 r^6) / (1 + k4 r^2 + k5 r^4 + k6 r^6) + p1 (r^2 + 2 y'^2) + 2 p2 x' y'
where r^2 = x'^2 + y'^2
u = fx x'' + cx
v = fy y'' + cy
where k1, k2, k3, k4, k5, and k6 are radial distortion coefficients, and p1 and p2 are tangential distortion coefficients. The distortion coefficients do not depend on the scene viewed; thus, they also belong to the intrinsic camera parameters. For the distortion we take into account the radial and tangential factors [13]. For the radial factor we use the following formulas:

x_corrected = x (1 + k1 r^2 + k2 r^4 + k3 r^6)    (4.3)

y_corrected = y (1 + k1 r^2 + k2 r^4 + k3 r^6)    (4.4)
So for a pixel at coordinates (x, y) in the original image, its position in the corrected image is (x_corrected, y_corrected). Tangential distortion occurs when the camera lens is not perfectly parallel to the imaging plane. It is corrected using the formulas below:

x_corrected = x + [2 p1 x y + p2 (r^2 + 2 x^2)]    (4.5)

y_corrected = y + [p1 (r^2 + 2 y^2) + 2 p2 x y]    (4.6)

So we have five distortion parameters, presented as a one-row matrix with five columns:

Distortion coefficients = (k1  k2  p1  p2  k3)    (4.7)
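The five-coefficient model of Eqs. 4.3-4.6 can be sketched directly (normalized coordinates in, corrected coordinates out; the function name is ours):

```python
def correct_point(x, y, k1, k2, k3, p1, p2):
    """Apply the radial (Eqs. 4.3-4.4) and tangential (Eqs. 4.5-4.6)
    correction terms to a normalized image point (x, y)."""
    r2 = x * x + y * y
    radial = 1 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    x_corr = x * radial + 2 * p1 * x * y + p2 * (r2 + 2 * x * x)
    y_corr = y * radial + p1 * (r2 + 2 * y * y) + 2 * p2 * x * y
    return x_corr, y_corr

# with all five coefficients zero, the point is unchanged
print(correct_point(0.5, -0.25, 0, 0, 0, 0, 0))  # -> (0.5, -0.25)
```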
The projector's intrinsics and the camera-projector pair extrinsics can be calibrated using the aforementioned method by treating the projector as an inverse camera; we use the software provided by [74] for system calibration.
Feature Based Tracking
As reviewed earlier, in a practical AR application markerless tracking must meet the following requirements:

• Robustness: the tracking must be robust against noise and be able to recover from failure.

• Performance: the tracking must be fast enough to ensure that AR performs in real time.

• Set-up time: the initialization time must be as short as possible for a practical AR application.
In our markerless system, the SURF detector and descriptor are used due to their robustness under weak lighting conditions, invariance to scale and orientation, and fast computation. All of these features enable real-time tracking.
SURF is recognized as a more efficient substitute for SIFT. It has a Hessian-based detector and a distribution-based descriptor generator.
The SURF detector is based on the Hessian matrix, defined as

    H(x, σ) = [ Lxx(x, σ)  Lxy(x, σ) ]
              [ Lxy(x, σ)  Lyy(x, σ) ]    (4.8)
where L is the convolution of the Gaussian second-order derivative with the image I at point x. In the implementation of [68], simpler box filters are used instead of Gaussian filters. An image pyramid is built and convolved with the box filters to obtain scale-invariance properties.
To describe a SURF keypoint, an orientation is first assigned to it. A square region aligned with this orientation is then constructed around the keypoint, from which the SURF descriptor is extracted. The region is split regularly into 4 × 4 square subregions. The Haar wavelet responses in the horizontal (dx) and vertical (dy) directions are computed and summed over each subregion, forming the first entries of the feature vector. The absolute values of the responses, |dx| and |dy|, are also summed, so that each subregion contributes a four-dimensional descriptor. Over all 4 × 4 subregions, this results in a vector of length 64.
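The 64-dimensional layout can be made concrete with a small sketch: for each of the 4 × 4 subregions, the four entries (Σdx, Σdy, Σ|dx|, Σ|dy|) are concatenated. The dummy response values below stand in for real Haar wavelet outputs:

```python
def surf_descriptor(dx, dy):
    """Assemble the 64-D SURF descriptor layout from per-subregion Haar
    responses: 4 x 4 subregions x (sum dx, sum dy, sum |dx|, sum |dy|)."""
    vec = []
    for i in range(4):
        for j in range(4):
            rx, ry = dx[i][j], dy[i][j]
            vec += [sum(rx), sum(ry),
                    sum(abs(v) for v in rx), sum(abs(v) for v in ry)]
    return vec

# 4 x 4 subregions, each holding a few dummy wavelet responses
dx = [[[0.1, -0.2] for _ in range(4)] for _ in range(4)]
dy = [[[0.3, 0.4] for _ in range(4)] for _ in range(4)]
print(len(surf_descriptor(dx, dy)))  # -> 64
```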
In our system, the camera captures an image of the scene at the initialization stage and computes the SURF descriptor of each detected keypoint. Note that the 3D coordinates of these keypoints are already calculated in the previous 3D reconstruction stage, as described in Chapter 3. Each keypoint is assigned the 3D coordinates of its nearest color grid intersection. In the end, these keypoints, their descriptors, and their 2D locations are stored as a reference.
As the camera/object moves around the scene, SURF descriptors are extracted from each incoming frame. Once SURF generates its floating-point descriptors, a brute-force k-nearest-neighbors (k-NN) matching algorithm is applied to find the 2 nearest matches between the descriptors, using the L2 norm as the distance metric. To discard mismatched keypoint pairs and guarantee strong correspondences, a distance-ratio threshold is applied: only when the ratio between the best match distance and the second-best match distance is below a specific threshold (e.g., 0.7) is the best match accepted as a "correct" one.
Then, the detected 2D/3D feature correspondences are used to determine the exact 6-DoF camera pose by applying several Perspective-n-Point (PnP) iterations, followed by a pose refinement step with a non-linear least-squares algorithm. If enough feature correspondences have been detected and validated by the RANSAC iterations, the number of used features n can be decreased, thereby speeding up pose estimation; the 6-DoF camera pose is then computed using these n features.
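The 2-NN distance-ratio test can be sketched as follows (plain tuples stand in for real SURF descriptors; the 0.7 threshold matches the text):

```python
def ratio_test_match(query, train_descs, ratio=0.7):
    """Brute-force 2-NN matching with a distance-ratio test: accept the
    nearest neighbor only if best / second_best < ratio (L2 metric)."""
    def l2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    dists = sorted((l2(query, d), idx) for idx, d in enumerate(train_descs))
    (best, best_idx), (second, _) = dists[0], dists[1]
    if second > 0 and best / second < ratio:
        return best_idx          # strong, unambiguous correspondence
    return None                  # ambiguous match -> rejected

train = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
print(ratio_test_match((0.1, 0.0), train))  # -> 0
print(ratio_test_match((0.5, 0.5), train))  # -> None
```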
Camera Tracking and Pose Estimation Using Perspective-n-Points
In our case, to allow for an accurate pose estimation, the camera intrinsics are first calibrated using the method in the System Calibration section. With the known camera intrinsics, iterative PnP [83], based on Levenberg-Marquardt optimization, is used for pose estimation. It finds the pose that minimizes the reprojection error, i.e., the sum of squared distances (SSD) between the observed 2D projections in the camera frames and the object points projected using Eq. (4.1). Although iterative PnP is slower than P3P [40] or EPnP [61], it returns the most robust estimation. The pose estimation procedure is as follows:
1. A set of the object's points is selected and their 3D world coordinates are reconstructed using the SL range-finding method described in Chapter 3: (P_1, P_2, P_3, ..., P_i), i ≥ 6. Their initial 2D coordinates in the camera image plane are (p_1^0, p_2^0, p_3^0, ..., p_i^0) (Fig. 15a).

2. The SURF descriptors (d_1^0, d_2^0, d_3^0, ..., d_i^0) of the initial 2D keypoints (p_1^0, p_2^0, p_3^0, ..., p_i^0) are extracted in the initial frame (frame 0).

3. As the camera/object moves in the scene, a set of keypoints (p_1^j, p_2^j, p_3^j, ..., p_k^j) in the current incoming frame (the jth frame) is detected and described by SURF, yielding their descriptors (d_1^j, d_2^j, d_3^j, ..., d_k^j).

4. A brute-force 2-NN (k-NN with k = 2) matching algorithm (Algorithm 1) is applied to find correspondences between the initial 2D keypoints' descriptors (d_1^0, ..., d_i^0) and the current frame's SURF descriptors (d_1^j, ..., d_k^j) using the L2 norm. In our practical application, a stable and fast matching result is obtained using the parameters τ = 3 and i, k > 6 in Algorithm 1.

5. An iterative PnP algorithm is used to solve for the rotation matrix R and translation vector t using the correspondences between the 3D coordinates (P_1, P_2, P_3, ..., P_i), i ≥ 6, and the new 2D coordinates (p_1^j, p_2^j, p_3^j, ..., p_k^j).

6. Repeat steps 3-5 to update the camera pose as the camera moves around the scene (Fig. 15).
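The quantity minimized in step 5 is the reprojection error; a minimal sketch of its computation (pure Python, names ours) makes the objective explicit:

```python
def reprojection_error(points3d, points2d, R, t, fx, fy, cx, cy):
    """Sum of squared distances between observed 2D keypoints and the
    3D points projected under the pose hypothesis (R, t); iterative PnP
    searches for the (R, t) minimizing this value."""
    err = 0.0
    for P, (u_obs, v_obs) in zip(points3d, points2d):
        x, y, z = (sum(R[r][c] * P[c] for c in range(3)) + t[r]
                   for r in range(3))
        u, v = fx * x / z + cx, fy * y / z + cy
        err += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return err

# the exact pose reprojects the point perfectly, so the error is zero
I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(reprojection_error([(0.0, 0.0, 2.0)], [(320.0, 240.0)],
                         I3, [0, 0, 0], 800, 800, 320, 240))  # -> 0.0
```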
Algorithm 1. Brute-force k-NN based matching algorithm

 1: create a matching score matrix Δ
 2: for m ← 1 to k do
 3:   for n ← 1 to i do
 4:     Δ[m, n] = δ_mn = L2(d_m^0, d_n^j)
 5:   end for
 6: end for
 7: for m ← 1 to k do
 8:   q_mn = min(δ_m1, δ_m2, δ_m3, ..., δ_mi)
      q'_ml = min((δ_m1, δ_m2, δ_m3, ..., δ_mi) \ q_mn)
 9:   if q_mn / q'_ml > τ then
10:     a strong match pair of descriptors (d_m^0, d_n^j) is defined by (m, n)
11:   end if
12: end for
13: the created matching score matrix Δ:

         [ δ_11  δ_12  δ_13  ...  δ_1i ]
     Δ = [ δ_21  δ_22  δ_23  ...  δ_2i ]
         [  ...   ...   ...  ...   ... ]
         [ δ_k1  δ_k2  δ_k3  ...  δ_ki ]
Chapter 5
Experiment and Results
Our implementation setup is composed of (a) a desktop with an Intel quad-core (3.40 GHz) i7-2600K CPU, 8 GB of DDR3-1600 memory, and a 64-bit Windows 7 operating system; (b) a Point Grey Flea3 color camera running at a resolution of 640 × 480; and (c) an Optima 66HD DLP projector set to a resolution of 1280 × 800, as shown in Fig. 12. When the projector projects a color-coded light pattern onto a 3D object, the camera, located 1 meter away from the object, captures the image. Both camera and projector are calibrated with their intrinsic and extrinsic parameters (Tables 1, 2, and 3). To validate the accuracy and speed of our proposed method, a plaster bust with a complex surface is used in our experiments. The proposed markerless AR system performs at 30 frames per second (fps), which is considered real-time in our context.
Table 1
Camera and projector intrinsic parameters
Intrinsics   fx       fy       cx      cy      k1     k2     p1     p2     k3    ε
Camera       1796.10  1792.37  259.97  288.02  -0.23  -5.46  0.00   0.00   0.00  0.09
Projector    1404.45  1348.78  477.97  276.15  -0.48  2.57   -0.03  -0.01  0.00  0.09
Table 2
Camera-projector pair rotation matrix
Rotation matrix (R)   r11   r12   r13    r21    r22   r23   r31   r32    r33
Cam-proj pair         0.95  0.06  -0.32  -0.03  0.99  0.11  0.33  -0.09  0.94
Figure 12. System setup: a projector on the bottom-left, a camera on the right, and a plaster bust in the scene.
Table 3
Camera-projector pair translation vector
Translation vector (t)   tx       ty        tz
Cam-proj pair            377.29   -291.63   243.83
3D Reconstruction
Figure 13. SL reconstruction process: (a) A plaster bust in our experiment. (b) Color grid pattern projected onto the plaster bust. (c) Extracted color grid from (b)
Fig. 13 shows the object (Fig. 13a) and the captured intersection network with 62 × 60 color stripes (Fig. 13b). Due to the highly deformed surface, especially in the hair, eye, and nose areas, there are several holes, as depicted in Fig. 13c. Most of them are due to occlusion distortion. Note that with the help of the watershed algorithm, the impact of the sharp depth discontinuity between the background and the object is minimal. The reconstructed point cloud and mesh are shown in Fig. 14. In the experiment, the total number of intersections in the projected De Bruijn grid (66 × 66) is 4356. However, four horizontal and six vertical stripes fall outside the camera's FOV, resulting in a captured grid of size 62 × 60 with 3720 intersections. Due to occlusion distortion, 393 intersections are missing,
especially in the areas of the hair, the eyes, and the borders between the bust and the background. Against this ground truth, our method recovers 3327 intersections, with 267 labeled incorrectly and 110 containing incomplete neighborhood information. This corresponds to 88.6% correct assignment.
Figure 14. Reconstructed (a) point cloud and (b) mesh
With the focus on local optimization, such as the vote majority rule for color detection and correction and the De Bruijn-based Hamming distance for improving the neighborhood information of intersections, our method achieves the best speed in comparison to existing one-shot methods: in [104] the running time for 1000 grid intersections was about 3 minutes, and in [21] the running time for 1110 grid intersections was about 313 ms. In unoptimized C++ code, our running time for 3327 grid intersections was about 1 second.
Markerless Tracking and Pose Estimation
In our AR system the world origin is at the camera optical center. As shown in Fig. 11, the direction pointing from the camera optical center to the camera front is defined as positive Z. Similarly, positive Y is defined as the direction from the camera optical center to the camera bottom, and positive X as the direction from the camera optical center to the camera right. The clockwise rotations around the X, Y, and Z axes (i.e., rx, ry, rz) are termed pitch, yaw, and roll, respectively. In a similar fashion, the translations along the X, Y, and Z axes are denoted tx, ty, and tz, respectively.
A tripod with a camera ball mount is used to enable camera rotation and translation. For
the ground truth, an iPhone 6 is mounted on the top of the camera. Thus, while the iPhone
rotates with the camera, its built-in sensor (i.e., InvenSense MP67B) can precisely measure
the rotation. On the other hand, a tape ruler is used to physically measure the translation.
Eventually, by moving the tripod around the scene, the performance of the developed system is tested in terms of camera rotation and translation. In our system setup (Chapter 5), we test the proposed markerless AR system in a scene of 100 inches × 60 inches × 60 inches, while the bust (or a part of it) stays in the camera's FOV.
In the first set of experiments, the rotation and translation are computed and measured
between the camera initial pose and the current frame. As exemplified in Fig. 15, the first
set of SURF keypoints (red dots) are extracted in the initial frame (Fig. 15a), where an
initial pose is estimated using the 3D-2D correspondences and visualized by the X, Y, and
Z axes. The X axis (red) points to the right, the Y axis (green) down, and the Z axis (blue) into the image. While the camera moves in the scene, a new set of SURF keypoints is extracted for new pose estimation, as shown in Figs. 15b, 15c, and 15d. Note that the number of extracted keypoints varies from frame to frame. Even when occlusion occurs, as shown in Fig. 15d, our AR system is still robust enough to correctly compute the pose.
Figure 15. Camera tracking and pose estimation using PnP. The camera moves around the scene and its poses are estimated in real time. The rotations and translations along the X (red), Y (green), and Z (blue) axes are rx, ry, rz and tx, ty, tz. (a) SURF keypoints (red dots) extracted at the initial pose. (b) The camera moves closer to the bust. (c) The camera moves away from the bust. (d) A correct estimation under partial occlusion.

In the second set of experiments, the camera's motion data (i.e., rotation and translation) are obtained and compared with the ground truth at 27 different displacements of the camera. Due to sensor noise, we acquire 10 groups of motion data at each displacement, including the pose estimation from our AR system (i.e., Cam(rx), Cam(ry), Cam(rz),
Cam(tx), Cam(ty), Cam(tz)) and the physical measurement (True(rx), True(ry), True(rz),
True(tx), True(ty), True(tz)). The mean of the 10 data points is plotted and the corresponding errors in comparison to the ground truth are presented in Fig. 16. Figs. 16a and 16c show the rotation (in degrees), and Figs. 16b and 16d the translation (in inches). As can be seen in Figs. 16a and 16b, the plots are well superimposed. With the tracked object 60 inches from the camera, the average absolute error of our system is 1.98 inches in translation and 1.07 degrees in rotation (with standard deviations of 1.91 inches and 2.42 degrees). More detail can be found in Table 4. Note that the maximum error is 8.89 inches in translation and 7.25 degrees in rotation, due to the following reasons:
• A rough calibration of the camera-projector pair causes 3D reconstruction and tracking errors.

• The physical measurement using the tape ruler and the iPhone 6 gyroscope has a low precision of 0.5 inch in translation and 1 degree in rotation.

• To allow for a real-time AR application, we set the camera shutter to 15 ms and the gain to 20 dB, which increases noise and sometimes causes a flash effect in which lightness and contrast change suddenly in the incoming frame.
Table 4
Accuracy of the markerless pose estimation: Mean Error (inches and degrees), Standard Deviation, and Maximum Error
        error(rx)  error(ry)  error(rz)  error(tx)  error(ty)  error(tz)
mean      -0.03      -1.22       1.95       3.69      -0.90       1.35
std        2.88       2.45       1.95       2.17       1.61       1.95
max        6.98       1.97       7.25       8.89       1.17       8.65
Figure 16. Accuracy of the proposed markerless pose estimation: (a) and (c) camera translation measured using the tape ruler and our AR system. (b) and (d) corresponding errors (inches, degrees).

Processing Speed Analysis

One of the development goals is to achieve real-time AR performance that supports user interactivity. When moving and rotating the camera along the X, Y, and Z axes, our AR system consistently performs at an average of 30 fps, as shown in Fig. 17. Note that the FPS is measured while moving the camera in the scene described in Section 5.2. The FPS is calculated as 1/processing time, where the processing time includes tracking and pose estimation for a single frame.
To understand how each process contributes to the overall system speed, for future improvement, we look into the time cost of each process carefully. First, we measure the time spent on the three processes (i.e., initialization, image processing, and reconstruction) that occur once and only once in the AR pipeline. The initialization process includes all the stages that create and assign global variables, such as the De Bruijn color grid pattern, camera-projector calibration, and the first frame. All the necessary procedures
that find intersection correspondences belong to image processing, for example, adaptive thresholding to extract the color grid, Gaussian blur to remove image noise, hue thresholding, morphological operations, and skeletonization. The reconstruction involves intersection-neighbor traversal, intersection color labeling, Hamming distance-based color label correction, intersection 2D coordinate computation by decoding the De Bruijn sequence, and intersection triangulation. As can be seen in Fig. 18, the initialization costs 3641 ms, image processing 693 ms, and reconstruction 502 ms. Skeletonization (366 ms), morphological operations (201 ms), and thresholding (126 ms) are relatively expensive processes.

Figure 17. Frame rate (fps) of the proposed method
Second, we evaluate the tracking and pose estimation in a similar fashion. As shown in Fig. 18b, the total running time for this process is 34 ms (29.4 fps). Although iterative PnP and brute-force matching algorithms are used, the proportion of time spent on matching (5 ms) and pose estimation (6 ms) in the entire iterative tracking and pose estimation process is smaller than that of SURF feature detection and description (25 ms). According to Fig. 17, the three procedures are executed at an average of 30 Hz. For each tracked frame, there are approximately 10 SURF feature points to be tracked. As also shown in Fig. 17, there are some drops in the FPS curve due to an increase in the number of detected SURF feature points. For instance, at frame 193 the system detects 15 feature points; the extra time spent on feature description, matching, and iterative PnP pose estimation for those points reduces the rate to 10.5 fps.

Figure 18. Overall performance: (a) Initialization. (b) Iterative tracking and pose estimation.
Chapter 6
Discussion and Conclusion
This thesis proposes a markerless augmented reality system that uses one-shot spa-
tially multiplexed structured light. In our AR system, the 3D points are obtained by SL
reconstruction, the 2D points are obtained by SURF, and the pose is estimated from the
resulting 3D-2D correspondences. The application of SL overcomes a deficiency of exist-
ing marker-based and markerless AR methods, which rely on fiducial markers, prior
information about the real-world scene, or environment-to-environment comparison, thereby
making the proposed methodology more suitable for AR in an unknown and uncertain
environment. The fast one-shot real-time 3D reconstruction (range-finding) method uses a
De Bruijn encoded color pattern. Unlike previous one-shot approaches, our formulation
focuses on local optimization, where a Hamming distance minimization method is used
to recover missing topological information and a special majority vote rule is applied for
color detection and correction. The proposed range-finding method proves to be effi-
cient and achieves a desirable accuracy, even when applied to a very complicated scene.
Note that with the focus on mobile applications, such as mobile interactive augmented re-
ality [100], we emphasize speed over accuracy. Thus we use only about 10 3D points in
our application, which are sufficient to estimate the camera pose, as a larger number of 3D
points may significantly slow down tracking and pose estimation. The proposed feature-
point-based tracking and pose estimation method is tested in an indoor application; the
average absolute error is 1.98 inches in translation and 1.07 degrees in rotation, which is
considered reliable in our scenario. The entire system was tested on a desktop and was
shown to operate at approximately 30 Hz with tracking and pose estimation in each frame.
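The 3D-2D pose estimation step summarized above can be illustrated with a minimal sketch. The thesis uses an iterative PnP solver; the linear DLT below is a simplified stand-in, not the thesis implementation, and assumes noise-free correspondences in normalized image coordinates with at least six points:

```python
import numpy as np

def dlt_pnp(X, x):
    """Linear PnP via the Direct Linear Transform: recover [R|t] from 3D
    points X (n x 3) and their normalized image projections x (n x 2)."""
    n = len(X)
    Xh = np.hstack([X, np.ones((n, 1))])          # homogeneous 3D points
    A = np.zeros((2 * n, 12))
    for i, (u, v) in enumerate(x):
        A[2 * i, 4:8] = -Xh[i]                    # from  v*(p3.X) - (p2.X) = 0
        A[2 * i, 8:12] = v * Xh[i]
        A[2 * i + 1, 0:4] = Xh[i]                 # from  (p1.X) - u*(p3.X) = 0
        A[2 * i + 1, 8:12] = -u * Xh[i]
    P = np.linalg.svd(A)[2][-1].reshape(3, 4)     # null vector, up to scale
    U, S, Vt = np.linalg.svd(P[:, :3])
    R, scale = U @ Vt, S.mean()                   # nearest rotation + scale
    if np.linalg.det(R) < 0:                      # enforce a proper rotation
        R, scale = -R, -scale
    return R, P[:, 3] / scale
```

With synthetic noise-free data the recovered rotation and translation match the ground truth to numerical precision; a real system would wrap such an estimate in RANSAC and iterative refinement, as described in this thesis.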
Discussion and Future Work
The proposed algorithm is in its infancy and has many areas of potential improvement,
particularly in the following aspects:
• This work is only a proof of concept that our AR system can be implemented on
mobile devices. One of our next steps is to port the code to mobile devices for
verification.
• Our AR system is based on SL and visual tracking, so color and feature point de-
tection is sensitive to lighting conditions. An adaptive color and feature detection
algorithm with an automatic outlier rejection mechanism is therefore worth future
investigation to enhance the robustness of 3D reconstruction and tracking.
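The majority-vote color correction used in our reconstruction can be sketched as follows. The hue ranges and the three-label color set are assumptions for illustration, not the system's calibrated values:

```python
from collections import Counter

# Assumed hue ranges in degrees for the pattern's primary colors (red, green,
# blue); an adaptive system would re-estimate these under changing lighting.
HUE_RANGES = {0: (330, 30), 1: (90, 150), 2: (210, 270)}

def classify_hue(hue):
    """Map a hue in [0, 360) to a color label, or -1 if it fits no range."""
    for label, (lo, hi) in HUE_RANGES.items():
        # ranges like (330, 30) wrap around 0 degrees
        inside = (lo <= hue < hi) if lo < hi else (hue >= lo or hue < hi)
        if inside:
            return label
    return -1

def majority_vote(hues):
    """Correct noisy per-pixel decisions: the label most pixels agree on wins."""
    votes = Counter(classify_hue(h) for h in hues)
    votes.pop(-1, None)                    # discard unclassified pixels
    return votes.most_common(1)[0][0] if votes else -1
```

A single mislit pixel then no longer flips the decoded color of a stripe, since its vote is outweighed by its neighbors.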
• Another future step, currently work in progress, is an extrinsic calibration of the
camera-projector pair using SL. This method uses correspondences between the
projected color grid pattern and the camera frame to estimate an essential matrix using
RANSAC. The rotation matrix and translation vector are then obtained by decom-
posing the essential matrix. This calibration method simplifies the process described
in Section , since no extra software is needed.
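The decomposition step of this planned calibration can be sketched with NumPy. RANSAC and the correspondence search are omitted; the snippet only shows how a known essential matrix E = [t]x R factors into its rotation candidates and translation direction (the four-fold ambiguity would normally be resolved by a cheirality check):

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]x, so that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0, -t[2], t[1]],
                     [t[2], 0, -t[0]],
                     [-t[1], t[0], 0]])

def decompose_essential(E):
    """Return the two rotation candidates and the translation direction."""
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0:               # force proper orthogonal factors;
        U = -U                             # flipping only changes the sign of
    if np.linalg.det(Vt) < 0:              # E, which is unobservable anyway
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    return U @ W @ Vt, U @ W.T @ Vt, U[:, 2]   # R1, R2, t (up to sign/scale)
```

The true rotation appears among the two candidates, and the recovered translation direction is parallel to the true baseline, which is exactly what the camera-projector calibration needs (the scale must come from elsewhere, e.g. a known pattern dimension).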
• To improve the accuracy of the reconstruction method, the study of global optimiza-
tion algorithms, such as Markov Random Fields (Section ), is another future direction;
such an algorithm would find an optimal solution for the correspondence problem in
Section , where topological and epipolar constraints serve as score functions.
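As a toy illustration of what such a global formulation looks like (an assumed simplification, far from a full 2D MRF over the grid), the snippet below labels a 1D chain of grid points by minimizing unary mismatch costs plus a pairwise consistency term with dynamic programming, which is the exact solver for chain-structured MRFs:

```python
def chain_mrf(unary, pairwise):
    """Exact MAP labeling of a chain MRF by dynamic programming (Viterbi).
    unary[i][k]   : cost of giving node i the label k
    pairwise[k][l]: cost of adjacent labels (k, l)"""
    n, K = len(unary), len(unary[0])
    cost = [list(unary[0])] + [[0.0] * K for _ in range(n - 1)]
    back = [[0] * K for _ in range(n)]
    for i in range(1, n):
        for k in range(K):
            best = min(range(K), key=lambda l: cost[i - 1][l] + pairwise[l][k])
            back[i][k] = best
            cost[i][k] = cost[i - 1][best] + pairwise[best][k] + unary[i][k]
    labels = [min(range(K), key=lambda k: cost[-1][k])]
    for i in range(n - 1, 0, -1):          # trace the optimal path back
        labels.append(back[i][labels[-1]])
    return labels[::-1]
```

With a pairwise term that heavily penalizes equal adjacent labels (mimicking the De Bruijn pattern's no-repeat property), a locally ambiguous color observation is overridden by the globally cheaper labeling.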
• The performance of the proposed pose estimation method is highly dependent on the
feature detection, description and matching algorithms, as illustrated in Fig. 18b. Al-
though in our implementation the SURF detector, descriptor and brute-force 2-NN
matching algorithm are sufficient to achieve 30 Hz AR, this performance will de-
crease on mobile devices due to hardware limitations. Future work is to find a
balance between the computational cost and the reliability of the tracking method.
We will test other, faster detectors, descriptors and matching algorithms, such as the
FAST detector, the BRISK and FREAK descriptors, and a Hamming-distance-based
matcher (if binary descriptors are used) or FLANN's hierarchical-clustering approach [75].
Another route is to treat feature matching as a classification problem: instead of creat-
ing feature descriptors and matching algorithms, feature correspondences can also be
determined by classifiers [80] specifying the characteristics of feature points [39, 60].
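A Hamming-distance brute-force matcher with a 2-NN ratio test, as would be paired with binary descriptors such as BRISK or FREAK, can be sketched in a few lines. This is a pure-NumPy stand-in for illustration; a real system would use an optimized matcher:

```python
import numpy as np

def hamming_match(desc1, desc2, ratio=0.8):
    """Brute-force 2-NN matching of binary descriptors (uint8 rows) with a
    nearest/second-nearest Hamming distance ratio test."""
    # popcount of XOR = Hamming distance between bit strings
    d = np.unpackbits(desc1[:, None, :] ^ desc2[None, :, :], axis=2).sum(2)
    matches = []
    for i, row in enumerate(d):
        nn = np.argsort(row)[:2]                  # two nearest neighbors
        if len(nn) > 1 and row[nn[0]] < ratio * row[nn[1]]:
            matches.append((i, int(nn[0])))       # keep unambiguous matches
    return matches
```

Because the distance is a XOR plus popcount, this matcher is far cheaper per comparison than the Euclidean distances needed for SURF descriptors, which is why binary pipelines are attractive on mobile hardware.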
• As the hardware of mobile devices evolves, acceleration methods are becoming popu-
lar in mobile applications. For example, modern mobile devices are equipped with
multi-core CPUs and GPUs, so acceleration methods used on desktops, such as
parallelization through the CPU and GPU, can also be applied to mobile devices.
• The tracking and pose estimation method in our AR system requires the object of
interest to stay in the camera's FOV through the entire process. However, this is not
always the case, particularly in outdoor applications where occlusions and loss of
view angle happen. A feature map that stores detected features, such as the ones
in PTAM [56] and MonoSLAM [31], can be created and combined with optical flow
algorithms [59] to address this issue. As the system moves around the scene, the
feature map dynamically updates itself; even when occlusions occur, a lost feature
can easily be retrieved from a different angle in the map.
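The core of the optical flow algorithms mentioned above can be sketched as a single Lucas-Kanade step. This is a minimal one-level, one-iteration version on a synthetic image pair; pyramids, per-feature windows and the feature map itself are omitted:

```python
import numpy as np

def lucas_kanade(I1, I2):
    """One Lucas-Kanade step: least-squares solve of  grad(I1) . d = I1 - I2
    over the whole window, giving the (row, col) displacement d."""
    Iy, Ix = np.gradient(I1)                       # spatial image gradients
    It = I1 - I2                                   # temporal difference
    A = np.stack([Iy.ravel(), Ix.ravel()], axis=1)
    d, *_ = np.linalg.lstsq(A, It.ravel(), rcond=None)
    return d                                       # (dy, dx)

# synthetic pair: a Gaussian blob shifted by (0.4, 0.4) pixels between frames
y, x = np.mgrid[0:41, 0:41].astype(float)
blob = lambda dy, dx: np.exp(-((y - 20 - dy) ** 2 + (x - 20 - dx) ** 2) / 8.0)
dy, dx = lucas_kanade(blob(0, 0), blob(0.4, 0.4))
```

The linearization is only valid for small displacements, which is why practical trackers iterate this step over an image pyramid.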
• Finally, we plan to compare our AR system with other markerless AR applications,
such as SfM-based [76, 90], SLAM-based [31], model-based [4, 27, 65] and hybrid
[59, 103] techniques, in terms of speed, flexibility, robustness and portability.
References
[1] A. Alahi, R. Ortiz, and P. Vandergheynst. FREAK: Fast retina keypoint. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 510–517, 2012.
[2] J. Arrasvuori, J. A. Holm, and A. J. Eronen. Personal augmented reality advertising, Feb. 4 2014. US Patent 8,644,842.
[3] B. Babenko, M.-H. Yang, and S. Belongie. Robust Object Tracking with Online Multiple Instance Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8):1619–1632, 2011.
[4] H. Bae, M. Golparvar-Fard, and J. White. High-precision vision-based mobile augmented reality system for context-aware architectural, engineering, construction and facility management (AEC/FM) applications. Visualization in Engineering, 1:3, 2013.
[5] I. Barandiaran, C. Paloc, and M. Grana. Real-time optical markerless tracking for augmented reality applications. Journal of Real-Time Image Processing, 5:129–138, 2010.
[6] A. Barry, J. Trout, P. Debenham, and G. Thomas. Augmented reality in a public space: The Natural History Museum, London. Computer, (7):42–47, 2012.
[7] M. Baudelet. Laser Spectroscopy for Sensing: Fundamentals, Techniques and Applications. Elsevier, 2014.
[8] H. Bay, A. Ess, T. Tuytelaars, and L. V. Gool. Speeded-up robust features (SURF). Computer vision and image . . . , 2008.
[9] J. L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975.
[10] S. T. Birchfield, B. Natarajan, and C. Tomasi. Correspondence as energy-based segmentation. Image and Vision Computing, 25(8):1329–1340, 2007.
[11] C. Botella, J. Breton-Lopez, S. Quero, R. M. Banos, A. Garcia-Palacios, I. Zaragoza, and M. Alcaniz. Treating cockroach phobia using a serious game on a mobile phone and augmented reality exposure: A single case study. Computers in Human Behavior, 27(1):217–227, 2011.
[12] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 23(11):1222–1239, 2001.
[13] G. Bradski. The OpenCV library (2000). Dr. Dobb's Journal of Software Tools, 2000.
[14] M. Brown and R. Szeliski. Multi-image feature matching using multi-scale oriented patches. (December 2004), 2008.
[15] M. Calonder, V. Lepetit, M. Ozuysal, T. Trzcinski, C. Strecha, and P. Fua. BRIEF: Computing a Local Binary Descriptor very Fast. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7):1281–1298, Nov. 2012.
[16] J. Canny. A Computational Approach to Edge Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-8(6), 1986.
[17] J. Carmigniani, B. Furht, M. Anisetti, P. Ceravolo, E. Damiani, and M. Ivkovic. Augmented reality technologies, systems and applications. Multimedia Tools and Applications, 51:341–377, 2011.
[18] K.-E. Chang, C.-T. Chang, H.-T. Hou, Y.-T. Sung, H.-L. Chao, and C.-M. Lee. Development and behavioral pattern analysis of a mobile guide system with augmented reality for painting appreciation instruction in an art museum. Computers & Education, 71:185–197, 2014.
[19] C.-Y. Chen, B. R. Chang, and P.-S. Huang. Multimedia augmented reality information system for museum guidance. Personal and Ubiquitous Computing, 18(2):315–322, 2014.
[20] S. Chen, Y. Li, and J. Zhang. Realtime structured light vision with the principle of unique color codes. In Robotics and Automation, 2007 IEEE International Conference on, pages 429–434. IEEE, 2007.
[21] S. Chen, Y. Li, and J. Zhang. Vision processing for realtime 3-D data acquisition based on coded structured light. Image Processing, IEEE Transactions on, 17(2):167–176, Feb 2008.
[22] W.-C. Chen, Y. Xiong, J. Gao, N. Gelfand, and R. Grzeszczuk. Efficient extraction of robust image features on mobile devices. In Proceedings of the 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, pages 1–2. IEEE Computer Society, 2007.
[23] K.-H. Cheng and C.-C. Tsai. Children and parents' reading of an augmented reality picture book: Analyses of behavioral patterns and cognitive attainment. Computers & Education, 72:302–312, 2014.
[24] R. Clouth. Mobile Augmented Reality as a Control Mode for Real-time Music Systems. mtg.upf.edu, 2013.
[25] W. Commons. 3D scanning. https://commons.wikimedia.org/wiki/Category:3D_scanning#/media/File:3-proj2cam.svg.
[26] A. Comport, E. Marchand, and F. Chaumette. A real-time tracker for markerless augmented reality. In Mixed and Augmented Reality, 2003. Proceedings. The Second IEEE and ACM International Symposium on, pages 36–45, Oct 2003.
[27] A. I. Comport, E. Marchand, M. Pressigout, and F. Chaumette. Real-time markerless tracking for augmented reality: the virtual visual servoing framework. Visualization and . . . , 12(4):615–628, 2006.
[28] Y. Cui, S. Schuon, D. Chan, S. Thrun, and C. Theobalt. 3D shape scanning with a time-of-flight camera. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 1173–1180. IEEE, 2010.
[29] B. Curless and M. Levoy. A volumetric method for building complex models from range images. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pages 303–312. ACM, 1996.
[30] A. J. Danker and A. Rosenfeld. Blob detection by relaxation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 3(1):79–92, 1981.
[31] A. Davison and I. Reid. MonoSLAM: Real-time single camera SLAM. Pattern Analysis and . . . , 2007.
[32] F. De Crescenzio, M. Fantini, F. Persiani, L. Di Stefano, P. Azzari, and S. Salti. Augmented reality for aircraft maintenance training and operations support. Computer Graphics and Applications, IEEE, 31(1):96–101, 2011.
[33] B. J. Dixon, M. J. Daly, H. Chan, A. D. Vescan, I. J. Witterick, and J. C. Irish. Surgeons blinded by enhanced navigation: the effect of augmented reality on attention. Surgical Endoscopy, 27(2):454–461, 2013.
[34] S. O. Elberink and G. Vosselman. Quality analysis on 3D building models reconstructed from airborne laser scanning data. ISPRS Journal of Photogrammetry and Remote Sensing, 66(2):157–165, 2011.
[35] G. M. Farinella, S. Battiato, and R. Cipolla. Advanced Topics in Computer Vision. Springer London, London, 2013.
[36] P. Fechteler and P. Eisert. Adaptive color classification for structured light systems. In Computer Vision and Pattern Recognition Workshops, 2008. CVPRW '08. IEEE Computer Society Conference on, pages 1–7, June 2008.
[37] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient belief propagation for early vision. International Journal of Computer Vision, 70:41–54, 2006.
[38] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
[39] B. Furht. Handbook of Augmented Reality. Springer Science & Business Media, 2011.
[40] X.-S. Gao, X.-R. Hou, J. Tang, and H.-F. Cheng. Complete solution classification for the perspective-three-point problem. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 25(8):930–943, 2003.
[41] J. Geng. Structured-light 3D surface imaging: a tutorial, Mar. 2011.
[42] M. Goesele, N. Snavely, B. Curless, H. Hoppe, and S. M. Seitz. Multi-view stereo for community photo collections. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8. IEEE, 2007.
[43] N. Haouchine, J. Dequidt, I. Peterlik, E. Kerrien, M.-O. Berger, and S. Cotin. Image-guided simulation of heterogeneous tissue deformation for augmented reality during hepatic surgery. In Mixed and Augmented Reality (ISMAR), 2013 IEEE International Symposium on, pages 199–208. IEEE, 2013.
[44] C. Harris and M. Stephens. A Combined Corner and Edge Detector. Proceedings of the Alvey Vision Conference 1988, pages 147–151, 1988.
[45] C. Hilditch. Comparison of thinning algorithms on a parallel processor. Image and Vision Computing, 1(3):115–132, 1983.
[46] B. Huang and Y. Tang. Fast 3D reconstruction using one-shot spatial structured light. In Systems, Man and Cybernetics (SMC), 2014 IEEE International Conference on, pages 531–536. IEEE, 2014.
[47] S. Ieiri, M. Uemura, K. Konishi, R. Souzaki, Y. Nagao, N. Tsutsumi, T. Akahoshi, K. Ohuchida, T. Ohdaira, M. Tomikawa, et al. Augmented reality navigation system for laparoscopic splenectomy in children based on preoperative CT image using optical tracking device. Pediatric Surgery International, 28(4):341–346, 2012.
[48] B. Jamali, A. Sadeghi-Niaraki, and R. Arasteh. Application of geospatial analysis and augmented reality visualization in indoor advertising. 2015.
[49] A. Kaehler and G. Bradski. Learning OpenCV. 2013.
[50] Z. Kalal, K. Mikolajczyk, and J. Matas. Tracking-Learning-Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7):1409–1422, July 2012.
[51] H. Kato and M. Billinghurst. Marker tracking and HMD calibration for a video-based augmented reality conferencing system. In Augmented Reality, 1999. (IWAR '99) Proceedings. 2nd IEEE and ACM International Workshop on, pages 85–94. IEEE, 1999.
[52] H. Kawasaki, R. Furukawa, R. Sagawa, and Y. Yagi. Dynamic scene shape reconstruction using a single structured light pattern. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8, June 2008.
[53] M. Kazhdan, M. Bolitho, and H. Hoppe. Poisson surface reconstruction. In Proceedings of the fourth Eurographics symposium on Geometry processing, 2006.
[54] J. Keil, M. Zollner, M. Becker, F. Wientapper, T. Engelke, and H. Wuest. The House of Olbrich, an augmented reality tour through architectural history. In Mixed and Augmented Reality-Arts, Media, and Humanities (ISMAR-AMH), 2011 IEEE International Symposium on, pages 15–18. IEEE, 2011.
[55] T. G. Kirner, F. M. V. Reis, and C. Kirner. Development of an interactive book with augmented reality for teaching and learning geometric shapes. In Information Systems and Technologies (CISTI), 2012 7th Iberian Conference on, pages 1–6. IEEE, 2012.
[56] G. Klein and D. Murray. Parallel tracking and mapping for small AR workspaces. In Mixed and Augmented Reality, 2007. ISMAR 2007. 6th IEEE and ACM International Symposium on, pages 225–234. IEEE, 2007.
[57] G. Klein and D. Murray. Parallel tracking and mapping on a camera phone. In Mixed and Augmented Reality, 2009. ISMAR 2009. 8th IEEE International Symposium on, pages 83–86. IEEE, 2009.
[58] R. J. Lapeer, S. J. Jeffrey, J. T. Dao, G. G. Garcia, M. Chen, S. M. Shickell, R. S. Rowland, and C. M. Philpott. Using a passive coordinate measurement arm for motion tracking of a rigid endoscope for augmented-reality image-guided surgery. The International Journal of Medical Robotics and Computer Assisted Surgery, 10(1):65–77, 2014.
[59] T. Lee and T. Hollerer. Multithreaded hybrid feature tracking for markerless augmented reality. IEEE Transactions on Visualization and Computer Graphics, 15(3):355–368, 2009.
[60] V. Lepetit, P. Lagger, and P. Fua. Randomized trees for real-time keypoint recognition. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2:775–781, 2005.
[61] V. Lepetit, F. Moreno-Noguer, and P. Fua. EPnP: An accurate O(n) solution to the PnP problem. International Journal of Computer Vision, 81(2):155–166, July 2009.
[62] S. Leutenegger, M. Chli, and R. Y. Siegwart. BRISK: Binary Robust Invariant Scalable Keypoints. Proceedings of the IEEE International Conference on Computer Vision, pages 2548–2555, 2011.
[63] M. Lhuillier and L. Quan. Match propagation for image-based modeling and rendering. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(8):1140–1146, 2002.
[64] M. Lhuillier and L. Quan. A quasi-dense approach to surface reconstruction from uncalibrated images. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(3):418–433, 2005.
[65] J. Lima, F. Simoes, L. Figueiredo, V. Teichrieb, J. Kelner, and I. Santos. Model based 3D tracking techniques for markerless augmented reality. In XI Symposium on Virtual and Augmented Reality, pages 37–47, 2009.
[66] J. Liu, J. M. White, and R. M. Summers. Automated detection of blob structures by Hessian analysis and object scale. Proceedings - International Conference on Image Processing, ICIP, pages 841–844, 2010.
[67] W. E. Lorensen and H. E. Cline. Marching cubes: A high resolution 3D surface construction algorithm. In ACM SIGGRAPH Computer Graphics, volume 21, pages 163–169. ACM, 1987.
[68] D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[69] S. P. McPherron, T. Gernat, and J. J. Hublin. Structured light scanning for high-resolution documentation of in situ archaeological finds. Journal of Archaeological Science, 36(1):19–24, 2009.
[70] A. Menozzi, B. Clipp, E. Wenger, J. Heinly, E. Dunn, H. Towles, J.-M. Frahm, and G. Welch. Development of vision-aided navigation for a wearable outdoor augmented reality system. In Position, Location and Navigation Symposium-PLANS 2014, 2014 IEEE/ION, pages 460–472. IEEE, 2014.
[71] P. Milgram, H. Takemura, A. Utsumi, and F. Kishino. Augmented reality: A class of displays on the reality-virtuality continuum. In Photonics for Industrial Applications, pages 282–292. International Society for Optics and Photonics, 1995.
[72] A. Ming and M. Huadong. A blob detector in color images. ACM International Conference on Image and Video Retrieval (CIVR), pages 364–370, 2007.
[73] H. P. Moravec. Obstacle avoidance and navigation in the real world by a seeing robot rover. Technical report, DTIC Document, 1980.
[74] D. Moreno and G. Taubin. Simple, accurate, and robust projector-camera calibration. In Proceedings - 2nd Joint 3DIM/3DPVT Conference: 3D Imaging, Modeling, Processing, Visualization and Transmission, 3DIMPVT 2012, pages 464–471, 2012.
[75] M. Muja and D. G. Lowe. Fast matching of binary features. Proceedings of the 2012 9th Conference on Computer and Robot Vision, CRV 2012, pages 404–410, 2012.
[76] R. Newcombe and A. Davison. Live dense reconstruction with a single moving camera. Computer Vision and Pattern . . . , 2010.
[77] T. Noll, A. Pagani, D. Stricker, T. Noell, and D. S. D. De. Markerless Camera Pose Estimation - An Overview. OASIcs-OpenAccess Series in Informatics, 2006:45–54, 2010.
[78] O. V. Olesen, R. R. Paulsen, L. Hojgaard, B. Roed, and R. Larsen. Motion tracking for medical imaging: a nonvisible structured light tracking approach. Medical Imaging, IEEE Transactions on, 31(1):79–87, 2012.
[79] H. Opower. Multiple view geometry in computer vision, volume 37. 2002.
[80] M. Ozuysal, P. Fua, and V. Lepetit. Fast keypoint recognition in ten lines of code. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2007.
[81] J. Pages, J. Salvi, and C. Matabosch. Robust segmentation and decoding of a grid pattern for structured light. In Pattern Recognition and Image Analysis, pages 689–696. Springer, 2003.
[82] S. C. Park and M. Chang. Reverse engineering with a structured light system. Computers & Industrial Engineering, 57(4):1377–1384, 2009.
[83] W. H. Press. Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press, 2007.
[84] L. Quan and Z. Lan. Linear n-point camera pose determination. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 21(8):774–780, 1999.
[85] G. Reitmayr and T. W. Drummond. Going out: Robust model-based tracking for outdoor augmented reality. Proceedings - ISMAR 2006: Fifth IEEE and ACM International Symposium on Mixed and Augmented Reality, pages 109–118, 2007.
[86] R. Roberto, D. Freitas, J. P. Lima, V. Teichrieb, and J. Kelner. ARBlocks: A concept for a dynamic blocks platform for educational activities. In Proceedings - 2011 13th Symposium on Virtual Reality, SVR 2011, pages 28–37, 2011.
[87] E. Rosten and T. Drummond. Machine learning for high-speed corner detection. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 3951 LNCS:430–443, 2006.
[88] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. ORB: An efficient alternative to SIFT or SURF. Proceedings of the IEEE International Conference on Computer Vision, pages 2564–2571, 2011.
[89] J. Salvi, S. Fernandez, T. Pribanic, and X. Llado. A state of the art in structured light patterns for surface profilometry. Pattern Recognition, 43(8):2666–2680, 2010.
[90] M. Salzmann and P. Fua. Deformable Surface 3D Reconstruction from Monocular Images, Sept. 2010.
[91] C. Schaeffer. A Comparison of Keypoint Descriptors in the Context of Pedestrian Detection: FREAK vs. SURF vs. BRISK. pages 1–5, 2012.
[92] D. Scharstein and R. Szeliski. Stereo matching with nonlinear diffusion. International Journal of Computer Vision, 28(2):155–174, 1998.
[93] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1-3):7–42, 2002.
[94] D. Scharstein and R. Szeliski. High-accuracy stereo depth maps using structured light. In Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, volume 1, pages I–195. IEEE, 2003.
[95] J. Shi and C. Tomasi. Good features to track. Computer Vision and Pattern Recognition, 1994. Proceedings CVPR '94., 1994 IEEE Computer Society Conference on, pages 593–600, 1994.
[96] G. Sun, C. Qiu, Q. Qi, K. Mu, and B. Wang. The implementation of a live interactive augmented reality game creative system. In Proceedings of the 2013 Fifth International Conference on Multimedia Information Networking and Security, pages 440–443. IEEE Computer Society, 2013.
[97] I. E. Sutherland. Three-dimensional data input by tablet. Proceedings of the IEEE, 62(4):453–461, 1974.
[98] R. Szeliski. Computer Vision: Algorithms and Applications. Computer, 5:832, 2010.
[99] V. Teichrieb, M. Lima, E. Lourenc, S. Bueno, J. Kelner, and I. H. F. Santos. A Survey of Online Monocular Markerless Augmented Reality. International Journal of Modeling and Simulation for the Petroleum Industry, 1(1):1–7, 2007.
[100] M. Torres, R. Jassel, and Y. Tang. Augmented reality using spatially multiplexed structured light. In Mechatronics and Machine Vision in Practice (M2VIP), 2012 19th International Conference, pages 385–390, Nov 2012.
[101] B. Triggs. Detecting Keypoints with Stable Position, Orientation and Scale under Illumination Changes. Framework, pages 100–113, 2004.
[102] R. Y. Tsai. A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. Robotics and Automation, IEEE Journal of, 3(4):323–344, 1987.
[103] A. Ufkes and M. Fiala. A markerless augmented reality system for mobile devices. In Proceedings - 2013 International Conference on Computer and Robot Vision, CRV 2013, pages 226–233, 2013.
[104] A. Ulusoy, F. Calakli, and G. Taubin. Robust one-shot 3D scanning using loopy belief propagation. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, pages 15–22, June 2010.
[105] T. van Aardenne-Ehrenfest and N. de Bruijn. Circuits and trees in oriented linear graphs. In Classic Papers in Combinatorics, pages 149–163. Springer, 1987.
[106] A. Varol, A. Shaji, M. Salzmann, and P. Fua. Monocular 3D reconstruction of locally textured surfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(6):1118–30, June 2012.
[107] A. Velten, T. Willwacher, O. Gupta, A. Veeraraghavan, M. G. Bawendi, and R. Raskar. Recovering three-dimensional shape around a corner using ultrafast time-of-flight imaging. Nature Communications, 3:745, 2012.
[108] L. Vincent and P. Soille. Watersheds in digital spaces: an efficient algorithm based on immersion simulations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(6):583–598, 1991.
[109] D. Wagner, G. Reitmayr, A. Mulloni, T. Drummond, and D. Schmalstieg. Real-time detection and tracking for augmented reality on mobile phones. Visualization and Computer Graphics, IEEE Transactions on, 16(3):355–368, 2010.
[110] C. Wang, N. Komodakis, and N. Paragios. Markov Random Field modeling, inference & learning in computer vision & image understanding: A survey. Computer Vision and Image Understanding, 117(11):1610–1627, Nov. 2013.
[111] Wikipedia. Retina Display. http://en.wikipedia.org/wiki/RetinaDisplay/.
[112] B. Williams, G. Klein, and I. Reid. Real-time SLAM relocalisation. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8. IEEE, 2007.
[113] Y. Xu and D. G. Aliaga. Modeling repetitive motions using structured light. Visualization and Computer Graphics, IEEE Transactions on, 16(4):676–689, 2010.
[114] N. Yabuki, K. Miyashita, and T. Fukuda. An invisible height evaluation system for building height regulation to preserve good landscapes using augmented reality. Automation in Construction, 20(3):228–235, 2011.
[115] H. Yang, G. Welch, and M. Pollefeys. Illumination insensitive model-based 3D object tracking and texture refinement. Proceedings - Third International Symposium on 3D Data Processing, Visualization, and Transmission, 3DPVT 2006, pages 869–876, 2007.
[116] H. Yuen, J. Princen, J. Illingworth, and J. Kittler. Comparative study of Hough Transform methods for circle finding. Image and Vision Computing, 8(1):71–77, 1990.
[117] L. Zhang, B. Curless, and S. M. Seitz. Rapid shape acquisition using color structured light and multi-pass dynamic programming. In 3D Data Processing Visualization and Transmission, 2002. Proceedings. First International Symposium on, pages 24–36. IEEE, 2002.
[118] Z. Zhang. A flexible new technique for camera calibration. Pattern Analysis and Machine Intelligence, IEEE . . . , 1998, 2000.
[119] F. Zheng, R. Schubert, and G. Weich. A general approach for closed-loop registration in AR. Proceedings - IEEE Virtual Reality, pages 47–50, 2013.
[120] F. Zhou, H. B.-L. Duh, and M. Billinghurst. Trends in augmented reality tracking, interaction and display: A review of ten years of ISMAR. Proceedings - 7th IEEE International Symposium on Mixed and Augmented Reality 2008, ISMAR 2008, pages 193–202, 2008.
Appendix A
Hamming-Distance-Based Amendment
1: for i = 0 → ℓ(P) − 1 do
2:   for d = 0 → 2 do
3:     Initialize
4:     for q = 0 → 3 do
5:       πv_dq = 0
6:     end for
7:   end for
8:   λ = i
9:   Assign a value to each element of Πv
10:   while χ(ϑ(vλ)) = 0 and Nbeast(vλ) ≠ NULL do
11:     πv_11 = ξv(vλ)
12:     if Nbwest(vλ) ≠ NULL then
13:       πv_10 = ξv(Nbwest(vλ))
14:     end if
15:     if Nbnorth(vλ) ≠ NULL then
16:       πv_01 = ξv(Nbnorth(vλ))
17:     end if
18:     if Nbeast(vλ) ≠ NULL then
19:       πv_12 = ξv(Nbeast(vλ))
20:     end if
21:     if Nbsouth(vλ) ≠ NULL then
22:       πv_21 = ξv(Nbsouth(vλ))
23:     end if
24:     if Nbnorth(Nbwest(vλ)) ≠ NULL then
25:       πv_00 = ξv(Nbnorth(Nbwest(vλ)))
26:     else if Nbwest(Nbnorth(vλ)) ≠ NULL then
27:       πv_00 = ξv(Nbwest(Nbnorth(vλ)))
28:     end if
29:     if Nbnorth(Nbeast(vλ)) ≠ NULL then
30:       πv_02 = ξv(Nbnorth(Nbeast(vλ)))
31:     else if Nbeast(Nbnorth(vλ)) ≠ NULL then
32:       πv_02 = ξv(Nbeast(Nbnorth(vλ)))
33:     end if
34:     if Nbsouth(Nbeast(vλ)) ≠ NULL then
35:       πv_22 = ξv(Nbsouth(Nbeast(vλ)))
36:     else if Nbeast(Nbsouth(vλ)) ≠ NULL then
37:       πv_22 = ξv(Nbeast(Nbsouth(vλ)))
38:     end if
39:     if Nbsouth(Nbwest(vλ)) ≠ NULL then
40:       πv_20 = ξv(Nbsouth(Nbwest(vλ)))
41:     else if Nbwest(Nbsouth(vλ)) ≠ NULL then
42:       πv_20 = ξv(Nbwest(Nbsouth(vλ)))
43:     end if
44:     if Nbeast(Nbeast(vλ)) ≠ NULL then
45:       πv_13 = ξv(Nbeast(Nbeast(vλ)))
46:     end if
47:     if Nbnorth(Nbeast(Nbeast(vλ))) ≠ NULL then
48:       πv_03 = ξv(Nbnorth(Nbeast(Nbeast(vλ))))
49:     else if Nbeast(Nbnorth(Nbeast(vλ))) ≠ NULL then
50:       πv_03 = ξv(Nbeast(Nbnorth(Nbeast(vλ))))
51:     end if
52:     if Nbsouth(Nbeast(Nbeast(vλ))) ≠ NULL then
53:       πv_23 = ξv(Nbsouth(Nbeast(Nbeast(vλ))))
54:     else if Nbeast(Nbsouth(Nbeast(vλ))) ≠ NULL then
55:       πv_23 = ξv(Nbeast(Nbsouth(Nbeast(vλ))))
56:     end if
57:     Πv = | πv_00  πv_01  πv_02  πv_03 |
             | πv_10  πv_11  πv_12  πv_13 |
             | πv_20  πv_21  πv_22  πv_23 |
58:     s*_j = arg min_{sj} H(Πv, sj)
59:     vλ = Nbeast(vλ)
60:     Create virtual neighbors with color labels if needed
61:     if Nbnorth(vλ) = NULL then
62:       rv_n = a*_1j
63:     end if
64:     if Nbsouth(vλ) = NULL then
65:       rv_s = a*_1j
66:     end if
67:     if Nbwest(vλ) = NULL then
68:       rv_w = a*_0j
69:     end if
70:     if Nbeast(vλ) = NULL then
71:       rv_e = a*_2j
72:     end if
73:   end while
74:   vλ = Nbeast(vλ)
75: end for
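The core of this amendment can be condensed into a loose sketch (an assumed simplification of the algorithm, not a transcription of it): the 3-by-4 window of observed color labels, with -1 marking missing neighbors, is compared against candidate codeword windows of the projected pattern, and the closest one under a wildcard-tolerant Hamming distance supplies the missing labels.

```python
def hamming(window, codeword):
    """Hamming distance between two 3x4 label grids, skipping missing (-1)
    observations so that absent neighbors never count as mismatches."""
    return sum(
        1
        for wrow, crow in zip(window, codeword)
        for w, c in zip(wrow, crow)
        if w != -1 and w != c
    )

def best_codeword(window, codewords):
    """Return the candidate codeword window minimizing the Hamming distance."""
    return min(codewords, key=lambda s: hamming(window, s))
```

Once the best codeword is found, its entries at the missing positions play the role of the "virtual neighbors with color labels" created in lines 60 to 72 above.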