
Int J Comput Vision manuscript No. (will be inserted by the editor)

Egomotion Estimation Using Assorted Features

Vivek Pradeep · Jongwoo Lim

Received: date / Accepted: date

Abstract We propose a novel minimal solver for recovering camera motion across two views of a calibrated stereo rig. The algorithm can handle any assorted combination of point and line features across the four images and facilitates a visual odometry pipeline that is enhanced by well-localized and reliably-tracked line features while retaining the well-known advantages of point features. The mathematical framework of our method is based on trifocal tensor geometry and a quaternion representation of rotation matrices. A simple polynomial system is developed from which camera motion parameters may be extracted more robustly in the presence of severe noise, as compared to the conventionally employed direct linear/subspace solutions. This is demonstrated with extensive experiments and comparisons against the 3-point and line-SfM algorithms.

Keywords Visual odometry · SLAM · structure from motion · tracking

1 Introduction

Visual odometry is the process of analyzing a sequence of images to determine the position and orientation of a moving camera platform mounted on a mobile robot or carried by a human user. Dead-reckoning techniques that employ inertial navigation sensors (accelerometers and gyroscopes) or wheel-encoders provide good estimates of self-motion only over short distances due to the problem of drift [1].

V. Pradeep
University of Southern California, Los Angeles, CA, USA
E-mail: [email protected]

J. Lim
Honda Research Institute, Mountain View, CA, USA
E-mail: [email protected]

Fig. 1 (Top) Lines and points tracked in our system. (Below) Estimated visual odometry superimposed on the floor-plan of an indoor office environment (red: 3-point algorithm, green: proposed algorithm using assorted combinations of point and line features). The 3-point algorithm estimates a path that goes 'through the wall' towards the end and incorrectly places the camera beyond the bounds of the room.

A small, constant error introduced at one time instant is propagated and grows unbounded over time through the cumulative process of integration. These errors might be due to wheel-slippage, abrupt changes in surface geometry or the highly non-linear dynamics of the underlying platform. Visual odometry provides more robust estimates as it requires only image data corresponding to the environment being traversed.


While there are many existing approaches to visual odometry, the majority of them [2,3] rely on the following sequence of processing steps:

1. Acquisition of input images at different time instants.
2. Extraction of 'interest point' or salient features from image data.
3. Matching of features across image data.
4. Filtering out incorrect matches and refining feature matches.
5. Initial estimation of camera motion from matched feature sets.
6. Camera motion refinement by minimizing a cost function, using the initial estimate to initialize the minimization.

Naturally, the number of features observed, the noise level (in feature localization as well as tracking) and their distribution all have a major impact on the final motion estimate. Due to their abundance in natural scenes, salient corners in image data have been primarily used as interest points in most visual odometry systems. The development of novel techniques for extracting and matching these features (such as SIFT [4]) and breakthrough minimal solvers for 3D pose estimation using point correspondences have led to robust and efficient visual odometry systems [2,5]. In practical settings, however, it has been empirically observed [6] that leveraging image lines instead of points can lead to improved performance in detection/matching (due to multi-pixel support), occlusion handling and dealing with T-junctions. Furthermore, the abundance of edge features in man-made environments (cityscapes and indoor structures) can be exploited to reduce tracking failures significantly, thereby minimizing situations where odometry systems can get 'lost', and also helps to reconstruct high-level scene information. On the other hand, it is also well known that the constraints imposed by line correspondences on camera pose are much weaker than those provided by points [7], and there is considerable ambiguity when dealing with tangent lines from curved surfaces like cylindrical columns. In [8], an interesting exposition of the complementary noise statistics and failure modes of line-based and point-based feature trackers is provided, and a robust tracker is built by fusing both systems.

Given these conditions, it might be desirable to have a visual odometry algorithm that can incorporate any combination of point and line features as available in the image data and yield a camera motion estimate using the combination set that generates the most accurate solution (see figure 1).

For a real-time and robust implementation, it is also preferable to have a unified framework that, independent of feature type, computes the six degree-of-freedom motion from minimal sets over the available data. We describe a novel and robust minimal solver for performing online visual odometry with a stereo rig that is ideal for use in such a hypothesize-and-test setting [9]. The proposed method can compute the underlying camera motion given any arbitrary, mixed combination of point and line correspondences across two stereo views. This facilitates a hybrid visual odometry pipeline that is enhanced by well-localized and reliably-tracked line features while retaining the well-known advantages of point features. Utilizing trifocal tensor geometry and a quaternion representation of rotation matrices, we develop a polynomial system from which camera motion parameters can be robustly extracted in the presence of noise. We show how the more popular approach of using direct linear/subspace techniques fails in this regard and demonstrate improved performance using our formulation with extensive experiments and comparisons against the 3-point and line-SfM algorithms. An earlier description of this algorithm with preliminary results first appeared in [10].

Since this work is motivated by an application for autonomous navigation, the visual input and underlying geometry of our algorithm stem from a calibrated stereo pair. This simplifies the task of motion recovery and facilitates scale observability. Given any set of matches containing at least three points, or two lines, or two points and one line across two stereo views, our algorithm can compute the underlying camera motion using the same solver. This approach is much more elegant than simply integrating the state-of-the-art line-based and point-based systems, and it enables the evaluation of associated costs in a unified RANSAC setting. Using a pair of calibrated trifocal tensors, we form a low-degree polynomial system of equations that enforces the orthonormality constraint by representing rotations with unit quaternions. This is a different algebraic approach from the point-based minimal solvers of [11], where the orthonormality constraints are explicitly enforced. The quaternion representation has a significant impact on the noise performance of our algorithm and is an interesting result when taking into account the well-documented problem of recovering consistent camera motion from noisy trifocal tensors. The key contributions of this paper are:

– A novel quaternion-based geometrical formulation and polynomial solver for any combination of point or line feature correspondences over two stereo views for robustly computing camera motion.

– Extensive experiments using synthetic and real data demonstrating the usefulness of using both point and line features in visual odometry, as opposed to a single feature type.

The rest of this paper is organized as follows. In Section 2, we briefly review relevant work on computing camera motion using different feature types. We also reiterate how errors in feature tracking propagate to errors in camera pose estimation. The geometrical framework of our algorithm is presented in Section 3, and two methods that mirror popular approaches to solving such problems are presented in Section 4. A brief discussion of why these methods are not successful in the presence of significant noise is followed by a derivation of our quaternion-based polynomial solver in Section 5. In Section 6, experimental results are presented that demonstrate improved performance with our proposed algorithm. Finally, we conclude and present directions for future work in Section 7.

2 Related Work

2.1 Visual Odometry with Point Features

While one can perform dense tracking of image features (where almost every pixel in the image is tracked to the next image in the sequence), sparse feature tracking (where a relatively smaller set of a few hundred features is detected and tracked) is more stable, requires less computation, and is adequate for camera motion estimation. The three-point method [12,2] is currently the most popular algorithm for performing visual odometry (from feature points) with stereo cameras. For such a setup, tracking only a few distinct features is more relevant as, in realistic conditions, stereo triangulation does not provide depth data for every pixel in the image. However, since the polynomial constraint for deriving the motion parameters is set up using the triangle law of cosines, this approach works only for a configuration of three points in general position and is therefore used in a RANSAC [9] framework for establishing support. Other methods, which address monocular schemes too, solve polynomial equation systems that are established from geometrical constraints robustified by enforcing algebraic conditions (like rotation matrix orthonormality). Several flavors of such algorithms [13,11,14] exist, and these vary in mathematical structure, assumptions on available camera calibration and even in the minimal number of correspondences required for a feasible solution. Research in this area has also resulted in the development of numerically efficient and stable methods for solving the corresponding polynomial systems; popular techniques include the Groebner basis [15], the polynomial eigenvalue problem (PEP) [16] and the hidden-variable method [17].

To summarize, visual odometry employing the three-point algorithm is quite popular in the computer vision and robotics communities when using stereo camera setups. However, point feature tracking is challenging in low-textured areas (indoor structures such as corridors) and under varying illumination. In this paper, the performance of the algorithm we propose is compared against the three-point method, and in situations where sufficient points are available, we endeavor for the accuracy of the two approaches to be at least comparable.

2.2 Visual Odometry with Line Features

Traditionally, line features have been employed in structure-from-motion algorithms using the multifocal tensor framework [18,19]. The trifocal tensor is a 3×3×3 cube operator that expresses the (projective) geometric constraints between three views independent of scene structure. It can be computed given at least 13 line or 7 point correspondences. In general, the trifocal tensor has 27 parameters (26 up to a scale), but only 18 degrees of freedom (up to projective ambiguity), and the remaining constraints must be enforced to obtain a consistent solution. [20] introduces a cubic polynomial system for extracting the tensor from 6 points only, and a method to recover the underlying camera matrices is presented in [7]. The latter also introduces, for the first time, a closed-form linear solution from a combination of point or line features. For a calibrated setting, these matrices can be decomposed to obtain the camera orientation and perform visual odometry. The four-view extension of this concept, called the quadrifocal tensor, was investigated in [21]. From a purely geometrical standpoint, the work described in [22] is the most similar to our approach (although it does not address the problem of a minimal solver for mixed combinations of features), as it exploits the known pose between the stereo pair in the quadrifocal tensor to enforce constraints between image intensities of adjacent stereo pairs. It is evident from all these works, however, that enforcing the non-linear dependencies within the tensor indices requires substantial book-keeping [23,24] and is ultimately too cumbersome for estimating a six degrees-of-freedom motion. Another approach, which is algebraically similar in methodology to our work and to that of [11] for point-feature based odometry, is described in [25], which constructs a small low-degree polynomial system and explicitly enforces orthonormality constraints.

A primary hurdle in implementing line-based visual odometry systems has been the lack of reliable and, at the same time, efficient line detection and tracking algorithms. As part of our experimental validation, we also describe our CUDA-based line tracking system to complement our algorithm. While not a primary contribution of this paper, we feel that this can serve as a guideline to other researchers for incorporating lines into their visual odometry pipelines.

2.3 Visual Odometry with Assorted Features

A unified representation for points, lines and planes, by representing linear features as Gaussian density functions, was presented in [26]. However, as mentioned by the authors themselves, this representation falls short of being directly applicable to a motion-and-structure recovery framework due to unresolved issues in defining join operations and performing inverse projections. Some attempts have been made towards integrating point and line features for the perspective pose estimation problem [27,28]. A factorization-based multi-frame SfM algorithm also utilizing point and line features is presented in [29]. The proposed method, however, is restricted to cases where the translation is very small and involves iterative recomputation of rotation and translation parameters. To the best of our knowledge, therefore, no unified closed-form formulation for dealing with point and line features exists in the context of real-time visual odometry with RANSAC, and for multifocal approaches there is room for improvement in terms of noise performance. We endeavor to present a polynomial solver for performing visual odometry that is essentially transparent to the kind of input features (points or lines). As described in the next section, the complementary noise characteristics of the two kinds of feature sets help provide resilience to the odometry system under challenging circumstances.

2.4 Noise Characterization

The algorithms developed in this paper are motivated by errors in feature tracking, localization and representation that can be modeled as noise in the visual odometry estimation process. While point features are robust to large motions, their appearance can change substantially under large aspect changes or motion blur due to jerky camera motions. SIFT and SURF features, for instance, have been reported to have viewpoint invariance of only up to 30 degrees in practice. If using only optic flow (for a real-time implementation), feature tracking can be less resilient to large camera displacements. Furthermore, an image is essentially a discretized representation of the projection of a real-world scene onto the camera sensor. Naturally, there are quantization and round-off errors associated with pixelization. Feature detection techniques such as SIFT rely on finding points of extrema in local neighborhoods, but in the simplest case, these neighborhoods are discrete. In general, sub-pixel refinement is required to find the true points of maxima/minima by employing some form of interpolation. The errors due to pixel quantization, however, are independent and can be modeled using Gaussian distributions. The camera pose estimated from a collection of such 'noisy' features also follows a Gaussian distribution, by direct application of the central limit theorem.

Line features are highly invariant to illumination changes. While a desirable property for good features, this also makes the task of matching across views very difficult. Due to the stability of the edges, the associated descriptor (typically the gradient pattern of the image intensity) is not unique, and other techniques need to be employed. It is evident from such a strategy that incorrect matches will often be strongly correlated, and the resulting camera pose distribution will not approach a Gaussian. In [8], this distribution is modeled as a two-component Gaussian mixture model (GMM).

A visual odometry algorithm that does not consider the particular feature type, but only the level of noise in the input feature sets, will best exploit the complementary behavior of point and line features. This can be accomplished by randomly selecting minimal sets over all detected point and line features and using them in a hypothesize-and-test setting (such as RANSAC) to select the best solution. Naturally, the minimal solver in this setting should be able to handle any mixed combination of features. In the remainder of this paper, this problem is formally posed and various solutions are described. The quaternion-based direct solver we propose provides the best performance in terms of robustness and is experimentally validated against the popularly used line- and point-based visual odometry algorithms.
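To make this hypothesize-and-test strategy concrete, the following is a minimal Python sketch of a RANSAC loop over mixed point/line minimal sets; solve_minimal and residuals are hypothetical placeholders standing in for the solver of Section 5 and a reprojection-error measure, and are not part of the original system.

```python
import numpy as np

def mixed_feature_ransac(points, lines, solve_minimal, residuals,
                         n_min=3, iters=500, thresh=1.0, rng=None):
    """Hypothesize-and-test over mixed point/line minimal sets.

    `solve_minimal` and `residuals` are hypothetical placeholders for the
    minimal solver of Section 5 and a reprojection-error function."""
    rng = rng if rng is not None else np.random.default_rng()
    features = [('point', f) for f in points] + [('line', f) for f in lines]
    best, best_support = None, -1
    for _ in range(iters):
        idx = rng.choice(len(features), size=n_min, replace=False)
        sample = [features[i] for i in idx]
        for R, t in solve_minimal(sample):          # solver may return several candidates
            err = residuals(R, t, features)         # one error value per feature
            support = int((err < thresh).sum())
            if support > best_support:
                best, best_support = (R, t, err < thresh), support
    return best                                     # (R, t, inlier mask) or None
```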

3 Problem Formulation

Notation. Unless otherwise stated, a 3D point in space is represented by a homogeneous 4-vector X = [X_1, X_2, X_3, X_4]^T ∈ R^4 and its projection on the image plane of camera i by a homogeneous 3-vector x^i = [x^i_1, x^i_2, x^i_3]^T ∈ R^3. Similarly, any projection of a 3D line L on the image plane is denoted by the parameters l^i = [l^i_1, l^i_2, l^i_3]^T. A projective camera is given by a 3×4 matrix K [R t], with K being the 3×3 internal calibration matrix, and the 3×3 rotation matrix R and 3×1 translation vector t representing the exterior




Fig. 2 (a) Stereo geometry for two views and the point-line-line configuration, and (b) point-point-point and (c) line-line-line configurations in terms of two point-line-line configurations. Note that a point-point-point in a generic trifocal setup corresponds to 4 point-line-lines, but in a stereo rig, only 2 are linearly independent.

camera orientation and position. In the remainder of this paper, without loss of generality, we assume image coordinates projected onto the z = 1 plane, and therefore K is set to be the identity matrix I. The trifocal tensor T = {T_i}, i = 1, 2, 3, with each 3×3 submatrix denoted by T_i, is the geometric object of interest in this work. The [·]_× notation denotes the skew-symmetric matrix forming a vector cross-product. We also use A^(i) to represent the i-th column of a matrix A.

3.1 Calibrated Stereo Rig Geometry

A binocular stereo rig (see figure 2) can be represented in its canonical form, with the left and right camera matrices P_1 = [I 0] and P_2 = [R_0 t_0] respectively. Here, (R_0, t_0) encodes the rigid geometry of the rig, and is fixed and known a priori by calibration. After undergoing arbitrary rotation R and translation t due to some unknown motion, the corresponding cameras in the same coordinate system can be written as:

P_3 = [R t],    (1)
P_4 = [R_0 R   R_0 t + t_0].    (2)

The goal of the visual odometry algorithm is to estimate the motion (R, t). These equations can be applied recursively to every pair of stereo views before and after motion, and by concatenating the inter-frame motions, the complete trajectory of the camera may be recovered. These representations can be simplified further by noting that for a rectified stereo pair (without loss of generality, as any stereo pair can be rectified), R_0 = I and t_0 = [t_x, 0, 0]^T, where t_x is given by the baseline.
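As an illustration of equations (1)-(2), the following is a short Python sketch that builds the four camera matrices of a rectified rig before and after a motion (R, t); the 12 cm baseline is only an example value, not one used in the paper.

```python
import numpy as np

def rig_cameras(R, t, tx=0.12):
    """Camera matrices of a rectified stereo rig before and after a motion (R, t),
    following eqs. (1)-(2); the 12 cm baseline tx is only an example value."""
    R0, t0 = np.eye(3), np.array([tx, 0.0, 0.0])
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])     # left camera, before motion
    P2 = np.hstack([R0, t0[:, None]])                 # right camera, before motion
    P3 = np.hstack([R, t[:, None]])                   # left camera, after motion  (eq. 1)
    P4 = np.hstack([R0 @ R, (R0 @ t + t0)[:, None]])  # right camera, after motion (eq. 2)
    return P1, P2, P3, P4
```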

3.2 Review of the Trifocal Tensor

Consider a canonical three-camera configuration given by P_1 = [I 0], P_2 = [R_A t_A] and P_3 = [R_B t_B]. As described in [30], the trifocal tensor for these three views, T = {T_i}, is given by

T_i = R_A^(i) t_B^T − t_A (R_B^(i))^T,    (3)

where R^(i) denotes the i-th column of a matrix R. Let us assume that we have determined a line correspondence l^1 ⇔ l^2 ⇔ l^3 and/or a point correspondence x^1 ⇔ x^2 ⇔ x^3 across P_1, P_2, P_3 respectively. The relationships between these features and the tensor can be expressed by the following equations:

l^1_i = (l^2)^T T_i l^3,    (4)

[x^2]_× ( Σ_i x^1_i T_i ) [x^3]_× = 0_{3×3}.    (5)

In the general case, a single point triplet generates four, and a line-line-line correspondence provides two, linearly independent constraints on the tensor parameters. Geometrically, both these configurations are best understood in terms of a hypothetical point-line-line correspondence x^1 ⇔ l^2 ⇔ l^3 (see figure 2(a)). The constraint expressed by this arrangement is simply the incidence relationship between x^1 and the line transferred from l^3 via the homography induced by l^2. For a line triplet, therefore, each of any two points (figure 2(c)) on the line in the first image establishes a point-line-line configuration giving two equations, while for a case of three matching points (figure 2(b)), we have four possible permutations obtained by choosing any two linearly independent lines spanned by the points in P_2 and P_3.
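A small numerical check of equations (3)-(5) can be written in a few lines of Python; the random cameras, point and line below are synthetic stand-ins introduced only for illustration.

```python
import numpy as np
from scipy.spatial.transform import Rotation

rng = np.random.default_rng(0)
RA, tA = Rotation.random(random_state=1).as_matrix(), rng.normal(size=3)
RB, tB = Rotation.random(random_state=2).as_matrix(), rng.normal(size=3)

# trifocal tensor slices T_i = R_A^(i) t_B^T - t_A (R_B^(i))^T   (eq. 3)
T = [np.outer(RA[:, i], tB) - np.outer(tA, RB[:, i]) for i in range(3)]

P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2, P3 = np.hstack([RA, tA[:, None]]), np.hstack([RB, tB[:, None]])
project = lambda P, X: P @ np.append(X, 1.0)
skew = lambda v: np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])

# point-point-point constraint (eq. 5)
X = rng.normal(size=3) + np.array([0.0, 0.0, 5.0])
x1, x2, x3 = (project(P, X) for P in (P1, P2, P3))
M = skew(x2) @ sum(x1[i] * T[i] for i in range(3)) @ skew(x3)
print(np.allclose(M, 0, atol=1e-9))                  # True

# line-line-line constraint (eq. 4): l^1_i = (l^2)^T T_i l^3, up to scale
Y = rng.normal(size=3) + np.array([0.0, 0.0, 5.0])   # second 3D point spanning a line
l1, l2, l3 = (np.cross(project(P, X), project(P, Y)) for P in (P1, P2, P3))
l1_hat = np.array([l2 @ T[i] @ l3 for i in range(3)])
print(np.allclose(np.cross(l1_hat / np.linalg.norm(l1_hat),
                           l1 / np.linalg.norm(l1)), 0, atol=1e-8))  # parallel vectors
```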

3.3 Forming the System of Equations

The geometric framework of our algorithm is composed of two trifocal tensors: T^L = {T^L_i}, arising from image correspondences between cameras P_1, P_2, P_3, and T^R = {T^R_i}, from correspondences between P_1, P_2, P_4. These tensors, using equations (3) and (2), are given as

T^L_i = R_0^(i) t^T − t_0 (R^(i))^T,    (6)
T^R_i = R_0^(i) (R_0 t + t_0)^T − t_0 ((R_0 R)^(i))^T.    (7)

Since the stereo configuration is fixed and known, it is only required to estimate the twelve parameters of the underlying motion R and t to fix T^L and T^R. From correspondence sets of the form {l^1 ⇔ l^2 ⇔ l^3; l^1 ⇔ l^2 ⇔ l^4} or {x^1 ⇔ x^2 ⇔ x^3; x^1 ⇔ x^2 ⇔ x^4}, one can write a linear system by concatenating the equality constraints of equations (4) and (5) in terms of the twelve unknowns. We can thus form

A y = 0,    (8)
y = [r_11, r_21, r_31, r_12, r_22, r_32, r_13, r_23, r_33, t_1, t_2, t_3, 1]^T,

where r_ij is the (i, j)-th element of the rotation matrix R and the translation is t = [t_1, t_2, t_3]^T. Equation (8) links image-based data to specific components of the trifocal tensors encoding the rotation and translation of the rig. The special forms of the trifocal tensors presented earlier imply that this system is linear in the compound parameter vector encapsulating the rotation and translation. The linear system in equation (8) is inhomogeneous due to the form of equation (7).
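Because every constraint obtained from equations (4)-(7) is affine in the twelve motion parameters, the matrix A of equation (8) can be assembled mechanically, for instance by symbolic differentiation. The sketch below does this with sympy for one line correspondence per tensor; the numeric line measurements are made-up placeholders and the 12 cm baseline is an arbitrary example, so this is an illustration of the construction rather than the authors' implementation.

```python
import sympy as sp

# unknowns, packed column-major as in eq. (8)
r = sp.symbols('r11 r21 r31 r12 r22 r32 r13 r23 r33')
t = sp.symbols('t1 t2 t3')
R, tv = sp.Matrix(3, 3, lambda i, j: r[3 * j + i]), sp.Matrix(t)

def trifocal(RA, tA, RB, tB):
    # T_i = R_A^(i) t_B^T - t_A (R_B^(i))^T   (eq. 3)
    return [RA[:, i] * tB.T - tA * RB[:, i].T for i in range(3)]

R0, t0 = sp.eye(3), sp.Matrix([0.12, 0, 0])           # rectified rig, example baseline
TL = trifocal(R0, t0, R, tv)                          # eq. (6)
TR = trifocal(R0, t0, R0 * R, R0 * tv + t0)           # eq. (7)

def line_rows(l1, l2, l3, T):
    # eq. (4): l^1_i - (l^2)^T T_i l^3 = 0, three (affine) equations
    return [l1[i] - (l2.T * T[i] * l3)[0, 0] for i in range(3)]

# one line correspondence per tensor; the measurements below are made-up numbers
l1, l2 = sp.Matrix([1, -2, 0.5]), sp.Matrix([0.9, -2.1, 0.4])
l3, l4 = sp.Matrix([1.1, -1.8, 0.6]), sp.Matrix([1.0, -1.9, 0.55])
exprs = line_rows(l1, l2, l3, TL) + line_rows(l1, l2, l4, TR)

Acoef, b = sp.linear_eq_to_matrix(exprs, list(r) + list(t))
A = Acoef.row_join(-b)          # rows of eq. (8): A [r..., t..., 1]^T = 0
print(A.shape)                  # (6, 13)
```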

4 Linear and Subspace Solutions

For the sake of illustration, we would like to draw a comparison between (8) and the familiar epipolar equation in the two-view case. The techniques presented here parallel those for retrieving a fundamental matrix from a number of point correspondences. The linear solution we present in Section 4.1 is analogous to the eight-point algorithm for linearly finding a fundamental matrix from eight point matches, while the method in Section 4.2 is an analogue of the non-linear seven-point algorithm for finding not just one but three candidate fundamental matrices from seven point matches [31]. The multi-element nature of the solution obtained by non-linear methods with smaller data sets is exactly what we imply by usage of the term "minimal solver" in Sections 4.2 and 5 for arbitrary small combinations of points and lines.

4.1 The Linear Solution

We now geometrically derive the minimum number of feature correspondences (over the four views) required to solve equation (8). Let us consider x^1 ⇔ x^2 ⇔ x^3 first. With P_1 and P_2 fixed, a corresponding 3D point X is defined from the first two views. Thus, x^3 = P_3 X provides only two linearly independent equations for the unknown (R, t). One point correspondence across four views {x^1 ⇔ x^2 ⇔ x^3; x^1 ⇔ x^2 ⇔ x^4} might therefore be expected to generate 4 linearly independent equations. However, P_3 and P_4 form a stereo pair and the following holds for any stereo pair i and j:

x^i_3 x^j_2 − x^i_2 x^j_3 = 0.    (9)

Thus, on concatenating point correspondence constraints from T^L and T^R, only 3 linearly independent equations are obtained. Arguing similarly, but noting that equation (9) is not invoked for general points on matching lines, it can be shown that a line quartet provides 4 linearly independent equations. These dependencies can also be seen by performing row operations on the matrix A in (8). Hence, given n point- and m line-correspondence sets, matrix A has 3n + 4m independent rows. A linear solution, therefore, can be obtained from {4 points} or {3 lines} or {3 points + 1 line} (overconstrained) or {2 points + 2 lines} (overconstrained). In the presence of noise, it is recommended to use more than the minimum number of equations per correspondence. Even so, with noisy features this approach is not recommended, because R will not be obtained as an orthonormal rotation matrix. One could refine the solution further by minimizing the Frobenius norm with respect to an orthonormal matrix, but better techniques are presented in Sections 4.2 and 5.
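A minimal sketch of this linear solution, assuming A has been assembled as in Section 3.3; the SVD projection at the end is one way to perform the Frobenius-norm refinement mentioned above.

```python
import numpy as np

def linear_motion(A):
    """Linear solution of eq. (8): solve A[:, :12] y12 = -A[:, 12] in least squares."""
    y12, *_ = np.linalg.lstsq(A[:, :12], -A[:, 12], rcond=None)
    R_hat, t = y12[:9].reshape(3, 3, order='F'), y12[9:]   # column-major r_ij packing
    # under noise R_hat is not orthonormal; project to the nearest rotation
    # (in the Frobenius sense), as suggested at the end of this section
    U, _, Vt = np.linalg.svd(R_hat)
    R = U @ np.diag([1.0, 1.0, np.linalg.det(U @ Vt)]) @ Vt
    return R, t
```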

4.2 The Subspace Solution

The system presented in equation (8) can be solved by computing the null vectors of matrix A only when the data is perfectly noiseless. In the presence of noise, the solution is composed of eigenvectors with possibly non-zero eigenvalues. We explore the latter case in this section, where instead of solving equation (8) exactly, a least-squares solution is attempted. This approach is similar to the works described in [11,16,17]. Using non-linear techniques, it is possible to solve for y in (8) from a combination of only {3 points} or {2 points + 1 line} or {2 lines + 1 point}, each of which provides 9, 10 or 11 linearly independent equations respectively. Note that, geometrically speaking, the actual minimal sets are {2 lines} and {1 line + 1 point}. The {2 lines} solution can also be obtained (8 equations), but is unstable in the presence of noise, and similarly for the {1 line + 1 point} case. We therefore ignore these configurations and solve non-linearly given at least 3 correspondences.¹

¹ The {3 lines} case has a linear solution, but in the subspace solution, the orthonormality of the rotation matrix is enforced.

Having a uniform minimal set size also makes implementation easy in a RANSAC setting, where the solver can be transparent to the input feature configuration. With this formulation, the least number of constraints we might expect is 9 equations from {3 points} (for 12 unknowns), and so the solution can be given by a 3-dimensional subspace (= 12 − 9). However, as the system of equations is non-homogeneous, we write

y = y_p + α y_1 + β y_2 + γ y_3,    (10)

where y_p is the so-called 'particular solution' of a non-homogeneous linear system, which can be computed from the pseudoinverse of A, and y_1, y_2 and y_3 are the singular vectors of A corresponding to its three smallest singular values (computed by singular value decomposition of A). To solve for α, β and γ and simultaneously ensure orthonormality of the rotation part, the six polynomial constraints expressing unit norm for each column of R and orthogonality between any two columns can be applied:

R^(i)T R^(i) = 1 and R^(i)T R^(j) = 0, for i, j ∈ {1, 2, 3}, i ≠ j.

The resulting polynomial system of equations may then be solved by a variety of techniques such as Groebner basis [32] or the polynomial eigenvalue problem (PEP) [17]. Here, the PEP solution is outlined. For a gentle introduction to this technique, and example usage for the five-point (and six-point) algorithms, please refer to [16].

Imposing the orthogonality and unit-norm constraints on the rotation matrix components of y in equation (10) generates six polynomial equations of degree two in α, β and γ. In particular, 10 monomial terms (α², β², γ², αβ, βγ, αγ, α, β, γ, and 1) occur in this system. Treating γ as the eigenvalue parameter, the six constraints can be written as follows:

(γ² C_2 + γ C_1 + C_0) v = 0,    (11)

where v = [α², β², αβ, α, β, 1]^T and C_2, C_1 and C_0 are 6×6 coefficient matrices with particular forms, such that the elements in the first five columns of C_2 and the first three columns of C_1 are all zeros. Equation (11) is easily solved, for instance, using the function polyeig in MATLAB. The eigenvalues and eigenvectors returned by this function provide solutions for α, β and γ for computing y.
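Outside MATLAB, equation (11) can be solved by the standard companion linearization of a quadratic eigenvalue problem; the sketch below (using scipy) is one possible implementation of that idea, not the authors' code.

```python
import numpy as np
from scipy.linalg import eig

def polyeig2(C0, C1, C2):
    """Solve the quadratic polynomial eigenvalue problem (g^2 C2 + g C1 + C0) v = 0
    of eq. (11) via companion linearization (the role MATLAB's polyeig plays here)."""
    n = C0.shape[0]
    A = np.block([[np.zeros((n, n)), np.eye(n)], [-C0, -C1]])
    B = np.block([[np.eye(n), np.zeros((n, n))], [np.zeros((n, n)), C2]])
    w, V = eig(A, B)                     # generalized eigenpairs; vectors are [v; g v]
    keep = np.isfinite(w)                # C2 is rank-deficient here, so drop infinite g
    return w[keep], V[:n, keep]          # candidate gammas and corresponding v

# each (near-)real gamma gives v = [alpha^2, beta^2, alpha*beta, alpha, beta, 1]^T
# after rescaling so that the last entry equals 1, from which alpha and beta follow.
```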

While better than the linear approach, the subspace solution over A is still very unstable in the presence of noise. This instability can be traced back to the underlying noisy tensor constraints that do not faithfully encode a camera representing rigid motion. This problem of robustly extracting camera matrices from the trifocal tensor has been reported elsewhere in [7], and a work-around is described there that requires estimating epipoles and applying additional constraints on the camera matrices. We describe a method in the next section, our main contribution, that yields a much more robust solution without requiring any additional information or imposing further constraints. In Section 6, we provide further experimental evidence that justifies why the next method might be preferred over the subspace solution.

5 Quaternion-based Polynomial Solver

The method presented in the previous section imposes orthonormality constraints through equations on the coefficients of the eigenvectors in equation (10). This is similar to solving a constrained optimization problem. In this section, we present our method, which relies upon a quaternion parameterization of the rotation matrix. This casts the problem as an unconstrained optimization problem, obtained by reducing the original constrained problem through an appropriate parameterization of the feasible set. This kind of parameterization has been used in the past, for example to compute absolute orientation from minimal sets of three points for a stereo rig [33]. While our method and the subspace method solve one and the same problem, the place (at the level of the formulae) where noise enters does matter for the final outcome. This is why, as we will show later, the method presented in this section yields better results. A rotation matrix R can be parameterized in terms of the unit quaternion q = a + bi + cj + dk:

R = [ a² + b² − c² − d²    2(bc − ad)           2(ac + bd)
      2(ad + bc)           a² − b² + c² − d²    2(cd − ab)
      2(bd − ac)           2(ab + cd)           a² − b² − c² + d² ].
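This parameterization can be checked numerically; the sketch below builds R from an arbitrary unit quaternion and verifies orthonormality and unit determinant.

```python
import numpy as np

def quat_to_R(a, b, c, d):
    """Rotation matrix from a unit quaternion q = a + bi + cj + dk."""
    return np.array([
        [a*a + b*b - c*c - d*d, 2*(b*c - a*d),         2*(a*c + b*d)],
        [2*(a*d + b*c),         a*a - b*b + c*c - d*d, 2*(c*d - a*b)],
        [2*(b*d - a*c),         2*(a*b + c*d),         a*a - b*b - c*c + d*d]])

q = np.array([0.9, 0.1, -0.3, 0.2]); q /= np.linalg.norm(q)   # arbitrary unit quaternion
R = quat_to_R(*q)
print(np.allclose(R @ R.T, np.eye(3)), np.isclose(np.linalg.det(R), 1.0))  # True True
```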

In a similar way to building equation (8), but with this parameterization, we get the following system of equations:

B y' = 0,    (12)
y' = [a², ab, ac, ad, 1, b², c², d², bc, bd, cd, t_1, t_2, t_3]^T.

The specific ordering of terms in y' is for notational simplicity in further derivations.

We will first solve for the quaternion parameters. To this end, using the equations in (12), (t_1, t_2, t_3) are expressed in terms of the other parameters by performing the following steps:

Solve [B^(12), ..., B^(14)] M_t = [B^(1), ..., B^(11)],    (13)
B_q := [B^(1), ..., B^(11)] − [B^(12), ..., B^(14)] M_t.    (14)

Then we get a reduced system of equations:

B_q y'_q = 0,    (15)


y'_q = [a², ab, ac, ad, 1, b², c², d², bc, bd, cd]^T.

The translation part is given by [t_1, t_2, t_3]^T = −M_t y'_q,

and we now only have to solve the polynomial system (15). At this point, we also introduce the quaternion constraint a² + b² + c² + d² = 1, by appending an additional row [1, 0, 0, 0, −1, 1, 1, 1, 0, 0, 0] to B_q, to ensure that a consistent rotation is obtained in the presence of noise.
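A compact sketch of the elimination steps (13)-(15), including the appended unit-quaternion row; B is assumed to be the stacked constraint matrix of equation (12).

```python
import numpy as np

def reduce_system(B):
    """Eliminate (t1, t2, t3) from B y' = 0 (eq. 12) as in eqs. (13)-(15), and append
    the unit-quaternion row so that noisy data still yields a consistent rotation."""
    Bt, Bu = B[:, 11:14], B[:, 0:11]                 # translation / quaternion columns
    Mt, *_ = np.linalg.lstsq(Bt, Bu, rcond=None)     # eq. (13), Mt is 3 x 11
    Bq = Bu - Bt @ Mt                                # eq. (14): Bq y'_q = 0 (eq. 15)
    norm_row = np.array([1, 0, 0, 0, -1, 1, 1, 1, 0, 0, 0], dtype=float)
    return np.vstack([Bq, norm_row]), Mt             # translation later: t = -Mt @ y'_q
```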

To solve for y'_q, we adopt the method of elimination, writing the monomials (b², c², d², bc, bd, cd) in terms of (a², ab, ac, ad, 1):

[b², c², d², bc, bd, cd]^T = M_q [a², ab, ac, ad, 1]^T,    (16)

where

[B_q^(6), ..., B_q^(11)] M_q = [B_q^(1), ..., B_q^(5)].    (17)

There must be sufficient correspondences so that equation (17) is not underconstrained. The submatrix [B_q^(6), ..., B_q^(11)] must have rank 6, implying at least 6 linearly independent rows. Since one independent constraint is already provided by the quaternion unit-norm condition, 5 or more independent rows must come from the point or line correspondences in B_q. However, we note that the manipulation in (14) introduces dependencies in B_q, and therefore the minimal configuration sets are {3 points} or {2 lines} or {2 points + 1 line}. It is worth pointing out that this rank deficiency of B_q prevents us from solving (15) directly. The {2 lines} configuration tends to be less stable in the presence of noise, and therefore we employ {3 lines} instead and also include {2 lines + 1 point} sets. From a systems point of view, this also lets us use a uniform set size of 3 features for any combination in a RANSAC setting. This minimal 3-feature criterion ensures that it will always be possible to solve equations (13) and (17). Now, the equations in (16) can be rewritten as

[ b² ]   [ [1] [1] [1] [2] ]
[ c² ]   [ [1] [1] [1] [2] ] [ b ]
[ d² ] = [ [1] [1] [1] [2] ] [ c ]    (18)
[ bc ]   [ [1] [1] [1] [2] ] [ d ]
[ bd ]   [ [1] [1] [1] [2] ] [ 1 ]
[ cd ]   [ [1] [1] [1] [2] ]

where [i] represents an i-th degree polynomial in a. We note that not all terms on the LHS of equation (18) are independent; they should, in fact, satisfy the following conditions:

(bc)² = (b²)(c²),   (bc)(bd) = (b²)(cd),
(bd)² = (b²)(d²),   (bc)(cd) = (c²)(bd),
(cd)² = (c²)(d²),   (bd)(cd) = (d²)(bc).    (19)

Applying the RHS of (18) to the six constraints in (19), each constraint gives a 3-degree polynomial in a, and we obtain the following final system:

[ [3] [3] [3] [4] ]
[ [3] [3] [3] [4] ] [ b ]
[ [3] [3] [3] [4] ] [ c ] = 0.    (20)
[ [3] [3] [3] [4] ] [ d ]
[ [3] [3] [3] [4] ] [ 1 ]
[ [3] [3] [3] [4] ]

Since, by hypothesis, there exists a solution for [b, c, d, 1]^T, any 4×4 submatrix in (20) must have a determinant equal to 0. This gives a 13-degree polynomial in a, and in the rectified stereo case (R_0 = I, t_0 = [t_x, 0, 0]^T), it has the following specific form:

(k_1 a^10 + k_2 a^8 + k_3 a^6 + k_4 a^4 + k_5 a^2 + k_6) a^3 = 0.    (21)

Let α = a²:

k_1 α^5 + k_2 α^4 + k_3 α^3 + k_4 α^2 + k_5 α + k_6 = 0.    (22)

This 5-degree polynomial can be easily solved and the real-valued a's retained. Thus, the rotation matrix R can be composed after (b, c, d) are obtained from the null-vector of the 4×4 submatrix in (20), and then the translation follows as [t_1, t_2, t_3]^T = −M_t y'_q.
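Putting the last steps together, the following is a hedged sketch of the back-substitution stage. It assumes that the preceding (symbolic) elimination has already produced the coefficients k_1, ..., k_6 of equation (22) and a routine M20_at(a) that evaluates a 4×4 numeric submatrix of (20) at a given value of a; neither of these is shown here.

```python
import numpy as np

def quaternions_from_quintic(k, M20_at):
    """Back-substitution stage: k = [k1, ..., k6] are the coefficients of eq. (22) and
    M20_at(a) returns a 4x4 numeric submatrix of eq. (20) evaluated at a. Both are
    assumed to come from the elimination described above."""
    candidates = []
    for alpha in np.roots(k):                        # roots of the quintic in alpha = a^2
        if abs(np.imag(alpha)) > 1e-9 or np.real(alpha) <= 0:
            continue
        for a in (np.sqrt(np.real(alpha)), -np.sqrt(np.real(alpha))):
            _, _, Vt = np.linalg.svd(M20_at(a))
            v = Vt[-1]                               # null vector, proportional to [b, c, d, 1]
            if abs(v[3]) < 1e-12:
                continue
            b, c, d = v[:3] / v[3]
            q = np.array([a, b, c, d])
            candidates.append(q / np.linalg.norm(q)) # each q gives R via the matrix above
    return candidates
```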

The multiple solutions are not a big problem, as the motion parameters with the largest support will be picked in a RANSAC setting. Compared to the standard 3-point algorithm, one can employ more than just the minimal number of feature correspondences to obtain better motion hypotheses in the presence of noise (using more features, as we will see, outperforms the 3-point algorithm). This can potentially lead to more trials due to an increased probability of adding an outlier feature to the minimal set. However, in practice, we found that larger feature sets generate better hypotheses, and this usually offsets the negative influence of the occasional outlier (see Section 6). Furthermore, once the best solution and corresponding inliers have been found, they can all be plugged into equation (12) to recompute a more consistent solution.

6 Experiments

In this section, we report experimental results for synthetic and real data. Synthetic data experiments are performed to compare the results obtained using the different versions of our trifocal formulation: trifocal linear (Section 4.1), trifocal subspace (Section 4.2) and trifocal quaternion-n (Section 5). The n suffix indicates the number of feature correspondences used in the solver (with n = 3 being the minimal case). We also perform comparisons against the popularly used three-point algorithm [12] and the line-only based motion estimation setup described in [25], which we refer to in the experiments as 'LinesSfM'. We believe that a mechanism to incorporate more than just the minimal set in closed form for generating a single hypothesis is one of the advantages of our method when compared to these algorithms. This is demonstrated in the various experiments below by setting n = 4 or 5. Since the development of a mixed-feature solver was motivated by practical difficulties in sparsely textured or badly illuminated indoor environments, we also performed real experiments that exhibit such pathological behavior to evaluate our algorithm against the state of the art. For ease of reference, the various feature combinations supported by different versions of our algorithm are summarized in Table 1.

Fig. 3 Rotation and translation errors comparing the performance of linear (red), subspace (green) and quaternion (blue) based formulations of the trifocal minimal solver using arbitrary feature combinations.

6.1 Synthetic Data

We quantify the performance of the different algorithms discussed in the paper across various noise levels. A virtual stereo camera pair is created by fixing the baseline to 7.5 cm (similar to that of human eyes). The second stereo pair is displaced with a random motion. 3D scene geometry and image correspondences are also created by randomly generating 3D points and lines in space and projecting them onto the four cameras. Zero-mean Gaussian noise with varying standard deviations is added to the image coordinates of lines and points in each view. Furthermore, similar to [25,11], we use only the lower quartile of the error distributions for all algorithms, as the targeted use is in a RANSAC framework where finding a fraction of good hypotheses is more important than consistency.
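A sketch of such a synthetic setup (points only, for brevity); the 0.5-pixel noise level and the 800-pixel focal length used to map it onto the z = 1 plane are illustrative assumptions, not the exact values used in the paper.

```python
import numpy as np
from scipy.spatial.transform import Rotation

rng = np.random.default_rng(7)
R0, t0 = np.eye(3), np.array([0.075, 0.0, 0.0])       # 7.5 cm baseline, rectified rig
R = Rotation.random(random_state=3).as_matrix()       # ground-truth motion to recover
t = 0.1 * rng.normal(size=3)

P = [np.hstack([np.eye(3), np.zeros((3, 1))]),        # P1
     np.hstack([R0, t0[:, None]]),                    # P2
     np.hstack([R, t[:, None]]),                      # P3
     np.hstack([R0 @ R, (R0 @ t + t0)[:, None]])]     # P4

X = rng.uniform([-2, -2, 3], [2, 2, 8], size=(100, 3))  # random 3D points in front of rig
sigma = 0.5 / 800.0                                   # ~0.5 px noise mapped to the z = 1 plane
observations = []
for Xw in X:
    xs = [Pi @ np.append(Xw, 1.0) for Pi in P]
    xs = [x / x[2] for x in xs]                       # normalize onto z = 1
    xs = [x + np.r_[sigma * rng.normal(size=2), 0.0] for x in xs]
    observations.append(xs)                           # one noisy quadruplet per point
```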

Figure 3 shows the rotation and translation errors for the various flavors of our trifocal formulation, given different minimal sets. Random permutations of points and lines were generated for each trial.

Algorithm                Feature combinations (points, lines)
Trifocal Linear          (4,0), (3,1), (2,2), (0,3)
Trifocal Subspace        (3,0), (2,1), (1,2), (0,3)
Trifocal Quaternion-3    (3,0), (2,1), (1,2), (0,3)
Trifocal Quaternion-4    (4,0), (3,1), (2,2), (1,3), (0,4)
Trifocal Quaternion-5    (5,0), (4,1), (3,2), (2,3), (1,4), (0,5)

Table 1 Overview of the feature combinations supported by different flavors of the trifocal constraint based solver.

Fig. 6 Number of RANSAC trials before an acceptable solution was found for trifocal quaternion-3 (red), -4 (green) and -5 (blue).

As alluded to earlier, the linear approach has the worst performance, while the quaternion-based polynomial system works best. These results are based on 1000 trials at each noise level. We next compare performance using only points against the standard three-point algorithm (figure 4) and using only lines against LinesSfM (figure 5). For both figures, the plots in (a) and (b) show lower quartile errors in the presence of noise for the various experiments after conducting 5000 trials. For the points-only case, the quaternion-3 method shows slightly inferior error behavior compared to the three-point algorithm, but the quaternion-4 and quaternion-5 methods outperform it. Using only lines, all versions of the trifocal method do significantly better.


Fig. 4 Comparison of different flavors of the trifocal quaternion algorithm against the three-point algorithm. (a) and (b) show lower quartile rotation and translation errors; (c) and (d) present translation deviation from the true values for two kinds of motion: forward camera motion (c) and sideways motion (d). Finally, average rotation and translation errors after RANSAC trials are shown in (e) and (f).


Fig. 5 Comparison of different flavors of the trifocal quaternion algorithm against the LinesSfM algorithm. (a) and (b) show lower quartile rotation and translation errors; (c) and (d) present translation deviation from the true values for two kinds of motion: forward camera motion (c) and sideways motion (d). Finally, average rotation and translation errors after RANSAC trials are shown in (e) and (f).


Fig. 7 Translation deviation errors with increasing baseline for (a) point feature based odometry and (b) line feature based odometry.


Similar to the validation performed in [11], we also measure the deviation of the estimated translation direction from the true value. These graphs are provided in subfigures (c) for forward camera motion and (d) for sideways camera motion in figures 4 and 5. These are again lower quartile errors after 5000 simulations, as a function of noise level. Visual odometry systems generally employ some form of non-linear refinement, such as sparse bundle adjustment, after obtaining the inlier feature set from an initial solution using a minimal solver (the three-point, LinesSfM or our algorithm). In our experimental evaluation, we use a descent algorithm (quasi-Newton, with an approximated Hessian) that minimizes the geometric error over the point and line feature image projections; more specifically, the distance from the observed feature points or the endpoints of line segments to the reprojected points or lines, both in the non-linear minimization and in RANSAC inlier selection. Results after this refinement step are also displayed to demonstrate the better accuracy that can be accomplished with correspondingly better initial estimates.

To complete the evaluation of the solver as part of a visual odometry pipeline, average errors in a RANSAC setting are plotted in subfigures (e) and (f) for both feature types. 100 point features and 30 line features were generated for the 5000 simulations.

Since we are proposing a scheme that can incorporate more than the minimal set, an evaluation of its impact on the number of RANSAC trials is presented in figure 6. For this test, a random set of 30 points and 15 lines was generated, and RANSAC trials selecting random combinations of features were carried out until a threshold reprojection error (in all four images) was reached. ANOVA tests between the three groups at each noise level found no significant difference, but these results might differ for real data.

Finally, the impact of the stereo camera baseline on translation deviation errors is shown in figure 7. Three baselines were simulated: 7.5 cm as before, 15 cm and 30 cm. Increasing the baseline reduces the error, as is well known in the literature (the simulation program ensures that appropriate feature matches are generated), but the trends discussed so far persist for both feature types. Note that, for the sake of fairness, we only use the particular class of features the benchmark methods have been designed to work with, though we argue that incorporating mixed feature combinations adds an additional dimension to the odometry pipeline with potential for greater accuracy. It makes more sense to demonstrate this behavior with real data, as line and point features have different error statistics with significant impact on performance, which are difficult to replicate in simulated experiments. We do so in the next section.

6.2 Real Data

We have implemented a visual odometry system that uses the standard 3-point algorithm and the proposed trifocal-quaternion algorithm. It takes a stereo video stream as input, performs stereo rectification, point and line feature detection and tracking, and computes the motion using the tracked features. Here, we briefly describe some of the implementation details. For point features, we follow the steps outlined in [34] for KLT feature tracking.

Detecting and tracking lines in video is in general much harder than points. There are a few line detectors available (e.g. [35]), but they are not fast enough for the on-line processing of stereo video streams.


Fig. 8 Runtime breakdown of the experimental visual odometry system with (a) point features (~20 fps) and (b) assorted point and line features (~9 fps). Both implementations use the trifocal quaternion based solver.

We develop an improved version of the line detector/tracker proposed in [25]. Instead of using fixed grids as in [25], the line segment seeds are initialized at the local minima of the edge response map, and line segments are then grown from the seeds (figure 9(a)). The line segments are then merged together to form straight lines, using the same criterion as in [25], namely the approximate thickness of the convex hull of the line segments (figure 9(b)).

To track lines, each line segment in the current frame is individually tracked using the standard KLT tracker located at the line segment's center. The tracked line segments are matched to the lines in the previous frame, and each current line is associated to the line in the previous frame that contains the most matched line segments from the current frame. There may be an aperture problem in tracking line segments, i.e. tracked line segments may slide along the line, but the line segment voting step practically resolves this problem. We use NVidia's CUDA framework intensively in feature detection and tracking to achieve faster processing speed. The current system runs at ~9 fps on a laptop (Intel Core2Duo 2.5 GHz and NVidia GeForce 8600M GT) with both line and point features, and ~20 fps with only point features on the same machine (figure 8).
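A simplified, CPU-only sketch of this segment-level KLT tracking using OpenCV; the grouping/voting of tracked segments into lines is omitted, and the window size and pyramid depth are example values rather than the settings used in our CUDA implementation.

```python
import cv2
import numpy as np

def track_segments(prev_gray, cur_gray, segments):
    """Track each line segment by running KLT at its midpoint.

    `segments` is an (N, 4) array of endpoints (x1, y1, x2, y2); the subsequent
    voting/association of tracked segments to lines is omitted here."""
    mids = ((segments[:, :2] + segments[:, 2:]) / 2).astype(np.float32).reshape(-1, 1, 2)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, cur_gray, mids, None,
                                              winSize=(21, 21), maxLevel=3)
    shift = (nxt - mids).reshape(-1, 2)
    tracked = segments.astype(np.float32).copy()
    tracked[:, :2] += shift
    tracked[:, 2:] += shift
    return tracked, status.ravel().astype(bool)       # per-segment tracking success
```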

We compare performance in two very challenging environments. Guided by our simulation results, we contrast trifocal quaternion-5 against the 3-point algorithm. In figure 10, the sequence is collected around a parking structure. This is a large, open space, and several frames have either very little texture or point features clustered on one side.


Fig. 9 Line detection and tracking process. (a) Line segments are detected from the seeds. (b) Lines are detected by merging line segments on the same line.

The 3-point algorithm performs noticeably worse, exhibiting a severe break (see near the top of the figure) and veering off from the loop. About 2036 frames were captured for this sequence. In figure 11, we show the result on a corridor sequence, where a lack of sufficient texture in some frames leads to very few points being tracked. The actual motion simply consists of a walk along the corridor, entrance into a staircase and a return. As is obvious from the estimated trajectory, the use of points and lines in the trifocal quaternion algorithm leads to much better performance. The 3-point algorithm drifts significantly into the wall. The center plot in figure 11 shows a sideways view of the results. Since the stereo camera was fixed and transported on a platform, the motion was more or less confined to a plane parallel to the ground. Ideally, the estimated camera path should lie on the thin horizontal line from this perspective. The 3-point algorithm begins to drift from this very quickly, while the effect is minimized for the trifocal quaternion-5 algorithm. At the bottom of figure 11, a plot of the number of features available to both algorithms and the RANSAC inliers found per frame is shown. This clearly shows the advantage of exploiting both feature types in such situations.

Fig. 11 Performance in the low-textured corridor sequence.


Fig. 10 Sequence for a loop around a parking structure (red: 3-point algorithm, green: trifocal quaternion-5). Scale bar: 20 m.

7 Conclusion and Future Work

We have presented a robust algorithm for estimating camera motion for a stereo pair using combinations of point and line features. As shown by our experiments, using such assorted feature sets leads to better motion estimates than state-of-the-art visual odometry algorithms in real-world scenarios. This helps in handling low-textured regions, where the conventional method of point-based odometry may fail. The quaternion-based solution yields significantly better motion estimates than the conventional linear/subspace methods in the presence of noise and provides a robust mechanism for extracting camera matrices from noisy trifocal tensors.

As part of future work, we plan to integrate the developed visual odometry pipeline into a 'hybrid' visual SLAM system that builds a map over point and line landmarks.

References

1. J. Neira, J. D. Tardos, J. Horn, and G. Schmidt, “Fusing range and intensity images for mobile robot localization,” IEEE Transactions on Robotics and Automation, vol. 15, pp. 76–84, February 1999.

2. D. Nister, O. Naroditsky, and J. Bergen, “Visual odometry,” IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 652–659, 2004.

3. Z. Zhu, T. Oskiper, S. Samarasekera, R. Kumar, and H. S. Sawhney, “Ten-fold improvement in visual odometry using landmark matching,” International Conference on Computer Vision (ICCV), pp. 1–8, 2007.

4. D. Lowe, “Object recognition from local scale-invariant features,” Proceedings of the International Conference on Computer Vision, pp. 1150–1157, 1999.

5. M. Pollefeys, D. Nister, et al., “Detailed real-time urban 3d reconstruction from video,” IJCV, 2007.

6. S. Christy and R. Horaud, “Iterative pose computation from line correspondences,” CVIU, vol. 73, no. 1, pp. 137–144, 1999.

7. R. Hartley, “Lines and points in three views and the trifocal tensor,” IJCV, vol. 22, pp. 125–140, March 1997.

8. E. Rosten and T. Drummond, “Fusing points and lines for high performance tracking,” ICCV, vol. 2, pp. 1508–1515, October 2005.

9. M. A. Fischler and R. C. Bolles, “Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.

10. V. Pradeep and J. Lim, “Egomotion using assorted features,” IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1514–1521, August 2010.

11. D. Nister, “An efficient solution to the five-point relative pose problem,” IEEE PAMI, vol. 26, no. 6, pp. 756–770, 2004.

12. R. Haralick, C. Lee, K. Ottenberg, and M. Nolle, “Analysis and solutions of the three point perspective pose estimation problem,” IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 1991.





13. M. Bujnak, Z. Kukelova, and T. Pajdla, “A general solution to the p4p problem for camera with unknown focal length,” IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–8, 2008.

14. B. Triggs, “Camera pose and calibration from 4 or 5 known 3D points,” ICCV, pp. 278–284, 1999.

15. H. Stewénius, C. Engels, and D. Nister, “Recent developments on direct relative orientation,” J. Photogrammetry and Remote Sensing, vol. 60, pp. 284–294, 2006.

16. Z. Kukelova, M. Bujnak, and T. Pajdla, “Polynomial eigenvalue solutions to the 5-pt and 6-pt relative pose problems,” BMVC, 2008.

17. H. Li and R. Hartley, “Five-point motion estimation made easy,” ICPR, vol. 1, 2006.

18. A. Bartoli and P. Sturm, “Multiple-view structure and motion from line correspondences,” ICCV, 2003.

19. Y. Liu and T. Huang, “A linear algorithm for motion estimation using straight line correspondences,” ICPR, pp. 213–219, 1988.

20. P. H. S. Torr and A. Zisserman, “Robust parameterization and computation of the trifocal tensor,” Image and Vision Computing, vol. 15, pp. 591–605, 1997.

21. A. Shashua and L. Wolf, “On the structure and properties of the quadrifocal tensor,” Lecture Notes in Computer Science, pp. 710–724, 2000.

22. A. Comport, E. Malis, and P. Rives, “Accurate quadrifocal tracking for robust 3d visual odometry,” ICRA, pp. 40–45, 2007.

23. R. Hartley, “Computation of the trifocal tensor,” ECCV, pp. 20–35, 1998.

24. A. Heyden, “Geometry and algebra of multiple projective transformations,” PhD thesis, Lund University, 1995.

25. M. Chandraker, J. Lim, and D. J. Kriegman, “Moving in stereo: Efficient structure and motion using lines,” ICCV, 2009.

26. S. Seitz and P. Anandan, “Implicit representation and scene reconstruction from probability density functions,” IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, 1999.

27. A. Ansar and K. Daniilidis, “Linear pose estimation from points or lines,” IEEE PAMI, vol. 25, no. 5, pp. 578–589, 2003.

28. F. Dornaika and C. Garcia, “Pose estimation using point and line correspondences,” Real-Time Imaging, vol. 5, no. 3, pp. 215–230, 1999.

29. J. Oliensis and M. Werman, “Structure from motion using points, lines, and intensities,” IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, 2000.

30. R. I. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521623049, 2000.

31. Z. Zhang, “Determining the epipolar geometry and its uncertainty: A review,” International Journal of Computer Vision (IJCV), vol. 27, pp. 161–195, March 1998.

32. Z. Kukelova, M. Bujnak, and T. Pajdla, “Automatic generator of minimal problem solvers,” ECCV, pp. 302–315, 2008.

33. B. K. P. Horn, “Closed-form solution of absolute orientation using unit quaternions,” Journal of the Optical Society of America, vol. 4, pp. 629–642, 1987.

34. B. D. Lucas and T. Kanade, “An iterative image registration technique with an application to stereo vision,” International Joint Conference on Artificial Intelligence (IJCAI), pp. 1151–1156, 1981.

35. R. G. von Gioi, J. Jakubowicz, J.-M. Morel, and G. Randall, “LSD: A fast line segment detector with a false detection control,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 32, pp. 722–732, April 2010.