
VISUAL ODOMETRY WITH AN OMNIDIRECTIONAL MULTI-CAMERA SYSTEM

Ana Rita Pereira∗, Valdir Grassi Junior∗, Helder Araújo†

∗São Carlos School of Engineering, University of São Paulo, São Carlos, SP, Brazil

†Institute for Systems and Robotics, University of Coimbra, Coimbra, Portugal

Emails: [email protected], [email protected], [email protected]

Abstract— This paper proposes a visual odometry approach for an omnidirectional multi-camera system. Monocular visual odometry is carried out for each camera and the best estimate is selected through a voting scheme. While this approach does not present significant improvements in the trajectory in the XY-plane, it outperforms any of the monocular odometries in the relative translation in Z.

Keywords— Visual odometry, multi-camera system, omnidirectional vision.

1 Introduction

Autonomous navigation has been a relevant research topic for a long time, leading to evident progress in recent years. One of the many topics addressed is visual odometry, which aims at retrieving the relative motion of one or more cameras based solely on their images.

Visual odometry with one camera or a stereo pair facing the same part of the scene has a limited field of view. Of the 360° surroundings, only a small part is used for estimating the camera's ego motion, potentially ignoring parts of the scenario which are rich in features and could provide a better motion estimate. In this paper we propose a visual odometry approach for an omnidirectional multi-camera system, so that the whole surroundings of the scene can be explored.

A lot of work has been done on visual odometry with multiple cameras. Kim et al. (2007) determine the ego-motion of a set of non-overlapping cameras: visual odometry is computed for each camera through the 5-Point algorithm and all estimates are averaged. If the motion contains rotation, the translation scale can be retrieved. Clipp et al. (2008) propose a method for solving the motion of two non-overlapping cameras mounted on the roof of a vehicle, requiring only five correspondences in the first camera and one in the second. The 5-Point algorithm is carried out in the first camera, retrieving its rotation and translation up to a scale factor. Using one correspondence in the second camera, and as long as the motion has enough rotation, the scale is retrieved. This approach can be extended to a larger number of cameras. Kazik et al. (2012) compute the 6 DoF motion of two opposite-facing cameras in real time and in challenging indoor environments. Monocular visual odometry is computed separately for each camera and, at every key-frame, bundle adjustment is performed, minimizing the reprojection error. The individual motion estimates are fused in an Extended Kalman Filter (EKF). The known transform between the two cameras allows retrieving the absolute scale as long as the motion is not degenerate. Degenerate motions are detected and discarded, preventing the propagation of wrong scale estimates. Lee et al. (2013) compute the non-holonomic motion of a vehicle with four cameras mounted on the roof. The authors simplify the Generalized Epipolar Constraint by applying the Ackermann steering principle, allowing the motion to be computed with two points. They also propose a method which allows retrieving the scale in degenerate motions, requiring one additional inter-camera correspondence. Motion is computed in a Kalman filter framework, where the velocity is assumed to be constant. Loop closure and bundle adjustment are carried out.

The omnidirectional system used in this work is the spherical multi-camera system Ladybug2 by FLIR. This sensor consists of six cameras: five arranged as a ring, looking outwards, and a sixth looking up. Only the five lateral cameras are considered here. The distance between the cameras is relatively short, and the lack of a large baseline prevents an accurate depth estimation. The proposed approach for visual odometry consists of two main steps. In the first step, a monocular visual odometry algorithm is applied to each camera; its outputs are five relative motion estimates, obtained from a set of previous and current images of the five cameras. In the second step, the feature matches found in each camera are used to test each of the five motion estimates: if a feature match agrees with the motion estimate being tested, it is considered an inlier. The motion estimate with the most inliers is selected. This process is repeated every time a new frame arrives, leading to a chain of voted motion estimates. A general overview of the proposed algorithm is depicted in Figure 1.
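To make the data flow concrete, the following Python-like sketch illustrates how the two steps could be organized for one pair of consecutive frames. It is not the paper's actual ROS/C++ implementation, and the helpers monocular_odometry and count_inliers are hypothetical stand-ins for the procedures detailed in sections 2.1 and 2.2.

```python
# Illustrative sketch only; monocular_odometry() and count_inliers() are
# hypothetical placeholders for the steps described in sections 2.1 and 2.2.

def process_frame_pair(prev_images, curr_images, K, extrinsics):
    # Step 1: one relative motion estimate per lateral camera.
    estimates = []
    for cam in range(5):
        R, t, matches = monocular_odometry(prev_images[cam], curr_images[cam], K)
        estimates.append((R, t, matches))

    # Step 2: every camera n votes for every estimate m with its inlier count.
    votes = [0] * 5
    for m, (R_m, t_m, _) in enumerate(estimates):
        for n, (_, _, matches_n) in enumerate(estimates):
            votes[m] += count_inliers(R_m, t_m, matches_n, K,
                                      extrinsics, cam_from=m, cam_to=n)

    best = max(range(5), key=lambda m: votes[m])
    return estimates[best]          # voted motion estimate for this frame pair
```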

The remainder of this paper is organized as follows: section 2 explains each step of the proposed method, the experimental procedure and results are described in section 3, and section 4 adds closing remarks and future work suggestions.


Figure 1: Summarized flow chart of the proposed approach.

2 Method overview

In this section each phase of the proposed method is detailed, followed by a note on the translation scale. The code was implemented based on a few existing works, using ROS, Eigen and OpenCV.

2.1 Monocular visual odometry

Each image from the Ladybug2 sensor is converted to gray scale and rectified. The first step in the visual odometry pipeline is detecting and tracking features. FAST (Rosten and Drummond, 2006) features are extracted and tracked over subsequent images using the Lucas-Kanade method (Lucas and Kanade, 1981). Once the number of inliers in the image drops below 50, a new FAST detection is triggered. The code from Singh (2015), which uses the functions available in OpenCV 3, served as a basis. Inter-camera features were initially taken into account; however, they were later dropped because they considerably increased execution time while having little influence on the results. Before the actual motion computation, the obtained 2D points are normalized: their centroid is set to zero and the points are multiplied by a scale factor such that the average distance to the origin is √2.
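A minimal sketch of this detection, tracking and normalization step is shown below, assuming OpenCV's Python bindings (the actual implementation follows Singh (2015) with OpenCV 3); the √2 normalization follows the text, while the function names and default FAST parameters are illustrative.

```python
import cv2
import numpy as np

fast = cv2.FastFeatureDetector_create()

def detect_features(gray):
    # FAST corner detection (Rosten and Drummond, 2006).
    keypoints = fast.detect(gray, None)
    return np.float32([kp.pt for kp in keypoints]).reshape(-1, 1, 2)

def track_features(prev_gray, curr_gray, prev_pts):
    # Lucas-Kanade tracking; keep only points tracked successfully.
    # (A new FAST detection is triggered elsewhere when fewer than 50 inliers remain.)
    curr_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, prev_pts, None)
    ok = status.ravel() == 1
    return prev_pts[ok].reshape(-1, 2), curr_pts[ok].reshape(-1, 2)

def normalize_points(pts):
    # Shift the centroid to the origin and scale the points so that the
    # average distance to the origin is sqrt(2) (Hartley normalization).
    centroid = pts.mean(axis=0)
    centered = pts - centroid
    scale = np.sqrt(2) / np.linalg.norm(centered, axis=1).mean()
    T = np.array([[scale, 0.0, -scale * centroid[0]],
                  [0.0, scale, -scale * centroid[1]],
                  [0.0, 0.0, 1.0]])
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    return (T @ pts_h.T).T[:, :2], T
```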

In a RANSAC (Fischler and Bolles, 1981) framework, the 8-Point algorithm (Hartley, 1997) is carried out: in each RANSAC iteration, eight random matches are selected and used for computing the fundamental matrix F. For this, the 8 × 9 matrix A is formed out of the eight points, where each line takes the following form:

$$A_i = \begin{bmatrix} x_{i,t+1}\,x_{i,t} & x_{i,t+1}\,y_{i,t} & x_{i,t+1} & y_{i,t+1}\,x_{i,t} & y_{i,t+1}\,y_{i,t} & y_{i,t+1} & x_{i,t} & y_{i,t} & 1 \end{bmatrix} \qquad (1)$$

for each pair of corresponding points p_i = [x_i, y_i]^T in consecutive images taken at instants t and t+1. The SVD of A is computed and F is obtained from the elements of the singular vector corresponding to the smallest singular value (Trucco and Verri, 1998). Finally, the rank-2 constraint is imposed. Afterwards, F is tested against all points by calculating the Sampson cost, defined as

$$d_S = \frac{\left(\tilde{p}_{i,t+1}^{\,T}\, F\, \tilde{p}_{i,t}\right)^2}{(F\tilde{p}_{i,t})_x^2 + (F\tilde{p}_{i,t})_y^2 + (F^T\tilde{p}_{i,t+1})_x^2 + (F^T\tilde{p}_{i,t+1})_y^2} \qquad (2)$$

where the tilde indicates homogeneous coordinates and the subscripts x and y refer to the first and second components of the resulting vector. For a given point in the previous image, this cost approximates the distance from the actual corresponding point to the epipolar line given by F in the current image. Points are considered inliers if this cost is below a threshold. This process is repeated 400 times, and the F with the most inliers is afterwards refined using all agreeing points.
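A sketch of this RANSAC loop over the 8-Point algorithm with Sampson scoring is given below. The 400 iterations follow the text; the inlier threshold is a placeholder, the names are illustrative, and the final refinement is approximated here by a linear re-estimation over all inliers.

```python
import numpy as np

def eight_point(p_prev, p_next):
    # Build A row by row following Eq. (1) and solve A f = 0 via SVD.
    x, y = p_prev[:, 0], p_prev[:, 1]
    xp, yp = p_next[:, 0], p_next[:, 1]
    A = np.column_stack([xp * x, xp * y, xp, yp * x, yp * y, yp,
                         x, y, np.ones(len(x))])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)        # singular vector of the smallest singular value
    U, S, Vt = np.linalg.svd(F)     # impose the rank-2 constraint
    S[2] = 0.0
    return U @ np.diag(S) @ Vt

def sampson_cost(F, p_prev, p_next):
    # Eq. (2), with points in homogeneous coordinates.
    ph_prev = np.hstack([p_prev, np.ones((len(p_prev), 1))])
    ph_next = np.hstack([p_next, np.ones((len(p_next), 1))])
    Fx = ph_prev @ F.T              # rows are F * p_prev
    Ftx = ph_next @ F               # rows are F^T * p_next
    num = np.sum(ph_next * Fx, axis=1) ** 2
    den = Fx[:, 0]**2 + Fx[:, 1]**2 + Ftx[:, 0]**2 + Ftx[:, 1]**2
    return num / den

def ransac_fundamental(p_prev, p_next, iterations=400, threshold=1e-2):
    best_F, best_inliers = None, np.zeros(len(p_prev), dtype=bool)
    for _ in range(iterations):
        idx = np.random.choice(len(p_prev), 8, replace=False)
        F = eight_point(p_prev[idx], p_next[idx])
        inliers = sampson_cost(F, p_prev, p_next) < threshold
        if inliers.sum() > best_inliers.sum():
            best_F, best_inliers = F, inliers
    if best_F is not None:          # refine with all agreeing points (linear)
        best_F = eight_point(p_prev[best_inliers], p_next[best_inliers])
    return best_F, best_inliers
```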

The essential matrix E is computed out of F as shown below:

$$E = K^T\, F\, K, \qquad (3)$$

where K is the camera matrix with the intrinsic parameters. The rank-2 constraint is also applied to E, before decomposing it into a rotation matrix R and a translation vector t. Assuming matrices

$$W = \begin{bmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix} \quad \text{and} \quad Z = \begin{bmatrix} 0 & 1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \qquad (4)$$

rotation matrix candidates Ra and Rb and the skew-symmetric translation matrix S are computed as follows:

$$R_a = U\,W\,V^T, \quad R_b = U\,W^T\,V^T, \quad S = U\,Z\,U^T, \qquad (5)$$

where matrices U and V are the outputs of the SVD of E. If the determinant of Ra (or Rb) is negative, the matrix must be multiplied by −1. As for the translation vector t, its components are extracted from S. After decomposing E, two possible rotations (Ra and Rb) and two possible translations (t and −t) are obtained, leading to four possible transforms. Each combination is subjected to a chirality test, where points are triangulated and checked for lying in front of both the previous and the current camera. The combination with the most valid points is selected. This implementation of the 8-Point algorithm was based on (Geiger, 2010) for the monocular case.
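The decomposition and chirality test could be sketched as follows, with W and Z from Eq. (4) and the pose candidates from Eq. (5); the triangulation uses OpenCV's cv2.triangulatePoints, and all names are illustrative rather than taken from the actual implementation.

```python
import cv2
import numpy as np

W = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 1.]])
Z = np.array([[0., 1., 0.], [-1., 0., 0.], [0., 0., 0.]])

def decompose_essential(E):
    # Eq. (5): two rotation candidates and the translation, up to sign.
    U, _, Vt = np.linalg.svd(E)
    Ra, Rb = U @ W @ Vt, U @ W.T @ Vt
    if np.linalg.det(Ra) < 0: Ra = -Ra
    if np.linalg.det(Rb) < 0: Rb = -Rb
    S = U @ Z @ U.T                         # skew-symmetric translation matrix
    t = np.array([S[2, 1], S[0, 2], S[1, 0]])
    return Ra, Rb, t

def points_in_front(R, t, p_prev, p_next, K):
    # Triangulate and count points lying in front of both cameras.
    P0 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P1 = K @ np.hstack([R, t.reshape(3, 1)])
    X = cv2.triangulatePoints(P0, P1,
                              p_prev.T.astype(np.float64),
                              p_next.T.astype(np.float64))
    X = X[:3] / X[3]                        # 3xN points in the previous camera frame
    depth_prev = X[2]
    depth_curr = R[2] @ X + t[2]
    return int(np.sum((depth_prev > 0) & (depth_curr > 0)))

def select_pose(E, p_prev, p_next, K):
    Ra, Rb, t = decompose_essential(E)
    candidates = [(Ra, t), (Ra, -t), (Rb, t), (Rb, -t)]
    return max(candidates,
               key=lambda c: points_in_front(c[0], c[1], p_prev, p_next, K))
```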

2.2 Voting scheme

The previous step outputs five motion estimates, obtained from the features found in each camera. These estimates undergo a voting scheme, where each camera n votes for each transform m. The following steps are carried out. First, since the transform returned by the monocular visual odometry step for camera m is expressed in camera m coordinates, it is converted to the camera n reference frame.


After extracting R and t from the converted transform, the essential matrix E is computed as

$$E = S\,R, \qquad (6)$$

where S is the skew-symmetric matrix of the translation vector. The matrix F is then composed as

$$F = K^{-T}\,E\,K^{-1}, \qquad (7)$$

followed by enforcement of the rank-2 constraint. Finally, the Sampson cost is computed for all features found in camera n, obtaining a number of inliers; these are the votes of camera n for camera m's motion estimate. The votes are summed over all cameras n, and the estimate m with the most votes is selected, being the final output of the multi-camera visual odometry algorithm.
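A sketch of one vote under these equations is given below. It reuses the sampson_cost function from the earlier sketch, assumes the fixed inter-camera transforms are available as 4x4 matrices (from the FLIR calibration), and works with pixel coordinates, so the inlier threshold is a placeholder.

```python
import numpy as np

def skew(t):
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def votes_of_camera(T_m, matches_n, A_nm, K, threshold=1.0):
    # T_m: 4x4 motion estimate of camera m, expressed in camera m coordinates.
    # A_nm: fixed 4x4 extrinsic transform taking camera-m coordinates to
    #       camera-n coordinates (assumed available from the FLIR calibration).
    # matches_n: (p_prev, p_next) pixel-coordinate matches found in camera n.
    T_n = A_nm @ T_m @ np.linalg.inv(A_nm)   # motion expressed in camera n frame
    R, t = T_n[:3, :3], T_n[:3, 3]
    E = skew(t) @ R                          # Eq. (6)
    Kinv = np.linalg.inv(K)
    F = Kinv.T @ E @ Kinv                    # Eq. (7)
    U, S, Vt = np.linalg.svd(F)              # rank-2 enforcement
    S[2] = 0.0
    F = U @ np.diag(S) @ Vt
    p_prev, p_next = matches_n
    return int(np.sum(sampson_cost(F, p_prev, p_next) < threshold))

def select_voted_estimate(estimates, matches, A, K):
    # estimates[m]: 4x4 motion of camera m; matches[n]: matches of camera n;
    # A[n][m]: extrinsic transform from camera m to camera n.
    votes = [sum(votes_of_camera(estimates[m], matches[n], A[n][m], K)
                 for n in range(5)) for m in range(5)]
    return int(np.argmax(votes))
```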

2.3 Translation scale

Since we are using a multi-camera system with known extrinsic parameters, one might expect that the translation scale could be retrieved, as many authors have done. However, due to the small distance between the cameras, contrasting with the large distance between the sensor and the scene points, the Ladybug2 system acts almost like a central sensor. In other words, the optical centers of all cameras can be approximated to be at the same place and, without a prominent baseline, the scale cannot be reliably retrieved. In the graphs showing the results, the translation vector resulting from visual odometry has been scaled such that its norm is equal to the ground truth translation norm at the corresponding instant.
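For reference, the scaling used for plotting amounts to the following small snippet (illustrative names only):

```python
import numpy as np

def align_scale(t_vo, t_gt):
    # Rescale the estimated translation so that its norm equals the
    # ground-truth translation norm at the corresponding instant.
    return t_vo * (np.linalg.norm(t_gt) / np.linalg.norm(t_vo))
```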

3 Experiments and results

This section is dedicated to the experiments and results of this work. The experimental procedure is first described. Then, the estimated trajectories of the five monocular odometries and of the voted estimate are presented. Afterwards, accuracy is studied in greater detail by analyzing graphs of the error of each component of the relative motion. Finally, execution time is discussed. The color code of the plots and graphs is as follows: the ground truth is represented in black, the voted odometry sequence in red, and the monocular odometries of cameras 0, 1, 2, 3 and 4 in yellow, gray, blue, green and magenta, respectively.

3.1 Experimental procedure

Experiments were carried out by driving a vehicle along a path of approximately 1.6 km at a mean speed of 27.4 km/h, in an environment with few buildings and mainly composed of vegetation. The vehicle is part of an autonomous car project, Carina2, led by ICMC, USP. It was equipped with a GPS, an IMU and the Ladybug2 system, fixed on a pole on the roof. Throughout this text, the individual cameras of the Ladybug2 are referred to according to Figure 2 (right), where camera 0 points to the front of the vehicle. The ground truth is composed of the GPS positions and the IMU's roll and pitch; the yaw is obtained by computing the angle of the trajectory in the XY-plane. The GPS has RTK correction and a position precision of 0.4 m (horizontal) and 0.8 m (vertical). The frame rate of the Ladybug2 sensor is 6 Hz, and the intrinsic and extrinsic calibration is provided by FLIR.

Figure 2: Left: Carina2. Right: Schematic drawing and reference frame of the Ladybug2 sensor.

3.2 Trajectories

Figure 3 shows the monocular visual odometry estimates obtained from each camera and the trajectory formed by the voted estimates in the XY-plane, i.e., as seen from above. None of the estimates is very accurate but, by visual inspection, the estimates from cameras 2 and 3, as well as the voted odometry, are the ones that most resemble the real trajectory. Regarding the odometry of the best cameras, the initial straight line drifts towards negative Y and some of the curves are estimated to be less tight than they actually are.

Figure 4 shows the same trajectories as the figure above, but in the XZ-plane, providing a view of the vertical motion. While cameras 0, 1 and 4 drift towards negative Z, cameras 2 and 3 tend to drift up. The voted odometry is the one that stays closest to Z = 0, keeping the most realistic height.

The sequence of the best (voted) odometry estimates is again shown in Figure 5, where the best camera at a particular instant is indicated by a marker of the corresponding color. Although some tendencies are observed along different parts of the trajectory, every camera is, from time to time, voted the best.

3.3 Relative motion errors

This section compares the relative motion errors of the voted odometry against those of each monocular estimate.


Figure 3: Monocular visual odometry estimates (cameras 0 to 4). Sequence of voted estimates (best camera). XY-plane.

Figure 4: Monocular visual odometry estimates (cameras 0 to 4). Sequence of voted estimates (best camera). XZ-plane.

Figure 5: Left: Sequence of best cameras. Right: Detail.

The relative motion error is computed for each component of the trajectory (translation in X, Y and Z and rotation about X, Y and Z) as follows:

$$\Delta \mathrm{Err} = (\alpha_{vo,t+1} - \alpha_{vo,t}) - (\alpha_{gt,t+1} - \alpha_{gt,t}), \qquad (8)$$

where α can be any of the mentioned motion components, gt stands for ground truth, and vo for visual odometry.


As the relative rotation errors are similar for every odometry, only the relative translation errors are presented.
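As an illustration, Eq. (8) amounts to differencing the frame-to-frame increments of each motion component, e.g.:

```python
import numpy as np

def relative_error(alpha_vo, alpha_gt):
    # Eq. (8): difference of the frame-to-frame increments of one motion
    # component (e.g. translation in Z) between visual odometry and ground truth.
    return np.diff(alpha_vo) - np.diff(alpha_gt)
```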

To facilitate the temporal correspondence between the error graphs and the trajectory, some points on the path are marked and named, both in the ground truth trajectory (Figure 6) and in the error graphs.

Figure 6: Ground truth trajectory with markers.

Figures 7, 8, 9, 10 and 11 show the relative translation errors in X, Y and Z with respect to the ground truth. The errors of the voted odometry are shown together with the errors of each of the monocular odometries in separate plots. Although the trajectory plots suggest that the voted odometry is the most similar to the ground truth, its relative X and Y translation estimates are, in general, outperformed by the monocular odometries, especially by those provided by cameras 0, 2 and 3. For instance, the relative translation error in X and Y for camera 3 stays around a low value with few large variations between markers G and I; in that same interval, the voted odometry varies frequently around positive values, visibly larger than those of camera 3. For cameras 1 and 4, monocular visual odometry is not significantly more accurate than the voted odometry in XY, which can be exemplified by camera 4's error between markers D and E. The error in Z is considerably smaller for the voted odometry than for any monocular estimate; this is, for instance, visible for camera 0 between markers G and H, and for camera 3 between markers B and C. Overall, the relative motion errors change in sign and magnitude along the trajectory. This behavior could be due to changes in the environment at different parts of the path.

3.4 Execution time

The program ran on a dual-core i5-3317U laptop. Processing most frames takes no longer than 0.5 seconds. This means that the bag file containing the 6 Hz image sequence collected on the road had to be played at 0.3 times the original speed. Therefore, this implementation is not suitable for real-time applications with the current hardware.

Figure 7: Errors for relative translation in X, Y, Z. Voted odometry (red) and camera 0 (yellow).

Figure 8: Errors for relative translation in X, Y, Z. Voted odometry (red) and camera 1 (gray).

Figure 9: Errors for relative translation in X, Y, Z. Voted odometry (red) and camera 2 (blue).

4 Conclusion

This paper proposes a two-step visual odometry method for an omnidirectional multi-camera system. In a first step, monocular visual odometry is carried out for each camera and, in a second phase, the best estimate is selected through a voting scheme.


Figure 10: Errors for relative translation in X, Y, Z. Voted odometry (red) and camera 3 (green).

Figure 11: Errors for relative translation in X, Y, Z. Voted odometry (red) and camera 4 (magenta).

Here, the feature matches of all cameras test (or "vote for") each motion estimate obtained in the first step. Regarding the trajectories, the voted odometry does not drift vertically as much as any of the monocular estimates; this result is confirmed by its low relative translation error in Z.

The proposed method can be further improved, and some ideas are left as future work. In order to overcome the low frame rate of the sensor, more robust feature extraction and tracking techniques can be used. Furthermore, better state-of-the-art monocular visual odometry algorithms can be explored. Execution time can be decreased by down-scaling images before processing, using a faster feature extraction method, or considering heuristics to facilitate feature matching.

Acknowledgments

The authors would like to thank CNPq for the scholarship given to the first author. Furthermore, special thanks go to the LRM (Mobile Robotics Laboratory) at ICMC/USP for providing Carina2 and the Ladybug2 sensor for the experiments of this work.

References

Clipp, B., Kim, J., Frahm, J., Pollefeys, M. and Hartley, R. I. (2008). Robust 6DoF motion estimation for non-overlapping, multi-camera systems, 9th IEEE Workshop on Applications of Computer Vision, pp. 1–8.

Fischler, M. A. and Bolles, R. C. (1981). Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography, Communications of the ACM, pp. 381–395.

Geiger, A. (2010). Libviso2: C++ library for visual odometry 2 (http://www.cvlibs.net/software/libviso/). Software information.

Hartley, R. (1997). In defence of the 8-point algorithm, IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 580–593.

Kazik, T., Kneip, L., Nikolic, J., Pollefeys, M. and Siegwart, R. (2012). Real-time 6D stereo visual odometry with non-overlapping fields of view, Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 1529–1536.

Kim, J.-H., Hartley, R., Frahm, J.-M. and Pollefeys, M. (2007). Visual odometry for non-overlapping views using second-order cone programming, Computer Vision – ACCV 2007, pp. 353–362.

Lee, G. H., Fraundorfer, F. and Pollefeys, M. (2013). Motion estimation for self-driving cars with a generalized camera, Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pp. 2746–2753.

Lucas, B. D. and Kanade, T. (1981). An iterative image registration technique with an application to stereo vision, Proceedings of the Imaging Understanding Workshop, pp. 121–130.

Rosten, E. and Drummond, T. (2006). Machine learning for high speed corner detection, 9th European Conference on Computer Vision, pp. 430–443.

Singh, A. (2015). An OpenCV based implementation of monocular visual odometry (https://github.com/avisingh599/mono-vo). GitHub repository.

Trucco, E. and Verri, A. (1998). Introductory Techniques for 3-D Computer Vision, Prentice Hall.
