
Mapping with Depth Panoramas

Camillo J. Taylor, Anthony Cowley, Rafe Kettler, Kai Ninomiya, Mayank Gupta and Boyang Niu 1

Abstract— This work demonstrates the use of depth panoramas in the construction of detailed 3D models of extended environments. The paper describes an approach to the acquisition of such panoramic images using a robotic platform that collects sequences of depth images with a commodity depth sensor. These sequences are stitched into panoramic images that efficiently capture scene information while reducing noise in the captured imagery. Scene structure is extracted from the panoramas and used to register a collection of panoramas to obtain coverage of extended areas. The presented approach maintains fine geometry detail in 3D reconstructions, uses a modest amount of memory, and enjoys excellent scaling characteristics across large environments.

I. INTRODUCTION

Depth sensors such as Microsoft’s Kinect and Kinect One have made it possible to rapidly collect detailed depth maps that evince fine geometric detail suitable for recognition and pose estimation of everyday objects [1]. This is a wonderful story for operating robots within small workspaces for which dense 3D models may be maintained. An evolving story is how this plays out as one considers ever larger work spaces. While much detail is available, reaping this whirlwind involves dealing with a huge amount of data collected over time.

Consider a scenario in which an office building has been scanned, is being monitored, or is in the process of being scanned. One might request that a robot fetch a particular set of keys from somewhere in the building. To provide such a capability, one must be able to work with an extended work space, yet still maintain sufficient geometric detail such that small objects may be identified during data perusal. Similar requirements apply to robotics applications that must provide motion planning for small vehicles through cluttered environments. Planning a route for a small aerial vehicle through an industrial environment requires adeptness at both small-scale precision and large-scale scope.

Such requirements argue against dense representations of space as it is difficult to scale such approaches to large environments. Additionally, large environments are typically scanned over a non-negligible period of time, raising the question of how easily a representation may be updated, and if those updates may be done concurrently by multiple agents contributing to the same data set. While dense voxel representations have been successfully employed [2], [3],

*This work was supported by the ARL RCTA program and the Micro Autonomous Systems and Technologies (MAST) CTA.

1 The authors are all with the GRASP Laboratory at the University of Pennsylvania, Philadelphia, PA. Please address correspondence to: [email protected]

they tend to be constrained to small work spaces, resist updates upon revisiting a location, or, most commonly, involve a tradeoff between resolution and memory consumption. Sparse, tree-based voxel representations have been explored [4], but these do not easily admit concurrent updates, and tend to place somewhat coarse limits on spatial resolution.

In more recent work Whelan et al. [5] have demonstrated an impressive system where a global mesh representation, produced incrementally from a volumetric method, can be maintained and updated in real-time in the face of loop closures. In this paper we propose an approach based on a set of local representations. Maintaining multiple local maps allows us to easily handle situations where we observe details in the scene at different levels of resolution as we move through the environment.

Our representation is based upon the notion of a graph of depth keyframes as proposed by Meilland and Comport [6]. In this approach, a collection of depth images are registered to each other to provide memory-efficient coverage of extended areas. While Meilland takes the opportunity of continuous depth frame collection to improve the quality of individual keyframes, an active perception approach, wherein the collection of imagery drives a control input to the system gathering the data, permits enormous flexibility in working with virtual cameras for the purpose of keyframe creation.

We propose the use of 360° panoramic depth keyframes collected by a robotic platform. This approach makes good use of the slowest part of data acquisition: terrain traversal. It also better exploits the empty space surrounding every valid point from which imagery may be captured. While all depth images collapse free space along rays extending from the capture device, a panorama reflects the fact that the capture location is, by definition, completely surrounded by free space.

II. TECHNICAL APPROACH

This section describes the system that we have implemented for acquiring, constructing, analyzing and visualizing depth panoramas. The various stages in the pipeline are described in more detail in the following subsections.

A. Constructing Depth Panoramas

In order to acquire the requisite measurements for our depth panorama we built the system shown in Figure 1. This robot consists of a Kinect One RGBD sensor mounted on top of an iRobot Create platform. With this system we were able to systematically collect depth frames as the robot rotated and translated through the scene. It also allowed us to correlate these RGBD images with the odometric measurements gleaned from the Create platform.

Fig. 1. A mobile robot equipped with a Kinect One depth camera was used to acquire the raw data used to construct the panoramic depth representations.

Given a sequence of depth images acquired as the platform rotates in place, our system is able to automatically fuse them together into a composite depth panorama as shown in Figure 2.

The first phase in the panorama construction process is a frame to frame fusion step where successive images acquired as the robot rotates in place are registered to each other. The process begins with a rough alignment stage where the system considers a range of possible yaw angles between the two frames, informed by the odometry measurements, and determines which yaw angle best aligns the two depth frames.
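To make this coarse search concrete, the following Python sketch sweeps candidate yaw angles around the odometry estimate, rotates the first frame's points by each candidate, projects them into the second depth image with a pinhole model, and keeps the yaw with the best depth agreement. The intrinsics, the search window, and the mean-absolute-difference score are illustrative assumptions rather than details taken from the paper.

import numpy as np

def backproject(depth, fx, fy, cx, cy):
    # Convert a depth image (meters) into a 3xN array of camera-frame points.
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.ravel()
    valid = z > 0
    x = (u.ravel() - cx) * z / fx
    y = (v.ravel() - cy) * z / fy
    return np.vstack([x, y, z])[:, valid]

def yaw_rotation(theta):
    # Rotation about the camera's vertical (y) axis by theta radians.
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def score_yaw(points_a, depth_b, theta, fx, fy, cx, cy):
    # Rotate frame A's points by the candidate yaw, project them into
    # frame B, and measure how well the predicted depths agree with B.
    p = yaw_rotation(theta) @ points_a
    z = p[2]
    front = z > 0.1
    u = np.round(fx * p[0, front] / z[front] + cx).astype(int)
    v = np.round(fy * p[1, front] / z[front] + cy).astype(int)
    h, w = depth_b.shape
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    pred = z[front][inside]
    meas = depth_b[v[inside], u[inside]]
    ok = meas > 0
    if not np.any(ok):
        return np.inf
    return np.mean(np.abs(pred[ok] - meas[ok]))

def coarse_yaw(depth_a, depth_b, yaw_odom, fx, fy, cx, cy,
               window=np.radians(10), step=np.radians(0.5)):
    # Sweep candidate yaws around the odometry estimate and keep the best one.
    pts = backproject(depth_a, fx, fy, cx, cy)
    candidates = np.arange(yaw_odom - window, yaw_odom + window, step)
    scores = [score_yaw(pts, depth_b, t, fx, fy, cx, cy) for t in candidates]
    return candidates[int(np.argmin(scores))]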

This estimate is used to initialize an Iterative Closest Point (ICP) algorithm which produces a refined estimate for the rotation between the two frames in SO(3). The ICP procedure consists of two phases: the first phase determines correspondences between the two frames based on an initial estimate of their relative pose; the second phase computes an updated estimate for the relative pose based on those correspondences.

We make use of the structure of the depth map to simplify the first phase of the ICP procedure. More specifically, we use our prevailing estimate for the displacement between the two frames to predict where the points viewed in the first frame would appear in the second frame. We then search for correspondences in the vicinity of these projected locations as opposed to searching through all of the points in the second image. This simplification exploits the fact that the inputs are depth images as opposed to general point clouds.
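The projective data association described above might look like the following sketch: each point from the first frame is projected into the second image under the current pose estimate, and a correspondence is sought only within a small pixel window around the predicted location. The window size and the 5 cm acceptance gate are illustrative choices, not values from the paper.

import numpy as np

def projective_correspondences(points_a, depth_b, R, t, fx, fy, cx, cy, win=2):
    # points_a: 3xN points from the first frame; (R, t): current estimate of
    # the transform taking frame-A coordinates into frame B.
    p = R @ points_a + t.reshape(3, 1)
    h, w = depth_b.shape
    matches = []
    for i in range(p.shape[1]):
        x, y, z = p[:, i]
        if z <= 0.1:
            continue
        u, v = int(round(fx * x / z + cx)), int(round(fy * y / z + cy))
        if not (win <= u < w - win and win <= v < h - win):
            continue
        # Search only a small window around the predicted pixel instead of
        # the whole second image; this is what the depth-image structure buys us.
        patch = depth_b[v - win:v + win + 1, u - win:u + win + 1]
        best, best_du, best_dv = np.inf, 0, 0
        for dv in range(-win, win + 1):
            for du in range(-win, win + 1):
                zb = patch[dv + win, du + win]
                if zb <= 0:
                    continue
                err = abs(zb - z)
                if err < best:
                    best, best_du, best_dv = err, du, dv
        if best < 0.05:  # accept only nearby surfaces (5 cm gate, illustrative)
            matches.append((i, u + best_du, v + best_dv))
    return matches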

The optimization procedure makes use of point to plane correspondences, so in addition to finding a suitable corresponding point in the second image we also consider the estimated surface normal at that location.

This process of estimating the relative displacement between successive depth images continues until all of the images in the sequence have been localized. At that point the system considers all of the frames in the sequence and identifies the point at which the view frustum returns to the first view in the sequence. The system then seeks to find the best pairwise alignment between this closing image in the sequence and the first image. This loop closure yields another set of point to plane correspondences linking the first and final frames.

Note that despite our best efforts the first and final frames will typically be slightly misaligned due to various bias and drift errors. This is corrected through a final global optimization step which considers all of the pairwise point to plane correspondences and all of the camera poses simultaneously to produce final pose estimates for each depth image in the sequence. More specifically, given a collection of point to plane correspondences between adjacent images we minimize the objective function given in Equation 1, which measures the sum of the squares of the point to plane distances:

\sum_{ijk} \Big( n^k_i \cdot \big[ R^t_i (R_j P^k_j + T_j) - R^t_i T_i - P^k_i \big] \Big)^2 \qquad (1)

In this expression R_i, R_j \in SO(3) and T_i, T_j \in \mathbb{R}^3 denote the rotations and translations associated with images i and j respectively. P^k_i and P^k_j denote the position vectors associated with corresponding surface elements, and n^k_i denotes the normal vector associated with the correspondence in frame i. In this optimization procedure the locations of the image centers are constrained to lie on a circle to model the geometric constraints imposed by our acquisition system. Note that this constraint simply acts as a prior and can easily be relaxed or removed to allow for situations where the sensor is carried by hand or by a quadrotor. In our experiments each panorama was constructed from a sequence of 30 depth images which were approximately evenly spaced.
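For concreteness, the summand of Equation 1 can be evaluated as in the sketch below, which could be handed to a generic nonlinear least-squares solver together with a rotation parameterization. The data layout (a pose dictionary and correspondence tuples) is an assumption for illustration, and the circular constraint on the image centers is omitted.

import numpy as np

def point_to_plane_residuals(poses, correspondences):
    # poses: dict frame_id -> (R, T) with R a 3x3 rotation, T a 3-vector.
    # correspondences: list of (i, j, P_i, P_j, n_i) tuples, one per term of
    # Equation (1): P_i, P_j are the matched points in frames i and j and
    # n_i is the surface normal at P_i, all expressed in camera coordinates.
    residuals = []
    for i, j, P_i, P_j, n_i in correspondences:
        R_i, T_i = poses[i]
        R_j, T_j = poses[j]
        # Map P_j into frame i, then measure its signed distance to the
        # plane through P_i with normal n_i (the summand of Equation (1)).
        P_j_in_i = R_i.T @ (R_j @ P_j + T_j) - R_i.T @ T_i
        residuals.append(float(n_i @ (P_j_in_i - P_i)))
    return np.array(residuals)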

Fig. 3. Final estimate for the relative position of the input frames after the global optimization stage. The red frustum corresponds to the first image in the sequence.


Fig. 2. The top row shows 6 of the 30 images that are used to construct the panoramic depth image shown in the second row.

Figure 3 shows the final pose estimates derived from this non-linear optimization procedure. The red frustum denotes the first frame in the sequence. As in other pose graph optimization problems our system benefits from the fact that the Hessian of our objective function is quite sparse since each term in the objective function only involves the interaction of adjacent images in the sequence.

Once the relative positions of all of the depth images have been estimated, we employ a rasterization procedure to project the depth measurements from each of the images into a panoramic image centered at the first frame as shown in Figure 2. Each pixel in this panoramic map corresponds to a ray defined by a pair of azimuth and elevation angles. Since the horizontal field of view of our depth camera is 70 degrees and the angular displacement between successive frames is approximately 12 degrees, each ray in the panorama is typically viewed by several of the constituent depth images. In the panoramic image in Figure 4 each of the pixels is colored by the number of depth images that contribute a measurement to that pixel.
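A minimal sketch of this rasterization step is shown below; it maps 3D points expressed in the panorama's reference frame to azimuth/elevation pixel coordinates and records, for each pixel, the nearest range and the number of contributing samples. The axis conventions and the nearest-sample rule are simplifying assumptions; the paper's fuller fusion rule is described next.

import numpy as np

def rasterize_to_panorama(points, pano_width, pano_height, elev_range):
    # points: 3xN array in the panorama's reference frame (the first camera).
    # Each pixel column corresponds to an azimuth, each row to an elevation.
    x, y, z = points
    r = np.sqrt(x**2 + y**2 + z**2)
    azimuth = np.arctan2(x, z)                               # range [-pi, pi]
    elevation = np.arcsin(np.clip(-y / np.maximum(r, 1e-9), -1.0, 1.0))
    elev_min, elev_max = elev_range
    cols = ((azimuth + np.pi) / (2 * np.pi) * pano_width).astype(int) % pano_width
    rows = ((elev_max - elevation) / (elev_max - elev_min) * pano_height).astype(int)
    pano = np.full((pano_height, pano_width), np.inf)        # range map (inf = empty)
    counts = np.zeros((pano_height, pano_width), dtype=int)  # samples per pixel (Fig. 4)
    inside = (rows >= 0) & (rows < pano_height)
    for rr, cc, rng in zip(rows[inside], cols[inside], r[inside]):
        # Keep the nearest range per pixel here; the fusion rule described in
        # the text additionally averages samples from the same surface.
        if rng < pano[rr, cc]:
            pano[rr, cc] = rng
        counts[rr, cc] += 1
    return pano, counts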

In fusing these measurements we consider the depth of the points to ensure that occlusions are handled properly and then average the measurements that appear to emanate from the same surface geometry. This averaging process provides an opportunity to aggregate information and reduce noise as in Kinect Fusion and related methods [2]. It also provides an opportunity to produce superresolution images as Meilland and Comport do [6].
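One simple way to realize this occlusion-aware averaging for a single panorama pixel is sketched below: keep the nearest observed surface and average only the samples that lie within a small band of it. The band width is an illustrative parameter, not a value reported in the paper.

def fuse_pixel_samples(samples, band=0.03):
    # samples: list of range measurements (meters) that landed on one pixel.
    # Keep the nearest observed surface and average only the samples that
    # appear to come from it; farther samples are treated as occluded geometry.
    # The 3 cm band is an illustrative threshold.
    if not samples:
        return float('nan')
    nearest = min(samples)
    same_surface = [s for s in samples if s - nearest < band]
    return sum(same_surface) / len(same_surface)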

Normal vectors are estimated at each pixel in the panorama by fitting a plane to neighboring pixels while taking care to exclude pixels that are at significantly different depths.
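A sketch of such a normal estimator is given below; it gathers the pixel's neighbors, discards those at significantly different depths, and takes the plane normal from a singular value decomposition of the remaining points. The window size and depth gap are illustrative.

import numpy as np

def estimate_normal(points_3d, valid, row, col, win=2, depth_gap=0.05):
    # points_3d: HxWx3 array of 3D points for each panorama pixel;
    # valid: HxW boolean mask of pixels with depth.
    # Fit a plane to the neighborhood of (row, col), skipping neighbors whose
    # range differs from the center by more than depth_gap (illustrative 5 cm).
    center = points_3d[row, col]
    nbhd = []
    for dv in range(-win, win + 1):
        for du in range(-win, win + 1):
            r, c = row + dv, col + du
            if 0 <= r < points_3d.shape[0] and 0 <= c < points_3d.shape[1] and valid[r, c]:
                p = points_3d[r, c]
                if abs(np.linalg.norm(p) - np.linalg.norm(center)) < depth_gap:
                    nbhd.append(p)
    if len(nbhd) < 3:
        return None
    P = np.array(nbhd) - np.mean(nbhd, axis=0)
    # The plane normal is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(P, full_matrices=False)
    n = vt[-1]
    # Orient the normal toward the panorama center (the origin of this frame).
    return n if np.dot(n, -center) > 0 else -n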

B. Analyzing Depth Panoramas

The comprehensive field of view afforded by the panoramic depth images provides an ample context in which to analyze scene structure. In this work we are principally concerned with indoor environments which are characterized by flat floors and vertical, axis-aligned wall surfaces; our analysis seeks to exploit those properties.

The first phase of analysis seeks to refine our estimate for the alignment of the panorama with respect to the gravity vector, g. Here we first identify pixels in the image whose normals, v_i, are approximately aligned with our nominal estimate for the gravity vector; we also consider the set of normal vectors, w_j, that are approximately perpendicular to the nominal gravity vector.

Our goal is to select a gravity vector g that is aligned with the v_i vectors and perpendicular to the w_j vectors. To do this we pose the following optimization problem:

\max_{g \in \mathbb{R}^3} \; \sum_i (g \cdot v_i)^2 - \sum_j (g \cdot w_j)^2 \qquad (2)

This amounts to finding the eigenvector corresponding to the largest eigenvalue of the matrix A = \sum_i v_i v_i^t - \sum_j w_j w_j^t.
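In code this reduces to a few lines; the sketch below builds the matrix implied by Equation 2 and returns its dominant eigenvector. The candidate normals are assumed to be stacked row-wise in two arrays.

import numpy as np

def refine_gravity(v_vectors, w_vectors):
    # v_vectors: Nx3 normals roughly parallel to the nominal gravity direction;
    # w_vectors: Mx3 normals roughly perpendicular to it.
    # Maximizing Equation (2) is an eigenvalue problem on the matrix A below.
    A = v_vectors.T @ v_vectors - w_vectors.T @ w_vectors
    eigvals, eigvecs = np.linalg.eigh(A)
    g = eigvecs[:, np.argmax(eigvals)]   # eigenvector of the largest eigenvalue
    return g / np.linalg.norm(g)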

Once this gravity vector has been recovered, the second phase seeks to estimate the dominant rectilinear structure of the scene. This can be done by employing the entropy compass idea described in [7]. To do this we consider all of the points in the image whose normals are approximately perpendicular to the gravity direction, that is, the points that correspond to vertical surfaces. Figure 5 shows the projection of these points onto the ground plane.

We then take this 2D point set, rotate it around the vertical axis in 1 degree increments, and for each such yaw angle we compute the entropy of the X and Y coordinates. When we plot the sum of these entropies as a function of angle we notice that the global minimum reliably corresponds to the dominant orientation of the structure. We extract this angle and use it to automatically produce axis aligned coordinates as shown in Figure 5.
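A compact sketch of this entropy-compass search follows. The bin size for the coordinate histograms and the 90 degree search range (rectilinear structure repeats every 90 degrees) are illustrative choices.

import numpy as np

def axis_entropy(values, bin_size=0.05):
    # Shannon entropy of a 1D coordinate histogram (5 cm bins, illustrative).
    nbins = max(int(np.ptp(values) / bin_size), 1)
    hist, _ = np.histogram(values, bins=nbins)
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log(p)).sum())

def dominant_orientation(xy):
    # xy: Nx2 points on vertical surfaces projected onto the ground plane.
    # Rotate in 1 degree steps and keep the yaw minimizing the summed entropy.
    best_angle, best_score = 0, np.inf
    for deg in range(90):
        t = np.radians(deg)
        R = np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])
        rotated = xy @ R.T
        score = axis_entropy(rotated[:, 0]) + axis_entropy(rotated[:, 1])
        if score < best_score:
            best_angle, best_score = deg, score
    return best_angle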

Figure 6 shows the surface normals produced after this alignment process. In this depiction the color indicates the orientation of the underlying surface, with green corresponding to the vertical axis and red and blue corresponding to the two dominant horizontal axes.

Given this axis aligned representation we can carry out further analysis steps to identify salient planar surfaces in the scene.


Fig. 4. This figure indicates the number of depth samples that are rasterized to each pixel in the panoramic image. Brighter colors indicate more samples.


Fig. 5. (a) Shows an overhead view of the points in a panoramic map. (b) Shows the result of an entropy analysis indicating that a rotation by 10 degrees would produce an optimal alignment. (c) The result of applying the estimated rotation to the original points to produce an axis aligned variant.

Fig. 6. The upper image in this figure depicts the normals associated with each pixel after they have been transformed by an axis aligning rotation. The colors indicate the orientation, green for vertical, red and blue for the dominant rectilinear surfaces. The bottom image shows the output of a surface extraction procedure that extracts salient axis aligned surfaces in the scene. (This figure is best viewed in color)

Figure 6 shows the results of this analysis where different colors correspond to different extracted surface components. This stage of processing employs standard image processing algorithms like morphological filtering and connected components to help reduce error and to group coplanar surface elements. Effectively the panoramic depth image representation allows us to use 2D image based operations to help understand the 3D structure of the scene.
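The sketch below illustrates this kind of 2D processing using SciPy's standard morphology and labeling routines; the paper does not specify its exact operators or thresholds, so the structuring element, the normal-alignment threshold, and the minimum component size here are illustrative.

import numpy as np
from scipy import ndimage

def extract_axis_aligned_surfaces(normals, axis, cos_thresh=0.95, min_pixels=200):
    # normals: HxWx3 unit normals in the axis-aligned frame; axis: one of the
    # three dominant directions, e.g. np.array([0, 0, 1]) for the floor.
    mask = (normals @ axis) > cos_thresh
    # Morphological opening removes speckle caused by noisy normals.
    mask = ndimage.binary_opening(mask, structure=np.ones((3, 3)))
    # Connected components group pixels that belong to the same surface patch.
    labels, count = ndimage.label(mask)
    sizes = ndimage.sum(mask, labels, index=np.arange(1, count + 1))
    keep = {i + 1 for i, s in enumerate(sizes) if s >= min_pixels}
    labels[~np.isin(labels, list(keep))] = 0
    return labels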

C. Registration of Depth Panoramas

In addition to using the Iterative Closest Point procedure to align images within a panorama, we also use it to estimate the relationship between panoramas.

In our experiment the robot was guided along a path through a laboratory complex and panoramas were captured at various points along the trajectory. The robot odometry provided an initial estimate for the relative position and orientation of the panoramas which was subsequently refined by applying the Iterative Closest Point procedure to neighboring depth panoramas. This refinement is also powerfully constrained by the analysis described in the previous subsection, which accurately estimates the orientation of each frame with respect to a common axis aligned frame. Once again, in this version of the ICP procedure corresponding surface elements are found by computing where points from one panorama project into the other. The system then repeatedly optimizes the point to plane distances induced by these correspondences.

In Figure 7 the red circles denote the recovered positions of the panoramas in a global frame of reference. The blue dots represent observed points on vertical surfaces in the scene. The representation covers an area of approximately 400 square meters.

More generally our representation can be viewed as a graph where the nodes correspond to panoramic frames and the edges are annotated with estimates for the relative pose between two frames. This model corresponds to the pose graph formulation embodied in systems like g2o [8]. Updates induced by loop closures can be handled by updating the relative pose estimates on the edges without necessarily changing the local maps. This allows us to maintain accurate local representations of the scene without enduring the penalty of updating large scale mesh representations when loop closures are detected.
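A minimal sketch of such a pose graph is shown below: nodes hold the panoramas themselves, edges hold relative pose estimates, and loop closures only touch the edges. The class and method names are hypothetical, chosen for illustration rather than taken from any particular library.

import numpy as np

class PanoramaGraph:
    # Nodes hold local panoramic maps; edges hold relative pose estimates.
    # Loop closures update edges (or re-optimized relative poses) without
    # ever touching the panorama data itself.
    def __init__(self):
        self.panoramas = {}   # node_id -> depth panorama (e.g. an HxW array)
        self.edges = {}       # (id_a, id_b) -> (R_ab, t_ab): x_a = R_ab x_b + t_ab

    def add_panorama(self, node_id, pano):
        self.panoramas[node_id] = pano

    def add_edge(self, id_a, id_b, R_ab, t_ab):
        self.edges[(id_a, id_b)] = (np.asarray(R_ab), np.asarray(t_ab))

    def update_edge(self, id_a, id_b, R_ab, t_ab):
        # Called when a loop closure or re-registration refines a relative pose.
        self.edges[(id_a, id_b)] = (np.asarray(R_ab), np.asarray(t_ab))

    def relative_pose(self, path):
        # Compose edge transforms along a list of node ids (with edges stored
        # in that direction) to express the last frame in the coordinates of
        # the first; no single global frame is ever required.
        R, t = np.eye(3), np.zeros(3)
        for a, b in zip(path[:-1], path[1:]):
            R_ab, t_ab = self.edges[(a, b)]
            R, t = R @ R_ab, R @ t_ab + t
        return R, t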

Note that while it is possible to use the relative pose estimates in the graph to relate all of the local panoramic maps into a common frame of reference as shown in Figure 7, this is not required. This graphical representation, where we encode the relative location of local maps, provides enough information to carry out tasks like navigation, planning and visualization. In many situations accurate estimates of the relative location of nearby frames are preferable to representations in a single global frame which can be distorted by drift and other systematic errors.

D. Visualization

The previous sections describe an approach to constructing and registering a collection of depth panoramas of the scene. Taken together this collection of depth images is somewhat analogous to an atlas of charts in differential geometry in that this set of local representations is used to cover an extended 3D environment.

Once this representation has been constructed it can be used to predict how the scene would appear from novel vantage points. This is done by considering all of the panoramas that are sufficiently close to the new viewpoint to contain relevant information. For each such panorama we rasterize the recovered geometry into the novel view. This rasterization process properly accounts for occlusions, back face culling and field of view limitations. For each pixel in the rendered view we can also compute a confidence value which encodes the reliability that the panorama places on that point. This reliability value is related to the distance between the rendered point and the panorama center and the difference in viewing angle between the panorama and the novel viewpoint.

Fig. 8. This figure shows a collection of novel views of the scene constructed by our visualization procedure which fuses the results from multiple panoramas to produce a final rendering.


Fig. 7. This figure shows an overhead view of the area explored by the robot. The locations of the panoramic frames are denoted by the red circles.


These confidence values are used to fuse the views produced from several panoramas into a final rendered result. This procedure effectively allows the system to combine information from several panoramas into a final representation which takes advantage of the different perspectives offered by the different panoramas. Figure 8 shows samples of rerenderings produced by this scheme. Notice the details like the objects on the table that are preserved in the representation.
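The sketch below illustrates one plausible form of this confidence-weighted fusion; the particular fall-off model for the confidence (Gaussian in distance and viewing-angle difference) and its parameters are assumptions, since the paper does not give the exact weighting function.

import numpy as np

def confidence(distance, view_angle, sigma_d=3.0, sigma_a=np.radians(30)):
    # One plausible confidence model: reliability falls off with the distance
    # from the panorama center and with the angular difference between the
    # panorama's viewing ray and the novel ray (parameters are illustrative).
    return np.exp(-(distance / sigma_d) ** 2) * np.exp(-(view_angle / sigma_a) ** 2)

def fuse_rendered_views(depth_maps, confidence_maps):
    # depth_maps, confidence_maps: lists of HxW arrays, one pair per nearby
    # panorama rasterized into the novel view (NaN where nothing projects).
    # Each output pixel takes the confidence-weighted average of its samples.
    depth = np.stack(depth_maps)
    conf = np.stack(confidence_maps)
    conf = np.where(np.isnan(depth), 0.0, conf)
    depth = np.nan_to_num(depth)
    total = conf.sum(axis=0)
    fused = np.where(total > 0,
                     (conf * depth).sum(axis=0) / np.maximum(total, 1e-9),
                     np.nan)
    return fused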

III. CONCLUSIONS AND FUTURE WORK

In this paper we advocate the use of panoramic depth images to effectively and efficiently represent extended environments. Panoramic depth images support flexible 3D reconstruction and geometry analysis tasks across large volumes of space, permit concurrent updates to different parts of the map, support simple streaming data usage models, and make modest demands of memory and storage. They take advantage of the observation that all of our depth imagery is captured from locations fully surrounded by vacant space whose representation is perfectly compressed by depth images. The panoramas themselves represent excellent data for scene analysis and object recognition tasks. They are capable of representing the surrounding scene with sub-centimeter precision, and offer image structure that eases analysis of captured 3D surfaces. The approach is compatible with pose graph based mapping schemes and allows us to accurately represent local areas in the environment while allowing for efficient updates of the relative poses of the frames as loop closures become available. We argue that this method provides an efficient foundation on which to base many robotics or visualization tasks that involve extended 3D environments.

A number of researchers including Matthies et al. [9] have observed that depth maps provide a good basis for planning collision free paths for UAVs and other robots that need to move through cluttered 3D environments. In future work we plan to use our atlas of depth images to plan paths through extended environments. Another promising direction involves exploiting the connection between the color and depth images acquired by the Kinect sensor. These color images can provide feature correspondences which can be used to relate neighboring panoramas. They also provide further context information which can be used to enhance the analysis of the depth imagery or to infer missing depth values.

Tightly integrating visualization with data acquisition provides an opportunity to make data-sensitive choices about the representation. While the data used in this paper were collected with an ad hoc keyframe selection procedure in which a robot’s trajectory is periodically interrupted for panorama collection, one may take a more reasoned approach to determine where spatially dense sampling is called for due to geometric complexity.


Fig. 9. This figure shows a subset of the depth panoramas constructed from the robot’s tour of the office complex. Each depth panorama can be stored in less than 2.25 megabytes and the entire office complex can be represented in under 36 megabytes.

This may be accomplished by simulating denser panorama sampling along the robot’s trajectory between two keyframe locations, and observing how the amount of missing data in each panorama changes. For instance, if the robot collects a panorama from a position on one side of an open doorway leading off of a corridor, then collects another panorama after moving down the corridor beyond the doorway, a panorama synthesized from an intermediate position along the trajectory that includes views through the open doorway will contain large regions of missing data. This may be observed as a spike in the amount of missing data if one considers the sequence of panoramic views collected and synthesized along this trajectory segment.
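A sketch of this missing-data heuristic appears below; it computes the fraction of empty pixels in each collected or synthesized panorama along a trajectory segment and flags positions where that fraction spikes relative to its neighbors. The spike threshold is illustrative.

import numpy as np

def missing_fraction(pano):
    # Fraction of panorama pixels with no depth (NaN or non-positive range).
    bad = ~np.isfinite(pano) | (pano <= 0)
    return float(bad.mean())

def keyframe_positions_needed(panoramas, spike=0.15):
    # panoramas: depth panoramas collected or synthesized along a trajectory
    # segment, in order. Flag positions where the missing-data fraction jumps
    # relative to its neighbors (the 0.15 jump threshold is illustrative).
    f = np.array([missing_fraction(p) for p in panoramas])
    flags = []
    for k in range(1, len(f) - 1):
        if f[k] - 0.5 * (f[k - 1] + f[k + 1]) > spike:
            flags.append(k)
    return flags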

REFERENCES

[1] A. Cowley, B. Cohen, W. Marshall, C. Taylor, and M. Likhachev, “Perception and motion planning for pick-and-place of dynamic objects,” in Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on, Nov 2013, pp. 816–823.

[2] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon, “KinectFusion: Real-time dense surface mapping and tracking,” in Proceedings of the 2011 10th IEEE International Symposium on Mixed and Augmented Reality, ser. ISMAR ’11. Washington, DC, USA: IEEE Computer Society, 2011, pp. 127–136. [Online]. Available: http://dx.doi.org/10.1109/ISMAR.2011.6092378

[3] T. Whelan, J. McDonald, M. Kaess, M. Fallon, H. Johannsson, and J. Leonard, “Kintinuous: Spatially extended KinectFusion,” in RSS Workshop on RGB-D: Advanced Reasoning with Depth Cameras, Sydney, Australia, Jul 2012. [Online]. Available: http://www.cs.nuim.ie/research/vision/data/rgbd2012/citation.html

[4] J. Chen, D. Bautembach, and S. Izadi, “Scalable real-time volumetric surface reconstruction,” ACM Trans. Graph., vol. 32, no. 4, pp. 113:1–113:16, July 2013. [Online]. Available: http://doi.acm.org/10.1145/2461912.2461940

[5] T. Whelan, M. Kaess, H. Johannsson, M. Fallon, J. J. Leonard, and J. McDonald, “Real-time large-scale dense RGB-D SLAM with volumetric fusion,” International Journal of Robotics Research, vol. 34, no. 4-5, pp. 598–626, Apr. 2015. [Online]. Available: http://dx.doi.org/10.1177/0278364914551008

[6] M. Meilland and A. Comport, “On unifying key-frame and voxel-based dense visual SLAM at large scales,” in International Conference on Intelligent Robots and Systems. Tokyo, Japan: IEEE/RSJ, 3-8 November 2013. [Online]. Available: http://www.i3s.unice.fr/~comport/publications/2013_IROS_Meilland.pdf

[7] A. Cowley, C. J. Taylor, and B. Southall, “Rapid multi-robot exploration with topometric maps,” in Proc. IEEE International Conference on Robotics and Automation (ICRA), 2010.

[8] R. Kummerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard, “g2o: A general framework for graph optimization,” in Robotics and Automation (ICRA), 2011 IEEE International Conference on. IEEE, 2011, pp. 3607–3613.

[9] L. Matthies, R. Brockers, Y. Kuwata, and S. Weiss, “Stereo vision-based obstacle avoidance for micro air vehicles using disparity space,” in Robotics and Automation (ICRA), 2014 IEEE International Conference on, May 2014, pp. 3242–3249.

[10] R. Szeliski and H.-Y. Shum, “Creating full view panoramic image mosaics and environment maps,” in Computer Graphics (SIGGRAPH ’97 Proceedings). Los Angeles: Association for Computing Machinery, Inc., August 1997, pp. 251–258. [Online]. Available: http://research.microsoft.com/apps/pubs/default.aspx?id=75673