A Feature-based Approach for Dense Segmentation and Estimation of Large Disparity Motion


International Journal of Computer Vision 68(2), 125–143, 2006

© 2006 Springer Science + Business Media, LLC. Manufactured in The Netherlands.

DOI: 10.1007/s11263-006-6660-3

A Feature-based Approach for Dense Segmentation and Estimation of Large Disparity Motion

JOSH WILLS, SAMEER AGARWAL AND SERGE BELONGIE

Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA 92093, USA

[email protected]

[email protected]

[email protected]

Received August 4, 2004; Revised October 20, 2005; Accepted November 16, 2005

First online version published in March, 2006

Abstract. We present a novel framework for motion segmentation that combines the concepts of layer-based methods and feature-based motion estimation. We estimate the initial correspondences by comparing vectors of filter outputs at interest points, from which we compute candidate scene relations via random sampling of minimal subsets of correspondences. We achieve a dense, piecewise smooth assignment of pixels to motion layers using a fast approximate graph cut algorithm based on a Markov random field formulation. We demonstrate our approach on image pairs containing large inter-frame motion and partial occlusion. The approach is efficient and it successfully segments scenes with inter-frame disparities previously beyond the scope of layer-based motion segmentation methods. We also present an extension that accounts for the case of non-planar motion, in which we use our planar motion segmentation results as an initialization for a regularized Thin Plate Spline fit. In addition, we present applications of our method to automatic object removal and to structure from motion.

Keywords: motion segmentation, RANSAC, Markov Random Field, layer-based motion, metric labeling problem, graph cuts, periodic motion

1. Introduction

Consider the pair of images shown in Fig. 1. These two images were captured by an aquarium webcam on a pan-tilt head. For a human observer, a brief examination of the images reveals what happened from one frame to the next: the lower fish swam down and darted forward and the upper fish moved forward slightly; meanwhile, the camera panned to the left about a third of the image width. Even without color information, this is a simple task for the human visual system. The same cannot be said for any existing computer vision system. What makes this problem difficult from a computational perspective? There are a number of complicating factors, including the following: (1) due to the low frame rate, the motion between frames is a significant fraction of the image size, (2) the moving objects are relatively small and have few features compared to the richly textured background, (3) the poses of the fish change as they swim, (4) because of the panning motion of the camera, the second frame has motion blur.

Finding out what went where in two frames of an image sequence is an instance of the motion segmentation problem. Formally, motion segmentation consists of (1) finding groups of pixels in two or more frames that move together, and (2) recovering the motion fields associated with each group. Motion segmentation has wide applicability in areas such as video coding, content-based video retrieval, and mosaicking. In its full generality, the problem cannot be solved since infinitely many constituent motions can explain the changes from one frame to another. Fortunately,


Figure 1. Two consecutive frames from a saltwater aquarium webcam.

in real scenes the problem is simplified by the observation that objects are usually composed of spatially contiguous regions and the number of independent motions is significantly smaller than the number of pixels. Operating under these assumptions, we propose a new motion segmentation algorithm for scenes containing objects with large inter-frame motion. The algorithm leverages and builds upon established techniques for robust estimation of motion fields and discontinuity preserving smoothing in a novel combination that delivers the first dense, layer-based motion segmentation method for the case of large (non-differential) motions.

The structure of the paper is as follows. We begin in Section 2 with an overview of related work. In Section 3, we detail the components of our approach to the problem of motion segmentation; this is the primary contribution of this work. We present experimental results for this approach in Section 4. An extension to non-planar motion is presented in Section 5. Applications of our method are presented in Section 6. The paper concludes in Section 8.

2. Related Work

Early approaches to motion segmentation were based on estimating dense optical flow. The optical flow field was assumed to be piecewise smooth to account for discontinuities due to occlusion and object boundaries (see for example Ayer and Sawhney, 1995; Black and Jepson, 1996; Odobez and Bouthemy, 1998). Darrell and Pentland (1991) and Wang and Adelson (1993) introduced the idea of decomposing the image sequence into multiple overlapping layers, where each layer is a smooth motion field. Weiss (1997) extended this approach to account for flexible motion fields using regularized radial basis functions (RBFs).

Optical flow based methods are limited in their ability to handle large inter-frame motion or objects with overlapping motion fields. Coarse-to-fine methods are able to solve the problem of large motion to a certain extent (see for example, Szeliski and Coughlan, 1994; Szeliski and Shum, 1996), but the degree of sub-sampling required to make the motion differential places an upper bound on the maximum allowable motion between two frames and limits it to about 15% of the dimensions of the image (Irani and Anandan, 1999). Also, in cases where the order of objects along any line in the scene is reversed and their motion fields overlap, the coarse-to-fine processing ends up blurring the two motions into a single motion before optical flow can be calculated.

In this paper we are interested in the case of discrete motion, i.e. where optical flow based methods break down. Most closely related to our work is that of Torr et al. (1998). Torr uses sparse correspondences obtained by running a feature detector and matching them using normalized cross correlation. He then processes the correspondences in a RANSAC framework to sequentially cover the set of motions in the scene. Each iteration of his algorithm finds the dominant motion model that best explains the data and is simplest according to a complexity measure. The set of models and the associated correspondences are then used as the initial guess for the estimation of a mixture model using the Expectation Maximization (EM) algorithm. Spurious models are pruned and the resulting segmentation is smoothed using morphological operations.

In a more recent work (Torr et al., 1999), the authors extend the model to 3D layers in which points in the layer have an associated disparity. This allows for scenes in which the planarity assumption is violated and/or a significant amount of parallax is present. The pixel correspondences are found using a multiscale differential optical flow algorithm, from which the layers are estimated in a Bayesian framework using EM. Piecewise smoothness is ensured by using a Markov random field prior.


Neither of the above works demonstrates the ability to perform dense motion segmentation on a pair of images with large inter-frame motion. In both of the above works the grouping is performed in a Bayesian framework. While the formulation is optimal and strong results can be proved about the optimality of the Maximum Likelihood solution, actually solving for it is an extremely hard non-linear optimization problem. The use of EM only guarantees a locally optimal solution and says nothing about the quality of the solution. As the authors point out, the key to getting a good segmentation using their algorithm is to start with a good guess of the solution, and they devote a significant amount of effort to finding such a guess. However, it is not clear from their results how much the EM algorithm improves upon their initial solution.

A representative work addressing the case of non-planar motion is Torr et al. (1995), which shows how the trifocal tensor can be used to cluster groups of sparse correspondences that move coherently across three views. That work deals with similar types of sequences as those in our work, but it does not provide dense assignment to motion layers or dense optical flow. The paper states that it is an initialization and that more work is needed to provide a dense segmentation; however, the extension of dense stereo assignment to multiple independent motions is certainly non-trivial and there is yet to be a published solution. In addition, this approach is not applicable for objects with non-rigid motion, as the fundamental matrix and trifocal tensor apply only to rigid motion.

3. Proposed Method

3.1. Our Approach

Our approach is based on a two stage process, the first of which is responsible for motion field estimation and the second of which is responsible for motion layer assignment. As a preliminary step we detect interest points in the two images and match them by comparing filter responses. We then use a RANSAC based procedure for detecting the motion fields relating the frames. Based on the detected motion fields, the correspondences detected in the first stage are partitioned into groups corresponding to each constituent motion field and the resulting motion fields are re-estimated. Finally, we use a fast approximate graph cut based method to densely assign pixels to their respective motion fields. We now describe each of these steps in detail. A reference Matlab implementation of the steps described in this section is available for download at http://vision.ucsd.edu/motion_seg.html.

3.1.1. Interest Point Detection and Matching. Many pixels in real images are redundant, so it is beneficial to find a set of points that reduce some of this redundancy. To achieve this, we detect interest points using the Förstner operator (Förstner and Gülch, 1987). To describe each interest point, we apply a set of 76 filters (3 scales and 12 orientations with even and odd phase and an elongation ratio of 3:1, plus 4 spot filters) to each image. The filters, which are at most 31 × 31 pixels in size, are evenly spaced in orientation at intervals of 15° and the changes in scale are half octave. For each of the scales and orientations, we use a quadrature pair of derivative-of-Gaussian filters corresponding to edge and bar detectors respectively, as in Jones and Malik (1992) and Freeman and Adelson (1991).

To obtain some degree of rotational invariance, the filter response vectors may be reordered so that the order of orientations is cyclically shifted. This is equivalent to filtering a rotated version of the image patch that is within the support of the filter. We perform three such rotations in each direction to obtain rotational invariance up to ±45°.

We find correspondences by comparing filter response vectors using the L1 distance. We compare each interest point in the first image to those in the second image and assign correspondence between points with minimal error. Since matching is difficult for image pairs with large inter-frame disparity, the remainder of our approach must take into account that the estimated correspondences can be extremely noisy.
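As a concrete illustration, here is a minimal sketch of this matching step. The descriptor layout (the first 72 entries ordered by scale, orientation, and phase, with the 4 spot filters last) and all names are our assumptions, not the paper's reference implementation:

```python
import numpy as np

def match_descriptors(desc1, desc2, n_scales=3, n_orient=12, max_shift=3):
    # Precompute cyclically rotated copies of every descriptor in image 2;
    # each shift of the orientation axis approximates a 15-degree rotation.
    oriented = desc2[:, :72].reshape(-1, n_scales, n_orient, 2)
    spots = desc2[:, 72:]  # spot filters are rotation invariant
    rotations = [np.concatenate(
        [np.roll(oriented, s, axis=2).reshape(len(desc2), 72), spots], axis=1)
        for s in range(-max_shift, max_shift + 1)]  # up to +/- 45 degrees

    matches = []
    for i, d in enumerate(desc1):
        # L1 distance to every point in image 2, minimized over rotations
        dists = np.min([np.abs(r - d).sum(axis=1) for r in rotations], axis=0)
        matches.append((i, int(dists.argmin()), float(dists.min())))
    return matches
```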

3.1.2. Estimating Motion Fields. Robust estimation methods such as RANSAC (Fischler and Bolles, 1981) have been shown to provide very good results in the presence of noise when estimating a single, global transformation between images. Why can't we simply apply these methods to multiple motions directly? It turns out that this is not as straightforward as one might imagine. Methods in this vein work by iteratively repeating the estimation process where each time a dominant motion is detected, all correspondences that are deemed inliers for this motion are removed (Torr, 1998).

There are a number of issues that need to be addressed before RANSAC can be used for the purpose of detecting and estimating multiple motions.


Figure 2. Phantom motion fields. (Row 1) Scene that consists of two squares translating away from each other. (Row 2) Under an affine model, triplets of points that span the two squares will incorrectly propose a global stretching motion. This motion is likely to have many inliers since all points on the inner edges of the squares will fit this motion exactly. If we then delete all points that agree with this transformation, we will be unable to detect the true motions of the squares in the scene (Rows 3 & 4).

The first issue is that combinations of correspondences—not individual correspondences—are what promote a given transformation. Thus when "phantom motion fields" are present, i.e., transformations arising from the relative motion between two or more objects, it is possible that the deletion of correspondences could prevent the detection of the true independent motions (see Fig. 2). Our approach does not perform sequential deletion of correspondences and thus circumvents this problem.

Another consideration arises from the fact that the RANSAC estimation procedure is based on correspondences between interest points in the two images. This makes the procedure biased towards texture rich regions, which have a large number of interest points associated with them, and against small objects in the scene, which in turn have a small number of interest points. In the case where there is only one global transformation relating the two images, this bias does not pose a problem. However, it becomes apparent when searching for multiple independent motions. To correct for this bias we introduce "perturbed interest points" and a method for feature crowdedness compensation.

Figure 3. Perturbed Interest Points. Correspondences are represented by point-line pairs where the point specifies an interest point in the image and the line segment ends at the location of the corresponding point in the other image. (Row 1) We see one correct correspondence and one incorrect correspondence that is the result of an occlusion junction forming a white wedge. (Row 2) The points around the correct point have matches that are near the corresponding point, but the points around the incorrect correspondence do not.

Perturbed Interest Points: If an object is only represented by a small number of interest points, it is unlikely that many samples will fall entirely within the object. One approach for promoting the effect of correct correspondences without promoting that of the incorrect correspondences is to appeal to the idea of a stable system. According to the principle of perturbation, a stable system will remain at or near equilibrium even as it is slightly modified. The same holds true for stable matches. To take advantage of this principle, we dilate the interest points to be disks with a radius of r_p, where each pixel in the disk is added to the list of interest points. This allows the correct matches to get support from the points surrounding a given feature, while incorrect matches will tend to have almost random matches estimated for their immediate neighbors, which will not likely contribute to a widely-supported warp. In this way, while the density around a valid motion is increased, we do not see the same increase in the case of an invalid motion (see Fig. 3).

Feature Crowdedness: Textured regions often have significant representation in the set of interest points. This means that a highly textured object will have a much larger representation in the set of interest points than an object of the same size with less texture. To mitigate this effect, we bias the sampling. We calculate a measure of crowdedness for each interest point, and the probability of choosing a given point is inversely proportional to this crowdedness score. The crowdedness score is the number of interest points that fall into a disk of radius r_c. A sketch of both corrections appears below.
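Both corrections are simple to state in code. The sketch below is illustrative only: the default radii r_p and r_c are hypothetical (the paper does not state values in this passage), and the interface is our own:

```python
import numpy as np

def perturb_interest_points(points, r_p=2):
    """Dilate each interest point into a disk of radius r_p: every pixel
    inside the disk joins the interest point list."""
    offsets = [(dx, dy) for dx in range(-r_p, r_p + 1)
                        for dy in range(-r_p, r_p + 1)
                        if dx * dx + dy * dy <= r_p * r_p]
    return np.array([(x + dx, y + dy)
                     for (x, y) in points for (dx, dy) in offsets])

def sampling_weights(points, r_c=10):
    """Crowdedness score = number of interest points within a disk of
    radius r_c; sampling probability is inversely proportional to it."""
    diffs = points[:, None, :].astype(float) - points[None, :, :]
    crowdedness = ((diffs ** 2).sum(-1) <= r_c * r_c).sum(axis=1)
    w = 1.0 / crowdedness
    return w / w.sum()
```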


Partitioning and Motion Estimation: Having perturbed the interest points and established a sampling distribution on them, we are now in a position to detect the motions present in the frames. We do so using a two step variant of RANSAC, where multiple independent motions are explicitly handled, as duplicate transformations are detected and pruned in a greedy manner. The first step provides a rough partitioning of the set of correspondences (motion identification) and the second takes this partitioning and estimates the motion of each group (motion refinement).

First, a set of planar warps is estimated by a round of standard RANSAC and inlier counts (using an inlier threshold of τ) are recorded for each transformation. In our case, we use a planar homography, which requires 4 correspondences to estimate; however, a similarity or affine transform may be used (requiring 2 and 3 correspondences, respectively). The estimated list of transformations is then sorted by inlier count and we keep the first n_t transformations, where n_t is some large number (e.g. 300).

We expect that the motions in the scene will likely be detected multiple times, and we would like to detect these duplicate transformations. Comparing transformations in the space of parameters is difficult for all but the simplest of transformations, so we compare transformations by comparing the set of inliers associated with each transformation. If there is a large overlap in the sets of inliers (more than 75%), the transformation with the larger set of inliers is kept and the other is pruned.

Now that we have our partitioning of the set of correspondences, we would like to estimate the planar motion represented in each group. This is done with a second round of RANSAC on each group with only 100 iterations. This round has a tighter threshold to find a better estimate. We then prune duplicate warps a second time to account for slightly different inlier sets that converged to the same transformation during the second round of RANSAC with the tighter threshold. The whole procedure is sketched below.
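The following sketch outlines the identification round and the duplicate pruning; fit_homography and warp_error are assumed helpers, the overlap denominator is our interpretation of the 75% criterion, and the refinement round is indicated in a comment rather than implemented:

```python
import numpy as np

def detect_motions(pts1, pts2, weights, fit_homography, warp_error,
                   n_iter=5000, tau=3.0, n_t=300, overlap=0.75):
    candidates = []
    for _ in range(n_iter):
        # crowdedness-biased sample of a minimal subset (4 for a homography)
        idx = np.random.choice(len(pts1), 4, replace=False, p=weights)
        H = fit_homography(pts1[idx], pts2[idx])
        if H is None:                      # degenerate sample
            continue
        inliers = warp_error(H, pts1, pts2) < tau
        candidates.append((H, inliers))

    # Sort by inlier count and keep the best n_t transformations.
    candidates.sort(key=lambda c: -int(c[1].sum()))
    candidates = candidates[:n_t]

    # Greedy pruning of duplicates: warps whose inlier sets overlap by
    # more than 75% are considered the same motion.
    kept = []
    for H, inl in candidates:
        if not any((inl & k).sum() > overlap * min(inl.sum(), k.sum())
                   for _, k in kept):
            kept.append((H, inl))

    # A second, tighter RANSAC round (100 iterations per group) would
    # refine each kept motion, followed by one more duplicate pruning.
    return kept
```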

The result of this stage is a set of proposed transformations, and we are now faced with the problem of assigning each pixel to a candidate motion field.

3.1.3. Layer Assignment. The problem of assigning each pixel to a candidate motion field can be formulated as finding a function l : I → {1, . . . , m}, that maps each pixel to an integer in the range 1, . . . , m, where m is the total number of motion fields, such that the reconstruction error

$$\sum_i \left[ I(i) - I'(M(l(i), i)) \right]^2$$

is minimized. Here M(p, q) returns the position of pixel q under the influence of the motion field p.

Figure 4. Example of naïve pixel assignment as in Eq. (1) for the second motion layer in Fig. 6. Notice there are many pixels that are erratically assigned. This is why smoothing is needed.

A naïve approach to solving this problem is to use a greedy algorithm that assigns each pixel the motion field for which it has the least reconstruction error, i.e.,

$$l(i) = \arg\min_{1 \le p \le m} \left[ I(i) - I'(M(p, i)) \right]^2 \qquad (1)$$

The biggest disadvantage of this method, as can be seen in Fig. 4, is that for flat regions it can produce unstable labellings, in that neighboring pixels that have the same brightness and are part of the same moving object can get assigned to different warps. What we would like instead is to have a labelling that is piecewise constant with the occasional discontinuity to account for genuine changes in motion fields.
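A minimal sketch of this greedy rule, assuming each warp is available as a function mapping pixel coordinates in I to coordinates in I′ (an interface of our own choosing):

```python
import numpy as np

def greedy_assignment(I, I_prime, warps):
    h, w = I.shape
    ys, xs = np.mgrid[0:h, 0:w]
    errors = []
    for M in warps:
        xm, ym = M(xs, ys)                             # warped coordinates
        xm = np.clip(np.rint(xm), 0, w - 1).astype(int)
        ym = np.clip(np.rint(ym), 0, h - 1).astype(int)
        errors.append((I - I_prime[ym, xm]) ** 2)      # reconstruction error
    return np.argmin(errors, axis=0)                   # label image l(i)
```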

The most common way this problem is solved (see e.g., Weiss, 1997) is by imposing a smoothness prior over the set of solutions, i.e., an ordering that prefers piecewise constant labellings over highly unstable ones. It is important that the prior be sensitive to true discontinuities present in the image. Boykov et al. (1999) have shown that discontinuity preserving smoothing can be performed by adding a penalty of the following form to the objective function

$$\sum_i \sum_{j \in N(i)} s_{ij} \left[ 1 - \delta_{l(i)l(j)} \right]$$

where δ·· is the Kronecker delta, equal to 1 when its arguments are equal. Given a measure of similarity s_ij between pixels i and j, it penalizes pixel pairs that have been assigned different labels. The penalty should only


be applicable for pixels that are near each other. Hence the second sum is over a fixed neighborhood N(i). The final objective function we minimize is

$$\sum_i \left[ I(i) - I'(M(l(i), i)) \right]^2 + \lambda \sum_i \sum_{j \in N(i)} s_{ij} \left[ 1 - \delta_{l(i)l(j)} \right]$$

where λ is the tradeoff between the data and the smoothness prior.

An optimization problem of this form is known as a Generalized Potts model, which in turn is a special case of a class of problems known as metric labelling problems. Kleinberg and Tardos demonstrate that the metric labelling problem corresponds to finding the maximum a posteriori labelling of a class of Markov random fields (Kleinberg and Tardos, 1999). The problem is known to be NP-complete, and the best one can hope for in polynomial time is an approximation.

Recently, Boykov et al. have developed a polynomial time algorithm that finds a solution with error at most two times that of the optimal solution (Boykov et al., 2001). Each iteration of the algorithm constructs a graph and finds a new labelling of the pixels corresponding to the minimum cut partition in the graph. The algorithm is deterministic and guaranteed to terminate in O(m) iterations.

Besides the motion fields and the image pair, the algorithm takes as input a similarity measure s_ij between every pair of pixels i, j within a fixed distance of one another and two parameters: k, the size of the neighborhood around each pixel, and λ, the tradeoff between the data and the smoothness term. We use a Gaussian weighted measure of the squared difference between the intensities of pixels i and j,

$$s_{ij} = \exp\left[ -\frac{d(i, j)^2}{2k^2} - (I(i) - I(j))^2 \right]$$

where d(i, j) is the distance between pixel i and pixel j.

We run the BVZ algorithm twice, once to assign the pixels in the image I to the forward motion fields and again to assign the pixels in image I′ to the inverse motion fields relating I′ and I. If a point in the scene occurs in both frames, we expect that its position and appearance will be related as:

$$M(l(p), p) = p'$$
$$M(l'(p'), p') = p$$
$$I(p) = I'(M(l(p), p))$$

Here, the unprimed symbols refer to image I and the primed symbols refer to image I′. Assuming that the appearance of the object remains the same across the images, the final assignment is obtained by intersecting the forward and backward assignments.

In this simple intersection step, occluded pixels are removed from further consideration. By reasoning about occlusion ordering constraints over more than two frames, one can retain and explicitly label occluded pixels in the output segmentation; see for example the recent work of Xiao and Shah (2004).
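A sketch of the intersection step under the same assumed interfaces, where inconsistent (typically occluded) pixels receive the label -1 and drop out:

```python
import numpy as np

def intersect_assignments(labels_fwd, labels_bwd, warps_fwd):
    h, w = labels_fwd.shape
    ys, xs = np.mgrid[0:h, 0:w]
    final = np.full_like(labels_fwd, -1)
    for p, M in enumerate(warps_fwd):
        xm, ym = M(xs, ys)
        inside = (xm >= 0) & (xm < w) & (ym >= 0) & (ym < h)
        xi = np.clip(np.rint(xm), 0, w - 1).astype(int)
        yi = np.clip(np.rint(ym), 0, h - 1).astype(int)
        # keep the pixel only if its correspondence carries the same layer
        ok = (labels_fwd == p) & inside & (labels_bwd[yi, xi] == p)
        final[ok] = p
    return final
```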

4. Experimental Results

We now illustrate our algorithm, which is summarized in Fig. 5, on several pairs of images containing objects undergoing independent motions. We performed all of the experiments on grayscale images with the same parameters.1

Our first example is shown in Fig. 6. In this figure we show the two images, I and I′, and the assignments of each pixel to a motion layer (one of the three detected motion fields). The rows represent the different motion fields and the columns represent the portions of each image that are assigned to a given motion layer. The motions are made explicit in that the pixel support from frame to frame is related exactly by a planar homography. Notice that the portions of the background and the dumpsters that were visible in both frames were segmented correctly, as was the man. This example shows that in the presence of occlusion and when visual correspondence is difficult (i.e. matching the dumpsters correctly), our method provides good segmentation. Another thing to note is that the motion of the man is only approximately planar.

Figure 7 shows a scene consisting of two fish swimming past a fairly complicated reef scene. The segmentation is shown as in Fig. 6 and we see that three motions were detected, one for the background and one for each of the two fish. In this scene, the fish are small, feature-impoverished objects in front of a large feature-rich background, thus making the identification of the motion of the fish difficult. In fact, when this example was run without using the perturbed interest points, we were unable to recover the motion of either of the fish.

Figure 5. Algorithm summary.

Figure 6. Notting Hill sequence. (Row 1) Original image pair of size 311 × 552, (Rows 2–4) Pixels assigned to warp layers 1–3 in I and I′.

Figure 8 shows two frames from a sequence that has been a benchmark for motion segmentation approaches for some time. Previously, only optical flow-based techniques were able to get good motion segmentation results for this scene; however, producing a segmentation of the motion between the two frames shown (1 and 30) would require using all (or at least most) of the intermediate frames. Here the only input to the system was the frame pair shown in Row 1. Notice that the portions of the house and the garden that were visible in both frames were segmented accurately, as was the tree. This example shows the discriminative power of our filterbank, as we were unable to detect the motion field correctly using correspondences found using the standard technique of normalized cross correlation. In addition, this example demonstrates the importance of the perturbed interest points and the sampling based on feature crowdedness, as the correct motions were not detected when either of the two techniques was not used.

Figure 7. Fish sequence. (Row 1) Original image pair of size 162 × 319, (Rows 2–4) Pixels assigned to warp layers 1–3 in I and I′.

Figure 8. Flower Garden sequence. (Row 1) Original image pair of size 240 × 360, (Rows 2–4) Pixels assigned to warp layers 1–3 in I and I′.

Figure 9. VW sequence. (Row 1) Original image pair of size 240 × 320, (Rows 2–3) Pixels assigned to warp layers 1–2 in I and I′.

In Fig. 9, a moving car passes behind a tree as the camera pans. Here, only two motion layers were recovered and they correspond to the static background and to the car. Since a camera rotating around its optical center produces no parallax for a static scene, the tree is in the same motion layer as the fence in the background, whereas the motion of the car requires its own layer. The slight rotation in depth of the car does not present a problem here.

5. Extension to Non-Planar Motion

Consider the image pairs illustrated in Fig. 10. These have a significant component of planar motion but exhibit residual error with respect to a planar fit because of either the non-planarity of the object (e.g. a cube) or the non-rigidity of the motion (e.g. a lizard). These are scenes for which the motion can be approximately described by a planar layer-based framework, i.e. scenes that have "shallow structure" (Sawhney and Hanson, 1993). In order to extend our approach to such scenes, we propose an additional stage consisting of a regularized spline model for capturing finer scale variations on top of an approximate planar fit. Our approach is related in spirit to the deformotion concept in Soatto and Yezzi (2002), developed for the case of differential motion, which separates overall motion (a finite dimensional group action) from the more general deformation (a diffeomorphism).

Figure 10. Non-planarity vs. non-rigidity: The left image pair shows a non-planar object undergoing 3D rigid motion; the right pair shows an approximately planar object undergoing non-rigid motion. Both examples result in residual error with respect to a 2D planar fit.

It is important to remember that optical flow does not model the 3D motion of objects, but rather the changes in the image that result from this motion. Without the assumption of a rigid object, it is very difficult to estimate the 3D structure and motion of an object from observed change in the image, though there is recent work that attempts to do this (Brand, 2001; Brand and Bhotika, 2001; Torresani et al., 2001, 2003; Torresani and Hertzmann, 2004; Vidal and Ma, 2004; Xiao et al., 2004). For this reason, we choose to do all estimation in the image plane (i.e. we use 2D models), but we show that if the object is assumed to be rigid, the correspondences estimated can be used to recover the dense structure and 3D motion.

5.1. Our Approach for Non-planar Motion

When the scene contains objects undergoing significant 3D motion or deformation, the optical flow cannot be described by any single low dimensional image plane transformation (e.g., affine or homography). However, to keep the problem tractable we need a compact representation of these transformations; we propose the use of thin plate splines for this purpose. A single spline is not sufficient for representing multiple independent motions, especially when the motion vectors intersect (Weiss, 1997). Therefore we represent the optical flow between two frames as a set of disjoint splines. By disjoint we mean that the supports of the splines are disjoint subsets of the image plane. The task of fitting a mixture of splines naturally decomposes into two subtasks: motion segmentation and spline fitting.

Figure 11. Determining Long Range Optical Flow. The goal is to provide dense optical flow from the first frame (1) to the second (4). This is done via a planar fit (2) followed by a flexible fit (3).

Ideally we would like to do both of these tasks simultaneously; however, these tasks have conflicting goals. The task of motion segmentation requires us to identify groups of pixels whose motion can be described by a smooth transformation. Smoothness implies that each motion segment has the same gross motion; however, except for the rare case in which the entire layer has exactly the same motion everywhere, there will be local variations. Hence the motion segmentation algorithm should be sensitive to inter-layer motion and insensitive to intra-layer variations. On the other hand, fitting a spline to each motion field requires attention to all the local variations. This is an example of different tradeoffs between bias and variance in the two stages of the algorithm. In the first stage we would like to exert a high bias and use models with a high amount of stiffness and insensitivity to local variations, whereas in the second stage we would like to use a more flexible model with a low bias.

We begin with the motion segmentation procedure of Section 3. The output of this stage, while sufficient to achieve a good segmentation, is not sufficient to recover the optical flow accurately. However, it serves two important purposes: firstly, it provides an approximate segmentation of the sparse correspondences that allows coherent groups to be processed separately. This is crucial for the second stage of the algorithm, as a flexible model will likely find an unwieldy compromise between distinct moving groups as well as outliers. Secondly, since the assignment is dense, it is possible to find matches for points that were initially mismatched by limiting the correspondence search space to points in the same motion layer. The second stage then bootstraps off of these estimates of motion and layer support to iteratively fit a thin plate spline to account for non-planarity or non-rigidity in the motion. Figure 11 illustrates this process.

However, since splines form a family of universal approximators over R² and can represent any 2D transformation to any desired degree of accuracy, it raises the question as to why one needs to use two different motion models in the two stages of the algorithm. If one were to use the affine transform as the dominant motion model, splines with an infinite or very large degree of regularization can indeed be used in its place. However, in the case where the dominant planar motion is not captured by an affine transform and we need to use a homography, it is not practical to use a spline. This is so because the set of homographies over any connected region of R² is unbounded, and can in principle require a spline with an unbounded number of knots to represent an arbitrary homography. So while a homography can be estimated using a set of four correspondences, the corresponding spline approximation can, in principle, require an arbitrarily large number of control points. This poses a serious problem for robust estimation procedures like RANSAC, since the probability of hitting the correct model decreases exponentially with increasing degrees of freedom.

Many previous approaches for capturing long range motion are based on the fundamental matrix. However, since the fundamental matrix maps points to lines, translations in a single direction with varying velocity and sign are completely indistinguishable, as pointed out, e.g., by Torr et al. (1995). This type of motion is observed frequently in motion sequences. The trifocal tensor does not have this problem; however, like the fundamental matrix, it is only applicable for scenes with rigid motion and there is not yet a published solution for dense stereo correspondence in the presence of multiple motions.

We now describe the refinement stage of the algorithm in detail.

5.1.1. Refining the Fit with a Flexible Model. The flexible fit is an iterative process using regularized radial basis functions, in this case the Thin Plate Spline (TPS). The spline interpolates the correspondences to produce a dense optical flow field. This process is run on a per-motion-layer basis.

Feature Extraction and Matching: During the planar motion estimation stage, only a gross estimate of the motion is required, so a sparse set of feature points will suffice. In the final fit, however, we would like to use as many correspondences as possible to ensure a good fit. In addition, since the correspondence search space is reduced (i.e. matches are only considered between pixels assigned to corresponding motion layers), matching becomes somewhat simpler. For this reason, we use the Canny edge detector to find the set of edge points in each of the frames and estimate correspondences in the same manner as in Section 3.

Iterated TPS Fitting: Given the approximate planar homography and the set of correspondences between edge pixels, we would like to find the dense set of correspondences. If all of the correspondences were correct, we could jump straight to a smoothed spline fit to obtain dense (interpolated) correspondences for the whole region. However, we must account for the fact that many of the correspondences are incorrect. As such, the purpose of the iterative matching is essentially to distinguish inliers from outliers; that is, we would like to identify sets of points that exhibit coherence in their correspondences.

One of the assumptions that we make about the scenes we wish to consider is that the motion of the scene can be approximated by a set of planar layers. Therefore a good initial set of inliers are those correspondences that are roughly approximated by the estimated homography. From this set, we use TPS regression with increasingly tighter inlier thresholds to identify the final set of inliers, for which a final fit is used to interpolate the dense optical flow. We now briefly describe this process.

Thin Plate Splines are a family of approximating splines defined over R^d. The theory for Thin Plate Splines was first developed by Duchon (1976, 1977) and subsequently by Meinguet (1979). Our presentation here follows Wahba (1990).

Our task here is to construct smoothly varying functions that map pixel positions in one image to pixel positions in another. We use two splines, one each for the x and y mappings. Let {(x_1, y_1), . . . , (x_n, y_n)} be the positions of the points in the first image. Let the target for the first spline be given by v_i and let f denote the transformation that we are trying to estimate. In two dimensions the smoothness penalty for Thin Plate Splines is given by

$$J_2 = \int\!\!\int_{\mathbb{R}^2} \left( f_{xx}^2 + 2 f_{xy}^2 + f_{yy}^2 \right) dx\, dy$$

J_2 is also known as the bending energy. The minimization is performed over the space of functions χ whose partial derivatives of total order 2 are in L²(R²), i.e., the integral of the square of every partial derivative of order 2 over R² is bounded. Meinguet (1979) provides a detailed description of this space. The functional J_2(f) defines a semi-norm over χ.

The smoothing thin-plate spline is then defined to be the solution to the following variational problem:

$$\arg\min_{f \in \chi} \left[ \frac{1}{n} \sum_i^n \left( v_i - f(x_i, y_i) \right)^2 + \mu \int\!\!\int_{\mathbb{R}^2} \left( f_{xx}^2 + 2 f_{xy}^2 + f_{yy}^2 \right) dx\, dy \right] \qquad (2)$$

where the scalar μ is the tradeoff between fitting the target values v_i and the smoothness of the function f.

The null space of the penalty functional is a three dimensional space consisting of polynomials of degree less than or equal to one, i.e., the space of all functions

$$ax + by + c, \quad a, b, c \in \mathbb{R}$$

spanned by the basis functions

$$\phi_1(x, y) = 1, \quad \phi_2(x, y) = x, \quad \phi_3(x, y) = y$$

Duchon (1976) showed that if the points {(x_1, y_1), . . . , (x_n, y_n)} are such that the least squares regression on φ_1, φ_2, φ_3 is unique, then the variational problem above has a unique solution f_μ, given by

$$f_\mu(x, y) = \sum_i^n w_i G_i(x, y) + ax + by + c \qquad (3)$$

Here G(x, y) is the Green's function of the twice iterated Laplacian. It is also known as the fundamental solution to the biharmonic equation

$$\Delta^2 G = 0$$

where

$$G_i(r) = r^2 \log r, \qquad r^2 = (x - x_i)^2 + (y - y_i)^2$$

Thus the calculation of the spline fit requires the estimation of the parameters w_i and a, b, c.

Now let K be an n × n matrix with entries given by

$$K_{ij} = G_i(x_j, y_j)$$

and let T be an n × 3 matrix with rows given by

$$T_i = [\, x_i \;\; y_i \;\; 1 \,].$$

Also let d be the 3-vector

$$d = [\, a \;\; b \;\; c \,]^\top.$$

Then it can be shown that the optimal value for the coefficient vector w = [w_i] is given by the solution to the matrix equations (Wahba, 1990; Powell, 1995; Bookstein, 1989)

$$(K + n\mu I)\, w + T d = v \qquad (4)$$
$$T^\top w = 0 \qquad (5)$$

where v is the vector of target values. This is a simple linear system that can be solved using matrix inversion. For the case of μ = 0 we obtain the interpolating Thin Plate Spline.
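A minimal numerical sketch of this solve for one spline; the function names and the default μ are our own, and we solve the bordered system directly rather than inverting explicitly:

```python
import numpy as np

def fit_tps(points, targets, mu=1e-3):
    """Solve Eqs. (4)-(5) as one bordered linear system:
    [[K + n*mu*I, T], [T^T, 0]] [w; d] = [v; 0]."""
    n = len(points)
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    K = np.where(d2 > 0, 0.5 * d2 * np.log(d2 + 1e-20), 0.0)  # r^2 log r
    T = np.hstack([points, np.ones((n, 1))])
    A = np.block([[K + n * mu * np.eye(n), T],
                  [T.T, np.zeros((3, 3))]])
    b = np.concatenate([targets, np.zeros(3)])
    sol = np.linalg.solve(A, b)
    return sol[:n], sol[n:]                  # w and (a, b, c)

def tps_eval(points, w, abc, query):
    """Evaluate the fitted spline of Eq. (3) at the query locations."""
    d2 = ((query[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    G = np.where(d2 > 0, 0.5 * d2 * np.log(d2 + 1e-20), 0.0)
    return G @ w + query @ abc[:2] + abc[2]
```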

The complexity of matrix inversion scales as O(n³) in the number of rows. Thus, as the number of points that we are fitting goes up, it is not practical to use these methods on the full dataset. In our experiments we take a naïve subsampling based approach. Out of the 1200 points that we are required to fit, we randomly subsampled 500 points and used them as the landmarks for the spline fitting procedure. We observe that this simple approach works well in practice. A number of researchers have explored more sophisticated and computationally attractive approaches to the problem (Girosi et al., 1995; Smola and Scholkopf, 2000; Weiss, 1997; Donato and Belongie, 2002). Any of these can be used as a replacement for our spline estimation procedure.

We estimate the TPS mapping from the points in the first frame to those in the second, where μ_t is the regularization factor for iteration t. The fit is estimated using the set of correspondences that are deemed inliers for the current transformation, where τ_t is the threshold for the t-th iteration. After the transformation is estimated, it is applied to the entire edge set and the set of correspondences is again processed for inliers, using the new locations of the points for error computation. This means that some correspondences that were outliers before may be pulled into the set of inliers and vice versa. The iteration continues on this new set of inliers, where τ_{t+1} ≤ τ_t and μ_{t+1} ≤ μ_t. We have found that three iterations of this TPS regression with incrementally decreasing regularization and corresponding outlier thresholds suffice for a large set of real world examples. Additional iterations produced no change in the estimated set of inlier correspondences.

This simultaneous tightening of the pruning threshold and annealing of the regularization factor aid in differentiating between residual due to localization error or mismatching and residual due to the non-planarity of the object in motion. When the pruning threshold is loose, it is likely that there will be some incorrect correspondences that pass the threshold. This means that the spline should be stiff enough to avoid the adverse effect of these mismatches. However, as the mapping converges, we place higher confidence in the set of correspondences passing the tighter thresholds. This process is similar in spirit to iterative deformable shape matching methods (Belongie et al., 2002; Chui and Rangarajan, 2000).
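The loop below sketches the three annealed rounds, reusing fit_tps and tps_eval from the previous sketch; the threshold and regularization schedules are illustrative assumptions, not the paper's values:

```python
import numpy as np

def iterated_tps(points, matches, apply_homography,
                 taus=(10.0, 5.0, 2.5), mus=(1e-2, 1e-3, 1e-4)):
    """Three annealed rounds: select inliers against the current map,
    refit a progressively looser TPS, then re-score the whole edge set."""
    pred = apply_homography(points)          # initial planar approximation
    inl = np.ones(len(points), dtype=bool)
    for tau, mu in zip(taus, mus):
        resid = np.linalg.norm(pred - matches, axis=1)
        inl = resid < tau                    # outliers may re-enter later
        wx, ax = fit_tps(points[inl], matches[inl, 0], mu)
        wy, ay = fit_tps(points[inl], matches[inl, 1], mu)
        # re-apply the fit to the entire edge set for the next round
        pred = np.stack([tps_eval(points[inl], wx, ax, points),
                         tps_eval(points[inl], wy, ay, points)], axis=1)
    return pred, inl
```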

5.2. Experimental Results

We now illustrate our algorithm, which is summarized in Fig. 12, on several pairs of images containing objects undergoing significant, non-planar motion. Since the motion is large, displaying the optical flow as a vector field will result in a very confusing figure. Because of this, we show the quality of the optical flow in other ways, including (1) examining the image and corresponding reconstruction error that result from the application of the estimated transform to the original image (we refer to this transformed image as T(I)), (2) showing intermediate views (as in Seitz and Dyer, 1996), or (3) showing the 3D reconstruction induced by the set of dense correspondences. Examples are presented that exhibit either non-planarity, non-rigidity, or a combination of the two. We show that our algorithm is capable of providing optical flow for pairs of images that are beyond the scope of existing algorithms. We performed all of the experiments on grayscale images using the same parameters.2

Figure 12. Algorithm summary.

Figure 13. Face Sequence. (Row 1) The two input images, I and I′, of size 240 × 320. (Row 2) The difference image is shown first, where grey regions indicate zero error regions, and the reconstruction T(I) is second. (Row 3) The initial segmentation found via planar motion.

Face Sequence: The first example is shown in Figs. 13 and 14. The top row of Fig. 13 shows the two input frames, I and I′, in which a man moves his head to the left in front of a static scene (the nose moves more than 10% of the image width). The second row shows first the difference image between T(I) and I′, where error values are on the interval [−1, 1] and gray regions indicate areas of zero error. This image is followed by T(I); this image has two estimated transformations, one for the face and another for the background. Notice that the error in the overlap of the faces is very small, which means that according to reconstruction error, the estimated transformation successfully fits the relation between the two frames. This transformation is non-trivial, as seen in the change in the nose and lips as well as a shift in gaze seen in the eyes; however, all of this is captured by the estimated optical flow. The final row in Fig. 13 shows the segmentation and planar approximation from Section 3, where the planar transformation is made explicit as the regions' pixel supports are related exactly by a planar homography. Dense correspondences allow for the estimation of intermediate views via interpolation as in Seitz and Dyer (1996). Figure 14 shows the two original views of the segment associated with the face as well as a synthesized intermediate view that is realistic in appearance. The second row of this figure shows an estimation of relative depth that comes from the disparity along the rectified horizontal axis. Notice the shape of the nose and lips as well as the relation of the eyes to the nose and forehead. It is important to remember that no information specific to human faces was provided to the algorithm for this optical flow estimation.

Figure 14. Face Sequence—Interpolated views. (Row 1) Original frame I′, synthesized intermediate frame, original frame I. (Row 2) A surface approximation from computed dense correspondences.

Figure 15. Notting Hill. Detail of the spline fit for a layer from Fig. 6: difference image for the planar fit, difference image for the spline fit, grid transformation.

Figure 16. Gecko Sequence. (Row 1) Original frame I of size 102 × 236, synthesized intermediate view, original frame I′. (Row 2) T(I), difference image between the above image and I′ (gray is zero error), difference image for the planar fit.

Notting Hill Sequence: The next example shows how the spline can also refine what is already a close approximation via planar models. Figure 15 shows a close-up of the planar error image, the reconstruction error, and finally the warped grid for the scene that was shown in Fig. 6. The planar approximation was not able to capture the 3D nature of the clothing and the non-rigid motion of the head with respect to the torso; however, the spline fit captures these accurately.

Gecko Sequence: The second example, shown in Fig. 16, displays a combination of a non-planar object (a gecko lizard) undergoing non-rigid motion. While this is a single object sequence, it shows the flexibility of our method to handle complicated motions. In Fig. 16 (Row 1), the two original frames are shown as well as a synthesized intermediate view (here, intermediate refers to time rather than viewing direction since we are dealing with non-rigid motion). The synthesized image is a reasonable guess at what the scene would look like midway between the two input frames. Figure 16 (Row 2) shows T(I) as well as the reconstruction error for the spline fit (T(I) − I′), and the error incurred with the planar fit. We see in the second row that the tail, back and head of the gecko are aligned very well and those areas have negligible error. When we compare the reconstruction error to the error induced by a planar fit, we see that the motion of the gecko is not well approximated by a rigid plane. Here, there is also some 3D motion present in that the head of the lizard changes in both direction and elevation. This is captured by the estimated optical flow.

Rubik's Cube: The next example shows a scene with rigid motion of a non-planar object. Figure 17 displays a Rubik's cube and a user's manual switching places as the cube rotates in 3D. Below the frames, we see the segmentation that is a result of the planar approximation. As can be seen, the segmentation contains large chunks of the background along with the Rubik's Cube. While it is indeed desirable that the only pixels we segment are those belonging to the Rubik's Cube, we must note that the background lacks any distinguishing features, making its motion truly ambiguous. Hence, without additional knowledge about the objects in the scene, any prior that we place on the scene while segmenting it will be the cause of some mistakes. In our MRF-based segmentation scheme we make the assumption that the layers are spatially contiguous; this, coupled with the motion ambiguity mentioned earlier, results in some portion of the background being interpreted as belonging to the same layer as the Rubik's cube. Figure 18 shows T(I), the result of the spline fit applied to this same scene. The first row shows a detail of the two original views of the Rubik's cube as well as a synthesized intermediate view. Notice that the rotation in 3D is accurately captured and demonstrated in this intermediate view. The second row shows the reconstruction errors, first for the planar fit and then for the spline fit, followed by T(I). Notice how accurate the correspondence is since the spline applied to the first image is almost identical to the second frame.

Figure 17. Rubik's Cube. (Row 1) Original image pair of size 300 × 400, (Row 2) assignments of each image to layers 1 and 2.

Figure 18. Rubik's Cube—Detail. (Row 1) Original frame I, synthesized intermediate frame, original frame I′, a synthesized novel view. (Row 2) Difference image for the planar fit, difference image for the spline fit, T(I), the estimated structure shown for the edge points of I. We used dense 3D structure to produce the novel view.

Correspondences between portions of two frames that are assumed to be projections of rigid objects in motion allow for the recovery of the structure of the object, at least up to a projective transformation. In Tomasi and Kanade (1991), the authors show a sparse point-set from a novel viewpoint and compare it to a real image from the same viewpoint to show the accuracy of the structure. Figure 18 shows a similar result; however, since our correspondences are dense, we can actually render the novel view that validates our structure estimation. The novel viewpoint is well above the observed viewpoints, yet the rendering as well as the displayed structure is fairly accurate. Note that only the set of points that were identified as edges in I are shown; this is not the result of simple edge detection on the rendered view. We use this display convention because the entire point-set is too dense to allow the perception of structure from a printed image. However, the rendered image shows that our estimated structure was very dense. It is important to note that the only assumption that we made about the object is that it is a rigid, piecewise smooth object. To achieve similar results from sparse correspondences would require additional object knowledge, namely that the object in question is a cube and has planar faces. It is also important to point out that this is not a standard stereo pair, since the scene contains multiple objects undergoing independent motion.

6. Applications

6.1. Automatic Object Removal

We demonstrate an application of our algorithm to the problem of video object deletion in the spirit of Irani and Peleg (1993), and Wang and Adelson (1993) (see Fig. 19). The idea of using motion segmentation information to fill in occluded regions is not new; however, previous approaches require a high frame rate to ensure that inter-frame disparities are small enough for differential optical flow to work properly. Here the inter-frame disparities are as much as a third of the image width.

6.2. Structure from Periodic Motion

The periodicity of moving objects such as pedestrians has been widely recognized as a cue for salient object detection in the context of tracking and surveillance (see for example, Cutler and Davis, 2000; Liu et al., 2002). In addition, it can be used for 3D reconstruction. The key idea is very simple. Given a monocular video sequence of a periodically moving object, any set of period-separated frames represents a collection of snapshots of a particular pose of the moving object from a variety of viewpoints. This is illustrated in Fig. 20. Thus each complete period in time yields one view of each pose assumed by the moving object, and by finding correspondences in frames across neighboring periods in time, one can apply standard techniques of multiview geometry, with the caveat that in practice such periodicity is only approximate.

Figure 19. Illustration of video object deletion. (1) Original frames of size 180 × 240. (2) Segmented layer corresponding to the motion of the hand. (3) Reconstruction without the hand layer using the recovered motion of the keyboard. Note that no additional frames beyond the three shown were used as input.

One consequence of this matching between phase-separated frames is that the motion of the object between the two frames can be quite large. This is where we can apply our motion segmentation technique to segment out the object in question before attempting the reconstruction. In Fig. 21, we show the results from the segmentation for two frames.

Figure 22(c) shows the sparse 3D structure recovered for the T₀-separated frames of a walking person shown in Fig. 22(a, b). To get this reconstruction, we estimated the epipolar geometry between the two frames, performed bundle adjustment to improve the estimate, and computed depth as described in Hartley and Zisserman (2000). A detail of the head and left shoulder region is shown in Fig. 22(d) from a viewpoint behind the person and slightly to the left. Here we can see that the qualitative shape of the head relative to the sleeve region is reasonable.
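The first and last of these steps can be sketched compactly with OpenCV; the fragment below estimates the fundamental matrix with RANSAC and linearly triangulates a projective reconstruction from the canonical camera pair of Hartley and Zisserman (2000). The bundle adjustment stage is omitted, and the threshold values are illustrative rather than the ones used in our experiments.

import numpy as np
import cv2

def projective_reconstruction(pts1, pts2):
    # pts1, pts2: (N, 2) matched image points in the two frames.
    # Robustly estimate the fundamental matrix with RANSAC.
    F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.999)
    inl = mask.ravel().astype(bool)
    # Left epipole e' is the null vector of F^T (i.e., e'^T F = 0).
    _, _, Vt = np.linalg.svd(F.T)
    e2 = Vt[-1]
    # Canonical camera pair: P1 = [I | 0], P2 = [[e']_x F | e'].
    skew = np.array([[0, -e2[2], e2[1]],
                     [e2[2], 0, -e2[0]],
                     [-e2[1], e2[0], 0]])
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = np.hstack([skew @ F, e2.reshape(3, 1)])
    # Linear triangulation of the inlier correspondences; the result is
    # a (4, N) set of homogeneous points, defined up to a projective
    # transformation of 3-space.
    X = cv2.triangulatePoints(P1, P2, pts1[inl].T, pts2[inl].T)
    return inl, X / X[3]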

The set of points used here consists of (i) the Förstner interest points used to estimate the fundamental matrix and (ii) the neighboring Canny edges with correspondences consistent with the epipolar geometry. Many points appear around the creases in the clothing, but this leaves several blank patches around the lower shirt and the arm.


Figure 20. Illustration of periodic motion for a walking person. Equally spaced frames from one second of footage are shown. The pose of the person is approximately the same in the first and last frames, but the position relative to the camera is different. Thus this pair of frames can be treated approximately as a stereo pair for purposes of 3D structure estimation. Note that while the folds in the clothing change over time, their temporal periodicity makes them rich features for correspondence recovery across periods.

Figure 21. (a, b) Segmentation of the region of interest for T₀-separated input frames.

7. Discussion

In this paper we have presented a new method for performing dense motion segmentation and estimation in the presence of large inter-frame motion.

Like any system, ours is limited by the assumptions it makes. We make three assumptions about the scenes:

1. Identifiability
2. Constant appearance
3. Dominant planar motion.

A system is identifiable if its internal parameters can be estimated given the data. In the case of motion segmentation, this implies that given a pair of images it is possible to recover the underlying motion. The minimal requirement under our chosen motion model is that each object present in the two scenes should be uniquely identifiable. Consider Fig. 23; in this display, several motions can relate the two frames, and unless we make additional assumptions about the underlying problem, it is ill-posed. Similarly, in some of the examples we can see that while the segments closely match the individual objects in the scene, some of the background bleeds into each layer. Motion is just one of several cues used by the human visual system in perceptual grouping, and we cannot expect a system based purely on the cues of motion and brightness to do the whole job. Incorporation of the various Gestalt cues and priors on object appearance will be the subject of future research.

Figure 22. (a, b) T₀-separated input frames. (c) Estimated 3D structure for interest points. Here the viewpoint is diagonally behind the person and we can see the relative depths of the legs and arm as well as the head. (d) Detailed view of the head and shoulder region viewed from behind the person, which shows the outline of the hair, neck and shirt-sleeve.

Figure 23. Ternus display. The motion of the dots is ambiguous; additional assumptions are needed to recover their true motion.

Our second assumption is that the appearance of an object across the two frames remains the same. While we do not believe that this assumption can be done away with completely, it can be relaxed. Our feature extraction, description, and matching are based on a fixed set of filters. This gives us a limited degree of rotation and scale invariance. We believe that the matching stage of our algorithm can benefit from the work on affine-invariant feature point description (Mikolajczyk and Schmid, 2002) and on feature matching algorithms based on spatial propagation of good matches (Lhuillier and Quan, 2002).

Our third assumption is that the individual motion fields are predominantly planar. This is not a strict requirement and is only needed insofar as we are able to obtain the initial planar fits. The actual motion estimation and segmentation are based on the more flexible spline-based model.

8. Conclusion

In this paper, we have presented a solution to the problem of motion segmentation for the case of large disparity motion and given experimental validation of our method. We have also presented an extension to handle non-planar/non-rigid motion, as well as applications to automatic object deletion and structure from periodic motion. Our approach combines the strengths of feature-based approaches (i.e., no limits on the disparity between frames) and of direct, optical flow-based methods (i.e., a dense segmentation and dense correspondences).

Acknowledgments

Our approach for motion segmentation has appeared in Wills et al. (2003), the extension to non-planar motion in Wills and Belongie (2004), and the approach for structure from periodic motion in Belongie and Wills (2004).

We would like to thank Charless Fowlkes and Ben Ochoa for helpful discussions. The images in Figs. 13 and 14 are used courtesy of Dr. Philip Torr. This work was partially supported under the auspices of the U.S. Department of Energy by the Lawrence Livermore National Laboratory under contract No. W-7405-ENG-48, by an NSF IGERT Grant (Vision and Learning in Humans and Machines, #DGE-0333451), by an NSF CAREER Grant (Algorithms for Nonrigid Structure from Motion, #0448615) and by the Alfred P. Sloan Research Fellowship.

Notes

1. Ns = 10^4, nt = 300, rp = 2, rc = 25, τ = 10, k = 2, λ = 0.285. Image brightnesses are in the range [0, 1].

2. k = 2, λ = 0.285, τp = 15, μ1 = 50, μ2 = 20, μ3 = 1, τ1 = 15, τ2 = 10, τ3 = 5. Here, k, λ, and τp refer to parameters in Section 3.

References

Ayer, S. and Sawhney, H. 1995. Layered representation of motion video using robust maximum-likelihood estimation of mixture models and MDL encoding. In ICCV 95, pp. 777–784.
Belongie, S., Malik, J., and Puzicha, J. 2002. Shape matching and object recognition using shape contexts. IEEE Trans. Pattern Analysis and Machine Intelligence, 24(4):509–522.
Belongie, S. and Wills, J. 2004. Structure from periodic motion. In Spatial Coherence for Visual Motion Analysis, Prague, Czech Republic.
Black, M. and Jepson, A. 1996. Estimating optical flow in segmented images using variable-order parametric models with local deformations. IEEE Trans. Pattern Analysis and Machine Intelligence, 18:972–986.
Bookstein, F.L. 1989. Principal warps: Thin-plate splines and decomposition of deformations. IEEE Trans. Pattern Analysis and Machine Intelligence, 11(6):567–585.
Boykov, Y., Veksler, O., and Zabih, R. 1999. Approximate energy minimization with discontinuities. In IEEE International Workshop on Energy Minimization Methods in Computer Vision, pp. 205–220.
Boykov, Y., Veksler, O., and Zabih, R. 2001. Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Analysis and Machine Intelligence, 23(11):1222–1239.
Brand, M. 2001. Morphable 3D models from video. In CVPR01, pp. II:456–463.
Brand, M. and Bhotika, R. 2001. Flexible flow for 3D nonrigid tracking and shape recovery. In CVPR01, pp. I:315–322.
Chui, H. and Rangarajan, A. 2000. A new algorithm for non-rigid point matching. In Proc. IEEE Conf. Comput. Vision and Pattern Recognition, pp. 44–51.
Cutler, R. and Davis, L. 2000. Robust real-time periodic motion detection, analysis, and applications. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(8).
Darrell, T. and Pentland, A. 1991. Robust estimation of a multi-layer motion representation. In Proc. IEEE Workshop on Visual Motion, Princeton, NJ.
Donato, G. and Belongie, S. 2002. Approximate thin plate spline mappings. In Proc. 7th Europ. Conf. Comput. Vision, Vol. 2, pp. 531–542.
Duchon, J. 1976. Fonctions-spline et espérances conditionnelles de champs gaussiens. Ann. Sci. Univ. Clermont Ferrand II Math., 14:19–27.
Duchon, J. 1977. Splines minimizing rotation-invariant semi-norms in Sobolev spaces. In Constructive Theory of Functions of Several Variables, W. Schempp and K. Zeller (Eds.), Berlin: Springer-Verlag, pp. 85–100.
Fischler, M. and Bolles, R. 1981. Random sample consensus: A paradigm for model fitting with application to image analysis and automated cartography. Commun. Assoc. Comp. Mach., 24:381–395.
Förstner, W. and Gülch, E. 1987. A fast operator for detection and precise location of distinct points, corners and centres of circular features. In Intercommission Conference on Fast Processing of Photogrammetric Data, Interlaken, Switzerland, pp. 281–305.
Freeman, W.T. and Adelson, E.H. 1991. The design and use of steerable filters. IEEE Trans. Pattern Analysis and Machine Intelligence, 13(9):891–906.
Girosi, F., Jones, M., and Poggio, T. 1995. Regularization theory and neural networks architectures. Neural Computation, 7(2):219–269.
Hartley, R.I. and Zisserman, A. 2000. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521623049.
Irani, M. and Anandan, P. 1999. All about direct methods. In Vision Algorithms: Theory and Practice, W. Triggs, A. Zisserman, and R. Szeliski (Eds.), Springer-Verlag.
Irani, M. and Peleg, S. 1993. Motion analysis for image enhancement: Resolution, occlusion, and transparency. Journal of Visual Communication and Image Representation, 4(4):324–335.
Jones, D. and Malik, J. 1992. Computational framework for determining stereo correspondence from a set of linear spatial filters. Image and Vision Computing, 10(10):699–708.
Kleinberg, J. and Tardos, E. 1999. Approximation algorithms for classification problems with pairwise relationships: Metric labeling and Markov random fields. In Proc. IEEE Symposium on Foundations of Computer Science.
Lhuillier, M. and Quan, L. 2002. Match propagation for image-based modeling and rendering. IEEE Trans. Pattern Analysis and Machine Intelligence, 24(8):1140–1146.
Liu, Y., Collins, R., and Tsin, Y. 2002. Gait sequence analysis using Frieze patterns. In Proc. 7th Europ. Conf. Comput. Vision.
Meinguet, J. 1979. Multivariate interpolation at arbitrary points made simple. J. Appl. Math. Phys. (ZAMP), 5:439–468.
Mikolajczyk, K. and Schmid, C. 2002. An affine invariant interest point detector. In European Conference on Computer Vision, Springer, Copenhagen, pp. 128–142.
Odobez, J.-M. and Bouthemy, P. 1998. Direct incremental model-based image motion segmentation for video analysis. Signal Processing, 66(2):143–155.
Powell, M.J.D. 1995. A thin plate spline method for mapping curves into curves in two dimensions. In Computational Techniques and Applications (CTAC95), Melbourne, Australia.
Sawhney, H.S. and Hanson, A.R. 1993. Trackability as a cue for potential obstacle identification and 3D description. International Journal of Computer Vision, 11(3):237–265.
Seitz, S.M. and Dyer, C.R. 1996. View morphing. In SIGGRAPH, pp. 21–30.
Smola, A. and Schölkopf, B. 2000. Sparse greedy matrix approximation for machine learning. In ICML.
Soatto, S. and Yezzi, A.J. 2002. Deformotion: Deforming motion, shape average and the joint registration and segmentation of images. In European Conference on Computer Vision, Springer, Copenhagen, pp. 32–47.
Szeliski, R. and Coughlan, J. 1994. Hierarchical spline-based image registration. In IEEE Conference on Computer Vision and Pattern Recognition, Seattle, Washington, pp. 194–201.
Szeliski, R. and Shum, H.-Y. 1996. Motion estimation with quadtree splines. IEEE Trans. Pattern Analysis and Machine Intelligence, 18(12):1199–1210.
Tomasi, C. and Kanade, T. 1991. Factoring image sequences into shape and motion. In Proc. IEEE Workshop on Visual Motion, IEEE.
Torr, P.H.S. 1998. Geometric motion segmentation and model selection. In Philosophical Transactions of the Royal Society A, J. Lasenby, A. Zisserman, R. Cipolla, and H. Longuet-Higgins (Eds.), Royal Society, pp. 1321–1340.
Torr, P.H.S., Szeliski, R., and Anandan, P. 1999. An integrated Bayesian approach to layer extraction from image sequences. In Seventh International Conference on Computer Vision, Vol. 2, pp. 983–991.
Torr, P.H.S., Zisserman, A., and Murray, D.W. 1995. Motion clustering using the trilinear constraint over three views. In Europe-China Workshop on Geometrical Modelling and Invariants for Computer Vision, R. Mohr and C. Wu (Eds.), Springer-Verlag, pp. 118–125.
Torresani, L., Bregler, C., and Hertzmann, A. 2003. Learning non-rigid 3D shape from 2D motion. In NIPS 2003.
Torresani, L. and Hertzmann, A. 2004. Automatic non-rigid 3D modeling from video. In ECCV04, Vol. II, pp. 299–312.
Torresani, L., Yang, D., Alexander, G., and Bregler, C. 2001. Tracking and modelling non-rigid objects with rank constraints. In IEEE Conference on Computer Vision and Pattern Recognition, Kauai, Hawaii, pp. 493–500.
Vidal, R. and Ma, Y. 2004. A unified algebraic approach to 2-D and 3-D motion segmentation. In Proc. European Conf. Comput. Vision, Prague, Czech Republic.
Wahba, G. 1990. Spline Models for Observational Data. SIAM.
Wang, J. and Adelson, E.H. 1993. Layered representation for motion analysis. In Proc. Conf. Computer Vision and Pattern Recognition, pp. 361–366.
Weiss, Y. 1997. Smoothness in layers: Motion segmentation using nonparametric mixture estimation. In Proc. IEEE Conf. Comput. Vision and Pattern Recognition, pp. 520–526.
Wills, J., Agarwal, S., and Belongie, S. 2003. What went where. In Proc. IEEE Conf. Comput. Vision and Pattern Recognition, Vol. 1, Madison, WI, pp. 37–44.
Wills, J. and Belongie, S. 2004. A feature-based approach for determining long range correspondences. In Proc. European Conf. Comput. Vision, Vol. 3, Prague, Czech Republic, pp. 170–182.
Xiao, J., Chai, J., and Kanade, T. 2004. A closed-form solution to non-rigid shape and motion recovery. In Proc. European Conf. Comput. Vision, Prague, Czech Republic.
Xiao, J. and Shah, M. 2004. Motion layer extraction in the presence of occlusion using graph cuts. In CVPR04, Washington, D.C.