Exploring Defocus Matting: Nonparametric Acceleration, Super-Resolution, and Off-Center Matting

Neel Joshi, University of California, San Diego
Wojciech Matusik, Shai Avidan, and Hanspeter Pfister, Mitsubishi Electric Research Labs
William T. Freeman, Massachusetts Institute of Technology

Defocus matting is a fully automatic and passive method for pulling mattes from video captured with coaxial cameras that have different depths of field and planes of focus. Nonparametric sampling can accelerate the video-matting process from minutes to seconds per frame. In addition, a super-resolution technique efficiently bridges the gap between mattes from high-resolution video cameras and those from low-resolution cameras. Off-center matting pulls mattes for an external high-resolution camera that doesn't share the same center of projection as the low-resolution cameras used to capture the defocus matting data.



Image matting and compositing are important operations in image editing, photography, and film production. Matting separates a foreground element from an image by estimating a color, F, and an opacity, α, for each foreground pixel. The set of all α values is the alpha matte. Compositing blends the extracted foreground element, F, on top of an opaque background image, B, using linear blending: I[x; y] = αF + (1 − α)B. Given an image, I, matting solves the inverse problem with seven unknowns (α, Fr, Fg, Fb, Br, Bg, Bb) and three constraints (Ir, Ig, Ib).
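As a concrete illustration of this pixel model, here is a minimal Python sketch that composites a single pixel and counts the unknowns matting would have to recover; the color and opacity values are made up for the example.

```python
import numpy as np

# Hypothetical per-pixel values in [0, 1]; these are illustrative, not from the article.
F = np.array([0.9, 0.2, 0.1])   # foreground color (Fr, Fg, Fb)
B = np.array([0.1, 0.4, 0.8])   # background color (Br, Bg, Bb)
alpha = 0.6                     # opacity

# Compositing: I = alpha * F + (1 - alpha) * B gives three observed values.
I = alpha * F + (1 - alpha) * B
print(I)  # -> [0.58 0.28 0.38]

# Matting must invert this: from the three observations (Ir, Ig, Ib) it would
# have to recover seven unknowns (alpha, Fr, Fg, Fb, Br, Bg, Bb), which is why
# the problem is ill-posed without extra constraints such as a known background
# or the defocus cues used in this article.
```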

To make the matting problem tractable, most commercial matting approaches use a background with known, constant color. This is called blue screen matting (see the "Previous Work in Matting and Compositing" sidebar), even though green is preferable when shooting with digital cameras. Unfortunately, blue screen matting is an intrusive, expensive process unavailable to amateur users. The ideal matting approach works for scenes with arbitrary, unknown, and possibly dynamic backgrounds. This natural image matting typically requires substantial manual labor. However, two fully automatic natural image matting solutions have recently been developed. They acquire additional data during scene capture using special imaging devices.

In previous work, we (Joshi, Matusik, and Avidan) describe using a camera array to create a synthetic aperture image that focuses on the foreground object.1 We estimate the variance of foreground and background pixels and compute a high-quality alpha matte at several frames per second (fps). We also show how to transfer the computed alpha matte to one of the cameras in the array using image-based rendering. Although camera arrays might be feasible for professional use, they can be impractical for certain applications.

McGuire et al.2 developed a fully automated method that computes alpha mattes using three video cameras that share a common center of projection but vary in depth of field and focal plane. The additional defocus information constrains the original ill-posed matting problem. This defocus matting approach can compute high-quality mattes from natural images without user assistance. The special-purpose defocus camera uses three imagers that share an optical axis using beam splitters. Defocus matting is fully automatic, passive, and could be implemented compactly using methods similar to those used for constructing 3-charge-coupled-device (CCD) cameras. However, the approach also has several limitations:

■ Each camera's resolution limits the alpha matte's resolution. All three must have the same resolution, so producing high-resolution results requires that you use three high-resolution cameras.

■ Matte computation is time consuming, taking several minutes per frame. Consequently, it's somewhat impractical for long video clips.

■ Defocus matting is impractical for photographers, who might be reluctant to place a beam splitter between their high-definition, high-quality cameras and the scene.

In this article, we address these limitations and extend defocus matting in several important ways.

Improvements on defocus matting
Our improved hardware setup allows more accurate alignment of cameras and beam splitters and more efficient light splitting, thus improving the input images' noise characteristics.

Our approach also addresses the speed of McGuire et al.'s optimization procedure. Their process requires many minutes of computation per frame and, when used for processing videos, runs independently on each video frame. We accelerate the method for video sequences by using a nonparametric sampling technique that relies on the optimization results from previous frames.

Third, we consider a setup consisting of a defocus camera where the primary camera is high resolution and the other two cameras are low resolution. We show how to handle the resolution difference by adapting an image-restoration technique to perform super-resolution to create alpha mattes that match the primary camera's resolution.

This setup gives good results, but it requires placing beam splitters in front of all the cameras, which might be undesirable in some situations. Thus, we extend defocus matting to handle the case where an off-center camera doesn't share a center of projection with the low-resolution cameras. In such a setup, we attach the three defocus cameras next to the off-center camera and align them in software instead of hardware.

The A-Cam
The A-Cam consists of a compact collection of three video cameras, each with a different plane of focus and depth of field, that share a single center of projection (Figure 1). The A-Cam is an improved version of McGuire et al.'s camera for defocus matting.2

The three cameras are axis-aligned and image the scene through a tree of beam splitters. We focus one camera, IF, on the foreground, another camera, IB, on the background, and a pinhole camera, IP, on the entire scene. The foreground and background cameras have a large aperture with a narrow depth of field, whereas the pinhole camera has a small aperture with a large depth of field. Figures 2a to 2c give examples of these images.

The defocus in the foreground and background cameras occurs because the cone of rays from a point in the scene intersects the image plane at a disk. We describe the resulting point-spread function (PSF) or circle of confusion as

rF = f²(zR − zF) / (2σφ zR(zF − f))

where the camera is focused at depth zF, the point is at depth zR, φ is the f-number, f is the focal length, and σ is the width of a pixel.2 Depths are positive distances in front of the lens. To a good approximation, we can express images from the three cameras, IP, IB, and IF, using the following equations:

IP = αF + (1 − α)B (1)
IB = (αF) ⊗ disk(rF) + (1 − α ⊗ disk(rF))B (2)
IF = αF + (1 − α)(B ⊗ disk(rB)) (3)

where ⊗ denotes convolution.
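For intuition, here is a minimal single-channel Python sketch of this image-formation model (Equations 1 through 3). The blur radii, image sizes, and helper names are illustrative assumptions, not values or code from the article.

```python
import numpy as np
from scipy.ndimage import convolve

def disk_kernel(radius):
    """Normalized circular (pillbox) averaging kernel with the given pixel radius."""
    r = int(np.ceil(radius))
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    k = (x**2 + y**2 <= radius**2).astype(float)
    return k / k.sum()

def simulate_acam(alpha, F, B, rF=3.0, rB=5.0):
    """Synthesize pinhole, background-focused, and foreground-focused images
    from alpha, F, and B following Equations 1-3 (single channel)."""
    dF, dB = disk_kernel(rF), disk_kernel(rB)
    I_P = alpha * F + (1 - alpha) * B
    I_B = convolve(alpha * F, dF, mode='nearest') \
        + (1 - convolve(alpha, dF, mode='nearest')) * B
    I_F = alpha * F + (1 - alpha) * convolve(B, dB, mode='nearest')
    return I_P, I_B, I_F

# Toy example: a soft-edged square foreground over a gradient background.
h = w = 64
alpha = np.zeros((h, w))
alpha[16:48, 16:48] = 1.0
alpha = convolve(alpha, disk_kernel(2.0), mode='nearest')
F = np.full((h, w), 0.9)
B = np.tile(np.linspace(0.1, 0.6, w), (h, 1))
I_P, I_B, I_F = simulate_acam(alpha, F, B)
```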


Previous Work in Matting and Compositing

Matting and compositing are important tasks in television, film production, and publishing and have attracted researchers' attention since at least the late 1950s. Vlahos's initial work on matting and compositing led to the development of the UltiMatte system. Wallace1 and Smith and Blinn2 formalized digital compositing for film production mathematically, including the invention of two-background matte extraction.

Two-background matte extraction shows that by imaging a foreground object against two backgrounds with different intensity (or color), you can derive an expression for the α and foreground color. Zongker et al.3 introduced environment matting, an extension of alpha matting to cover more complex light transport effects (such as specular reflection or refraction).
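For intuition on the two-background result (a brief sketch, assuming the two backgrounds differ at every pixel and working per color channel): subtracting the two observations I1 = αF + (1 − α)B1 and I2 = αF + (1 − α)B2 gives I1 − I2 = (1 − α)(B1 − B2), so α = 1 − (I1 − I2)/(B1 − B2), and αF then follows from either equation.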

Estimating α and foreground color for scenes in which the background can't be controlled is often referred to as natural image matting. This task is much more difficult because the solution is typically underconstrained. To constrain the problem, most methods make assumptions about the frequency content of background, foreground, and alpha.4-6 Furthermore, they rely on a user-specified trimap that segments the image into definitely foreground, definitely background, and unknown regions. These algorithms analyze unknown pixels using local color distributions. Natural image matting methods have been extended to video by propagating user-specified trimaps with optical flow.7

The Zcam (www.3dvsystems.com) and the work of Yasuda et al.8 follow some of the same principles as our method in that they film additional data during capture to aid in matting. Both approaches also include a beam splitter so the extra camera shares the same optical axis as the main camera; however, the Zcam uses active illumination, which is undesirable for many applications, while the work of Yasuda et al. focuses on segmenting people using passive IR.

References
1. B. Wallace, "Automated Production Techniques in Cartoon Animation," master's thesis, Cornell Univ., 1982.
2. A.R. Smith and J.F. Blinn, "Blue Screen Matting," Computer Graphics (Proc. Siggraph), vol. 30, ACM Press, 1996, pp. 259-268.
3. D.E. Zongker et al., "Environment Matting and Compositing," Proc. 26th Ann. Conf. Computer Graphics and Interactive Techniques, ACM Press/Addison-Wesley, 1999, pp. 205-214.
4. M. Ruzon and C. Tomasi, "Alpha Estimation in Natural Images," Proc. IEEE Computer Soc. Conf. Computer Vision and Pattern Recognition (CVPR), IEEE CS Press, 2000, pp. 18-25.
5. Y.-Y. Chuang et al., "A Bayesian Approach to Digital Matting," Proc. IEEE Computer Soc. Conf. Computer Vision and Pattern Recognition (CVPR), IEEE CS Press, 2001, pp. 264-271.
6. J. Sun et al., "Poisson Matting," ACM Trans. Graphics, vol. 23, no. 3, 2004, pp. 315-321.
7. Y.-Y. Chuang et al., "Video Matting of Complex Scenes," ACM Trans. Graphics, vol. 21, no. 3, 2002, pp. 243-248.
8. K. Yasuda, T. Naemura, and H. Harashima, "Thermo-key: Human Region Segmentation from Video," IEEE Computer Graphics and Applications, vol. 24, no. 1, 2004, pp. 26-30.


Equations 1, 2, and 3 give us seven unknowns (α, Fr, Fg, Fb, Br, Bg, Bb) and nine constraints (three per equation). The convolutions with disk(rF) and disk(rB) suggest that we can't solve for each pixel independently. Instead, we search for a global solution that minimizes this error function:

J = (IP − ÎP)² + (IB − ÎB)² + (IF − ÎF)² (4)

where IP, IF, and IB are the observed images, and ÎP, ÎF, and ÎB are the reconstructed images. We find the minimum using McGuire et al.'s sparse nonlinear optimization procedure.2 This procedure finds a solution by iteratively updating initial values for the unknowns, taking steps along the error function's gradient toward a minimum value. Because the convolutions in Equations 2 and 3 are linear operations and Equation 1 is also linear, we can efficiently compute these equations' derivatives and then easily compute the error function's gradient. The optimization procedure also includes several regularization terms, such as terms enforcing spatial smoothness for the recovered α, F, and B.
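The sketch below is a toy illustration of the data term only: it evaluates the residuals behind Equation 4 and recovers α on a small patch with a generic least-squares solver, assuming F and B are known. It omits the regularization terms and the analytic sparse gradients the actual optimization uses; all sizes, radii, and names are illustrative.

```python
import numpy as np
from scipy.ndimage import convolve
from scipy.optimize import least_squares

def disk_kernel(radius):
    r = int(np.ceil(radius))
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    k = (x**2 + y**2 <= radius**2).astype(float)
    return k / k.sum()

def residuals(alpha_flat, F, B, I_P, I_B, I_F, dF, dB):
    """Stacked residuals of Equations 1-3; J in Equation 4 is their sum of squares."""
    a = alpha_flat.reshape(F.shape)
    rP = a * F + (1 - a) * B - I_P
    rB = convolve(a * F, dF, mode='nearest') \
       + (1 - convolve(a, dF, mode='nearest')) * B - I_B
    rF = a * F + (1 - a) * convolve(B, dB, mode='nearest') - I_F
    return np.concatenate([rP.ravel(), rB.ravel(), rF.ravel()])

# Toy ground truth on a 16 x 16 patch (made-up values).
h = w = 16
alpha_true = np.zeros((h, w))
alpha_true[4:12, 4:12] = 1.0
F = np.full((h, w), 0.9)
B = np.full((h, w), 0.2)
dF, dB = disk_kernel(2.0), disk_kernel(3.0)
I_P = alpha_true * F + (1 - alpha_true) * B
I_B = convolve(alpha_true * F, dF, mode='nearest') \
    + (1 - convolve(alpha_true, dF, mode='nearest')) * B
I_F = alpha_true * F + (1 - alpha_true) * convolve(B, dB, mode='nearest')

# Recover alpha only, from a flat initial guess, with F and B held fixed.
res = least_squares(residuals, x0=np.full(h * w, 0.5),
                    args=(F, B, I_P, I_B, I_F, dF, dB), bounds=(0.0, 1.0))
print(np.abs(res.x.reshape(h, w) - alpha_true).max())
```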

We've improved the defocus camera hardware in two ways.

First, we constructed a rigid case to hold the cameras and beam splitters such that the optical centers are the same. This improves the cameras' overall alignment. In addition, the defocus camera is now self-contained, and the entire unit is much more portable and can be mounted on a tripod, as Figure 1 shows.

We also modified the beam splitters. The original design uses two 50/50 beam splitters. The pinhole camera receives half the light and the foreground- and background-focused cameras each receive one-quarter. However, because the camera apertures don't reflect this division of light, the pinhole doesn't receive enough light and the other cameras receive more than they need. As a result, the pinhole camera's images suffer from noise because of lack of light, and the other two cameras tend to be overexposed as they receive too much light. We modified the design to use a 90/10 beam splitter to send 90 percent of the light to the pinhole with the remaining 10 percent split equally between the other two cameras. The ratio of the apertures now matches these splitting ratios so that each sensor receives the same amount of light.

Even though the three cameras are better aligned and matched in light exposure, we still need to colorimetrically and geometrically align them. We can calibrate color through simple histogram matching to match the A-Cam's foreground- and background-focused cameras to the pinhole camera. We compute the histogram remapping on smoothed versions of the images to limit the effect of noise in the calibration process. The cameras' geometric alignment is straightforward. Because the cameras are coaxial, we need to correct only for each camera sensor's possible rotation and translation, which we do using one homography per camera. We do this by placing a grid in the scene, detecting feature points on the grid, and computing a homography for the foreground and background cameras to align them to the pinhole camera.
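A rough sketch of these two calibration steps, assuming grayscale float images and detected grid correspondences. The function names, blur size, and use of skimage/OpenCV are our illustrative choices, and for simplicity this version histogram-matches the smoothed images directly rather than applying the remapping back to the originals.

```python
import numpy as np
import cv2
from skimage.exposure import match_histograms

def color_calibrate(cam_img, pinhole_img, ksize=5):
    """Histogram-match one A-Cam camera to the pinhole camera.
    Smoothing first limits the influence of sensor noise on the mapping."""
    cam_s = cv2.GaussianBlur(cam_img, (ksize, ksize), 0)
    ref_s = cv2.GaussianBlur(pinhole_img, (ksize, ksize), 0)
    return match_histograms(cam_s, ref_s)

def geometric_calibrate(cam_grid_pts, pinhole_grid_pts, cam_img, out_size):
    """One homography per camera: map grid points detected in a camera's image
    onto the corresponding points in the pinhole image, then warp that camera."""
    H, _ = cv2.findHomography(np.float32(cam_grid_pts),
                              np.float32(pinhole_grid_pts), cv2.RANSAC)
    return cv2.warpPerspective(cam_img.astype(np.float32), H, out_size)
```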

Nonparametric video matting
The original defocus matting work makes no distinction between still image matting and video matting, and creates mattes for video by processing each frame independently. With many minutes of CPU time per frame and video rates of 30 frames per second (fps), processing video clips quickly becomes impractical. To address this issue, we accelerate the video-matting process using a nonparametric model built on-the-fly from our optimization results. For videos with strong temporal consistency, where "past predicts future" is a reliable assumption, this process provides a significant speedup. Our method starts with a training phase in which we individually process the first k frames of a video sequence using our full optimization procedure (we use k = 5). Then, during the processing phase, we use the results of these k frames to predict the result for a subsequent frame i using a method similar to image analogies.3

Specifically, we create an input feature vector, V, for each pixel. This vector contains the corresponding pixel values from IP, IB, and IF and the disk(rB)- and disk(rF)-size neighborhoods of pixels from IB and IF, respectively. (The sizes of disk(rB) and disk(rF) are a function of the camera lens settings and the foreground and background depths. These disks model the lens' PSF at the foreground and background depths2—as a rule of thumb, they're generally on the order of two to 10 pixels wide for common defocus matting setups.) We don't use a neighborhood from IP because surrounding pixels don't affect the pinhole image constraint. For the neighborhoods of pixels, instead of using pixel values directly, we use a rotationally invariant transform of the pixel data.4 This transform collapses the feature space significantly, allowing our model to generalize to a larger range of input data with many fewer samples.


Figure 1. The A-Cam. (a) The A-Cam is a collection of three video cameras that share a single center of projection by using a tree of beam splitters. The black case houses the cameras and beam splitters. (b) The A-Cam (red rectangle) is attached to a standard consumer-level camera (green rectangle) for off-center matting.


We populate a lookup table with pairs (V(p), [α(p), B(p), F(p)]) for each pixel, p, in each of the training frames 1 through k. We then predict values for a subsequent frame i using a process similar to that used by Hertzmann et al.3 In scanline order, for pixel q in frame i, we

1. Construct the input feature vector V(q).
2. Find the five nearest-neighbor pixels—the five pixels closest to q in terms of the distance dj = ||V(q) − V(pj)||².
3. From these nearest-neighbor pixels, select the smoothest pixel, psmooth—that is, the pixel whose sum of squared differences between its corresponding α and the α values for the already computed neighboring pixels in the alpha matte is minimal.
4. Set dsmooth = ||V(q) − V(psmooth)||².
5. From the nearest-neighbor pixels, find the nearest pixel, pn—that is, the pixel whose distance dn = ||V(q) − V(pn)||² is smallest.
6. If dsmooth ≤ ε dn, set [α(q), B(q), F(q)] = [α(psmooth), B(psmooth), F(psmooth)]; else set [α(q), B(q), F(q)] = [α(pn), B(pn), F(pn)].

Intuitively, we construct the result for frame i pixel by pixel, where we set the result equal to the corresponding result for either the closest match, or equal to the result for a close match whose corresponding alpha is similar to those already computed in the image. The ε parameter lets us adjust the result's smoothness. We've empirically found that ε = 2 works well. For efficiency, instead of finding exact nearest neighbors, we use an approximate nearest-neighbor algorithm with a small tolerance. Once we've predicted the entire frame, we evaluate the error function in Equation 4 with this newly created result. For any pixels whose error is greater than some tolerance, we refine the estimate by running the optimization. We then add these newly optimized pixels back into our lookup table so we can continually update our nonparametric model.
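Here is a minimal sketch of this prediction step in Python, assuming the feature vectors have already been built. Function and variable names are ours, it uses only the two causal (left and above) neighbors for the smoothness test, and it uses an exact KD-tree in place of the approximate nearest-neighbor search described above.

```python
import numpy as np
from scipy.spatial import cKDTree

def predict_frame(feats, coords, table_feats, table_vals, shape, eps=2.0):
    """Predict [alpha, B, F] for each pixel of a new frame from the lookup table.

    feats       : (N, d) feature vectors for the new frame, in scanline order
    coords      : (N, 2) corresponding (row, col) pixel coordinates
    table_feats : (M, d) feature vectors from the training frames
    table_vals  : (M, 3) corresponding [alpha, B, F] values (single channel here)
    shape       : (height, width) of the frame
    """
    tree = cKDTree(table_feats)
    d, idx = tree.query(feats, k=5)          # five nearest neighbors per pixel
    d = d ** 2                               # squared distances, as in steps 2-5
    alpha_out = np.zeros(shape)
    preds = np.empty((len(feats), table_vals.shape[1]))
    for i, (y, x) in enumerate(coords):
        # Alpha values of already-computed causal neighbors in scanline order.
        prev = []
        if x > 0:
            prev.append(alpha_out[y, x - 1])
        if y > 0:
            prev.append(alpha_out[y - 1, x])
        cand_alpha = table_vals[idx[i], 0]
        if prev:
            cost = ((cand_alpha[:, None] - np.array(prev)[None, :]) ** 2).sum(axis=1)
            j_smooth = int(np.argmin(cost))  # step 3: smoothest candidate
        else:
            j_smooth = 0
        j_near = int(np.argmin(d[i]))        # step 5: nearest candidate
        j = j_smooth if d[i][j_smooth] <= eps * d[i][j_near] else j_near  # step 6
        preds[i] = table_vals[idx[i][j]]
        alpha_out[y, x] = preds[i][0]
    return preds, alpha_out
```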

With this approach, the number of pixels to be optimized depends on the video's nature. If the video exhibits slow transitions, the nonparametric method should suffice. In the case of abrupt changes, say due to camera motion, we'll have to evaluate more pixels using the time-consuming optimization method. In this latter case, after we add the frame to the model, the system is bootstrapped and can return to computing subsequent frames primarily from the model. By varying the tolerance, we can trade accuracy for speed, where a value of 0 would perform full optimization for each frame and a value of infinity would compute results using only the nonparametric model from the first k frames. An in-between value gives a mixture of the two, which we refer to as mixed optimization.
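The control flow of this mixed optimization can be summarized as below. This is only a structural sketch under our own naming; the prediction, per-pixel error evaluation, optimizer, and model update are passed in as callables rather than spelled out.

```python
def mixed_optimization(frames, predict, frame_error, optimize_pixels, update_model,
                       tol=0.003):
    """Per-frame mixed optimization: predict from the nonparametric model, then
    re-run the full optimizer only on pixels whose Equation 4 error exceeds tol.
    tol = 0 degenerates to full optimization; tol = infinity uses the model alone."""
    results = []
    for frame in frames:
        estimate = predict(frame)                         # fast nonparametric pass
        error = frame_error(frame, estimate)              # per-pixel reconstruction error
        bad = error > tol
        if bad.any():
            estimate = optimize_pixels(frame, estimate, bad)  # slow pass, bad pixels only
            update_model(frame, estimate, bad)            # grow the lookup table
        results.append(estimate)
    return results
```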

Alpha super-resolution
The resolution of the three cameras currently limits the resolution of the mattes pulled with defocus matting, so high-resolution results require three high-resolution cameras. Although using high-resolution cameras might seem reasonable, increases in computation time, memory and bandwidth usage, and camera cost can quickly make this option impractical. Using full optimization, the computation time for one 320 × 240 frame is about five to 10 minutes, which would put a 1-megapixel video at well over an hour per frame. Our nonparametric method will accelerate this process. However, when building a nonparametric model from high-resolution images, the model's size can become prohibitively large. Another concern is the limited bandwidth of cameras and their capture devices. Generally, cameras trade frame rate for increased resolution—a trade-off that's nice to avoid if possible.

To combat these issues, we propose a method that computes high-resolution alpha mattes by replacing the A-Cam's pinhole camera with a high-resolution camera, leaving the other two cameras the same. We downsample the high-resolution pinhole image, IPH, to get IP to match the resolution of IF and IB. We then use IP, IF, and IB as before to compute mattes at 320 × 240 resolution using our nonparametrically accelerated optimization method. We then use a super-resolution technique as a postprocess to upgrade the alpha matte to the primary camera's native resolution.

Figure 2. Results from the new A-Cam: (a) pinhole image of the A-Cam, (b) background-focused image, (c) foreground-focused image, (d) automatically computed trimap, (e) the alpha matte at the A-Cam resolution, (f) the super-resolution alpha matte, and (g) super-resolution alpha-premultiplied foreground image composited with a new background.

Specifically, given α and F, as computed using the method described in the previous sections, and given a blur function f, which models the resolution gap, where f(IPH) = IP, we want to find αH and FH such that f(αH) = α and f(FH) = F. This is an underconstrained problem—there are many high-resolution α and F images that could blur to the low-resolution versions. Thus, we solve this problem using an edge-preserving regularization term that's based on anisotropic diffusion. Researchers have used similar techniques for image restoration, and the following derivation mirrors those used in these techniques (see Kornprobst et al.5 for a more detailed explanation of this derivation).

Specifically, we want to optimize the following regularized error function: J = (AαH − α)² + λ ∫∫ φ(|∇(αH)|) dx dy, where A is a matrix that encodes the blur function f, αH and α are given in vector form, ∇(x) is the gradient of x, and φ(x) is an edge-preserving function, φ(x) = 1/(1 + x²). This regularization term lets us perform smoothing when gradient values are low; for high gradients, we can apply smoothing only along an edge and not across it.

We can minimize this error function using a variational method. Applying the Euler-Lagrange equation yields a partial differential equation (PDE) whose solution minimizes the error function:

AᵀAαH + λ div( φ′(∇(αH)) ∇(αH) ) = Aᵀα

Imposing Neumann boundary conditions, which specify that values of the solution's gradient in the direction normal to the boundary are zero, lets us convert the divergence term to a Laplacian matrix, B, that is spatially varying as a function of the gradients of αH:

AᵀAαH + λ B(φ′(∇(αH))) αH = Aᵀα (5)

We then assume ∇(αH) = ∇(IPH). This enforces that edges in the original image are preserved in the sharpened alpha matte. This assumption is valid for depth edges, which dominate in the unknown region, because gradients due to depth discontinuities correspond to edges in the alpha matte. Although this assumption is invalid for gradients in IP that don't appear in the alpha matte (for example, because of texture), this shouldn't cause any errors, as our regularization term is edge-preserving but shouldn't introduce edges when none are present. We also note that our super-resolution method doesn't rely on any new information relative to our original matting construct. The gradients we use for super-resolution are the same needed for defocus matting to compute an alpha matte. We perform super-resolution only on alpha in the unknown region. Equation 5 now becomes:

AᵀAαH + λ B(φ′(∇(IPH))) αH = Aᵀα (6)

We can then solve for αH in one step using a sparse linear least-squares solver. Observe that our regularization depends on the edges of IPH and not the edges of αH. This is where our method deviates from similar image restoration methods. In those methods, the regularization matrix B is a function of the edges of the image being solved for, which in this case is αH. This gives rise to a nonlinear PDE that must be solved using an iterative approach. Our solution is linear, as B is a function of a known value, IPH, and thus can be solved in one step using linear least-squares. We use the same process to obtain FH, the high-resolution foreground. Performing super-resolution for the entire image at once can be problematic for large images—even though the matrices in Equation 6 are sparse, they can still be quite large. However, because the blurring that we're trying to remove is a relatively local function, we can perform super-resolution on a block-by-block basis rather efficiently even for large images. We've empirically determined that 60-pixel subblocks that overlap by seven pixels work well for our scenes.
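A compact sketch of this one-step linear solve (in the spirit of Equation 6), assuming a single channel, an integer downsampling factor, and a generic edge-stopping weight derived from the high-resolution pinhole image in place of the paper's exact φ′; operator construction, names, and constants are illustrative, and block-wise solving is omitted.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def downsample_matrix(H, W, s):
    """Sparse operator A averaging s x s blocks: (H//s * W//s) x (H * W)."""
    h, w = H // s, W // s
    rows, cols = [], []
    for i in range(h):
        for j in range(w):
            for di in range(s):
                for dj in range(s):
                    rows.append(i * w + j)
                    cols.append((i * s + di) * W + (j * s + dj))
    vals = np.full(len(rows), 1.0 / (s * s))
    return sp.csr_matrix((vals, (rows, cols)), shape=(h * w, H * W))

def edge_weighted_laplacian(guide, eps=0.05):
    """4-neighbor Laplacian whose edge weights fall off across strong guide-image edges."""
    Hh, Ww = guide.shape
    n = Hh * Ww
    idx = np.arange(n).reshape(Hh, Ww)
    rows, cols, vals = [], [], []
    for a, b, ga, gb in [(idx[:, :-1], idx[:, 1:], guide[:, :-1], guide[:, 1:]),
                         (idx[:-1, :], idx[1:, :], guide[:-1, :], guide[1:, :])]:
        a, b = a.ravel(), b.ravel()
        w = 1.0 / (1.0 + ((ga.ravel() - gb.ravel()) / eps) ** 2)
        rows += [a, b, a, b]
        cols += [b, a, a, b]
        vals += [-w, -w, w, w]
    rows, cols, vals = np.concatenate(rows), np.concatenate(cols), np.concatenate(vals)
    return sp.csr_matrix((vals, (rows, cols)), shape=(n, n))

def super_resolve(alpha_low, guide_high, lam=0.1):
    """Solve (A^T A + lam * L) alpha_high = A^T alpha_low in one sparse linear step."""
    Hh, Ww = guide_high.shape
    A = downsample_matrix(Hh, Ww, Hh // alpha_low.shape[0])
    L = edge_weighted_laplacian(guide_high)
    lhs = (A.T @ A + lam * L).tocsc()
    rhs = A.T @ alpha_low.ravel()
    return spsolve(lhs, rhs).reshape(Hh, Ww)
```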

Off-center alpha matting
The A-Cam uses a collection of beam splitters placed between the cameras and the scene. Although this might be acceptable in some cases, it might not be desirable for high-end photography or for filming movies where the user might not want to let any optical device, such as a beam splitter, come between the camera and the scene. We propose a hybrid camera approach6 in which we use the A-Cam as an accessory to an external off-center camera, and the cameras don't share the same center of projection, as Figure 1b shows. In this situation, we can't use a matte from the A-Cam directly as a matte for the off-center camera. Instead, we compute the alpha matte directly for the off-center camera and use the data from the A-Cam to regularize the solution.

The off-center camera and the A-Cam observe the same scene, albeit with potentially different camera settings. Still, we can assume that they're both focused on the foreground object and that we can colorimetrically and geometrically align the cameras. Given this information, how can we use the A-Cam data to regularize the alpha matte we compute for the off-center camera?

Regularized alpha matting
The external off-center camera IOFF gives us one equation with seven unknowns and three constraints:

IOFF = αOFF FOFF + (1 − αOFF)BOFF.

If the A-Cam and primary camera are colorimetrically and geometrically aligned and the foreground object is in focus in both cameras, we can assume that αOFF = α and FOFF = F. This reduces the number of unknowns by four. This gives us a set of four equations with 10 unknowns and 12 constraints:



■ IP = αF + (1 − α)B
■ IB = (αF) ⊗ disk(rF) + (1 − α ⊗ disk(rF))B
■ IF = αF + (1 − α)(B ⊗ disk(rB))
■ IOFF = αF + (1 − α)BOFF

Just as before, we can't solve this problem on a pixel-by-pixel basis because we convolve the images IB and IF with a finite-size disk. Instead, we solve this system of equations using regularized optimization. Of all the solutions that satisfy the primary camera's matting equation, we choose the one that also satisfies the alpha matting equation of the A-Cam. Specifically, let J = (IP − ÎP)² + (IB − ÎB)² + (IF − ÎF)² be the error term for the A-Cam, where ÎP, ÎB, and ÎF are the pinhole, background, and foreground images recovered by the optimization, respectively, and let JOFF = (IOFF − ÎOFF)² be the error term for the high-definition camera, where ÎOFF is the recovered off-center camera image. We solve

argmin over α, F, B, BOFF of {J + λ JOFF}, 0 ≤ λ ≤ 1, (7)

where λ is the regularization parameter controlled by the user. We've empirically found that setting λ between 0.3 and 0.5 works well.
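Expressed as stacked residuals for a generic nonlinear least-squares solver, the combined objective of Equation 7 might look like the sketch below (single channel; BOFF enters as an extra unknown, and sqrt(λ) weights the off-center term so that the sum of squared residuals equals J + λ JOFF). The names and disk-kernel inputs are illustrative, not the authors' code.

```python
import numpy as np
from scipy.ndimage import convolve

def residuals_off_center(alpha, F, B, B_off, obs, dF, dB, lam=0.4):
    """Stacked residuals whose sum of squares is J + lam * J_OFF (Equation 7).
    obs = (I_P, I_B, I_F, I_OFF); dF and dB are disk PSF kernels."""
    I_P, I_B, I_F, I_OFF = obs
    rP = alpha * F + (1 - alpha) * B - I_P
    rB = convolve(alpha * F, dF, mode='nearest') \
       + (1 - convolve(alpha, dF, mode='nearest')) * B - I_B
    rF = alpha * F + (1 - alpha) * convolve(B, dB, mode='nearest') - I_F
    r_off = alpha * F + (1 - alpha) * B_off - I_OFF
    return np.concatenate([rP.ravel(), rB.ravel(), rF.ravel(),
                           np.sqrt(lam) * r_off.ravel()])
```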

Although we assume that the off-center camera and the A-Cam can share α and F, this isn't the case with the background. This is because the background B and the background BOFF are related by a homography if they're planar, or by pixel-wise correspondence if they aren't. In addition, they might be related by a convolution (if they're defocused differently because of different depths of field of the lenses in the A-Cam and off-center camera). We avoid estimating these relationships by directly solving both for B and BOFF. We adapt McGuire et al.'s2 minimization method to minimize the function in Equation 7 by adding the additional unknown, BOFF, and error term, JOFF, to the error function and gradient computation and by incorporating the λ regularization factor; the minimization method is otherwise unchanged. We've similarly adapted our nonparametric acceleration method to incorporate this additional data.

Off-center calibration
Colorimetrically and geometrically calibrating the primary camera and the A-Cam takes several steps.

We achieve color calibration through simple histogram matching, as described earlier. For geometric calibration, we assume that we know the foreground object's depth and place a grid at that location ahead of time. (We address this assumption later.) We use corresponding points from this grid to compute a homography between the primary camera and the A-Cam.

Because this initial homography aligns features at only one depth and a typical foreground object spans a range of depths, we must perform additional alignment between the off-center camera and the A-Cam. One approach is to use pixel-wise optical flow. However, because optical flow tends to fail along object borders, we instead align the foreground by computing homographies on a block-by-block basis. Because we solve for different backgrounds for the A-Cam and off-center camera, we need to align only the foreground pixels. We do this by computing homographies that best align only the foreground pixels, as determined by our trimap. We align the background pixels in a block with the same transform as the foreground pixels. This block-based alignment preserves the structure of the foreground object in the border region by providing a rigid transform across the boundary. Our method works well as long as the object boundary lies within the block. Small seams can exist if the object boundary lies across a block boundary, so we size our blocks to approximately match the unknown region's size to minimize this situation's occurrence. We've empirically determined that 60 × 60-pixel blocks work well for our scenes.

Computing these homographies requires estimating eight parameters per homography. Fortunately, we can rely on the fact that the space of all homographies between a pair of images is a 4D space.7 Intuitively, this is true because a plane that defines the homography has four degrees of freedom. We therefore image a planar grid in 10 positions and compute the homography for each position. For these 10 homographies, we use principal components analysis (PCA) to find the four basis homographies spanning the space of all homographies between the two cameras. Then, for every pixel block, we find a homography that minimizes the sum-of-squared differences between the aligned pixel blocks. We need to solve for only four unknowns (the four coefficients of the basis homographies), which we do with an iterative Levenberg-Marquardt solver.

We return now to the assumption that we know the foreground object's depth ahead of time. This need not be the case. After we recover the four basis homography matrices between the off-center camera and the A-Cam, we can represent any homography, including the one caused by the plane going through the foreground object, as a linear combination of the basis matrices. We can estimate the foreground object's depth by estimating the plane that minimizes the image difference between the off-center camera and the A-Cam. The plane estimation involves estimating only four parameters (the coefficients of the basis homographies) and not the eight parameters of a homography. Note that even if the off-center camera moves, as long as the A-Cam remains rigidly attached to it, these basis homographies can continue to be used for alignment.
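The sketch below outlines this basis-plus-block-fit idea, assuming ten calibration homographies are already available and that the images are expressed in a common pixel frame. The PCA here uses the mean plus the top principal directions of the normalized 3 × 3 matrices, and scipy's Levenberg-Marquardt with numerical differences stands in for the authors' solver; all names are illustrative.

```python
import numpy as np
import cv2
from scipy.optimize import least_squares

def homography_basis(H_list, k=4):
    """Mean and top-k principal directions of the calibration homographies,
    each normalized and flattened to a 9-vector."""
    M = np.stack([(H / H[2, 2]).ravel() for H in H_list])   # e.g., 10 x 9
    mean = M.mean(axis=0)
    _, _, Vt = np.linalg.svd(M - mean, full_matrices=False)
    return mean, Vt[:k]

def fit_block(coeffs0, mean, basis, src_img, dst_img, block_mask):
    """Find basis coefficients whose homography best aligns one block's pixels."""
    h, w = dst_img.shape

    def residual(c):
        H = (mean + c @ basis).reshape(3, 3)
        warped = cv2.warpPerspective(src_img.astype(np.float32), H, (w, h))
        return (warped - dst_img.astype(np.float32))[block_mask].ravel()

    return least_squares(residual, coeffs0, method='lm').x
```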

Experiments
We conducted several experiments to validate our methods. We first show results using our nonparametric acceleration method. We then show results that illustrate our super-resolution method and follow those with results from off-center matting. In addition to visual results, we provide an analysis of the running times of our extensions.

Figure 3 compares results from our nonparametric method with a result from the original defocus matting paper. We captured this data set, of a person with blond hair, using three 640 × 480 (Bayer pattern) video cameras. As in McGuire et al.,2 we demosaiced the data to 320 × 240 resolution for computing mattes. For this sequence, we used the first five frames for training the model, and here show the result of the 22nd frame. For a comparison of the longer video sequence, see our video results at http://graphics.ucsd.edu/papers/exploring_defocus_matting. Our mixed optimization approach produces a result with comparable quality to the original method in a little less than half the time (3.5 minutes per frame compared to 8.5 minutes with the original optimization process).

Figure 3 also shows our super-resolution approach applied to this same data set. For the first result, we push our super-resolution method and downsample the A-Cam data to 160 × 120 and then compute the matte at this low resolution. We then upgrade these mattes to 640 × 480 resolution. In the second result, we upgrade a 320 × 240 result to 640 × 480. We show the original low-resolution versions and a result computed natively at 640 × 480 for comparison. Both super-resolution results look sharper and more detailed than the original optimization output, and the upgraded 160 × 120 result (Figure 3f) and the upgraded 320 × 240 result (Figure 3h) are on par with the result computed natively at 640 × 480 (Figure 3i).

Although the results with this original data set are reasonable, some errors are still present. These errors are because of poor geometric alignment and saturation of the foreground- and background-focused cameras. Our nonparametric method doesn't introduce these errors—similar artifacts are present in McGuire et al.'s results.2 It's the presence of these errors that led us to change the beam splitters and build an accurately designed case for holding the camera and beam splitter assembly.

Figure 2 shows a result for a new data set acquired with our improved A-Cam setup. We show the A-Cam images, resulting matte and composite, and super-resolution from 320 × 240 to 640 × 480. The images show a stuffed bear in front of some wood boxes and artificial trees. The improved alignment from our new A-Cam hardware setup lets us pull a much cleaner and more accurate matte, even though the background in this scene has complex structure and high-frequency content.

Our final set of results illustrates off-center matting. We attached the A-Cam to an external off-center camera, as Figure 1b shows. We captured images of a grid in several positions in the filming area and used them to compute the basis homographies between the off-center camera and the A-Cam, as we described earlier. Figure 4 shows an off-center matting result for a stuffed gorilla in front of some wood boxes. In this data set, there is a relatively large (5×) resolution difference between the off-center camera and the A-Cam because the off-center camera is a two-megapixel consumer camera. This object spans a relatively large depth range, thus our per-block geometric alignment was essential. We successfully pulled the alpha matte and, as Figure 5 shows, recovered a significant amount of detail using our super-resolution method.

Figure 3. Nonparametrically accelerated optimization and super-resolution: (a) pinhole image of the A-Cam, (b) the alpha matte at 320 × 240 using full optimization (that is, McGuire et al.'s method2), (c) the alpha matte at 320 × 240 using nonparametrically accelerated optimization, (d) composite of the 320 × 240 nonparametrically accelerated optimization result. Alpha mattes for the region shown in the green box computed at (e) 160 × 120 using full optimization, (f) 160 × 120 upgraded to 640 × 480, (g) 320 × 240 using full optimization, (h) 320 × 240 upgraded to 640 × 480, and (i) 640 × 480 using full optimization.

Figure 4. Off-center matting: (a) original off-center camera image, (b) the alpha matte at the A-Cam resolution, (c) the super-resolution alpha matte, and (d) super-resolution alpha-premultiplied foreground image composited with a blue background.

Figure 6 shows a single frame of a video sequence of an off-center matting result for a person with black hair in an office supply room. We captured this data set using a 640 × 480 video camera mounted next to the A-Cam. We computed mattes at 320 × 240 resolution and then upgraded to 640 × 480 using our super-resolution method. Figure 7 shows a side-by-side comparison of a result using full optimization, mixed optimization, and the nonparametric model alone. We computed the fully optimized result in six minutes, the mixed optimization result in 1.5 minutes, and the nonparametric result in 30 seconds. We used the first five frames of a 60-frame sequence for training. Video results for this data set are available at http://graphics.ucsd.edu/papers/exploring_defocus_matting.

Table 1 compares running times of our two video data sets processed with and without our extensions. This data demonstrates that our pure nonparametric approach provides a reasonably significant speedup over the full optimization approach. A mixed optimization still produces a moderate speedup. The exact value to set for the tolerance for a mixed approach depends on the desired balance of speed versus quality and also on the overall data quality; that is, some inherent lower bound on the error exists, given noise in the data, calibration errors, and so on. Thus, if we set the tolerance below this lower bound, it's equivalent to specifying full optimization. The units of the tolerance are mean-squared difference of image intensity between the observed and predicted images. For the data sets shown here, we set the tolerance equal to 0.003, which corresponds to around a 5-percent error in image intensity.

Although we initially envisioned the super-resolution method as a way to match mixed-resolution cameras, our method can also serve as an acceleration technique when all cameras have the same resolution. First, we downsample all videos to a low resolution, then compute an alpha matte, and then upgrade back to the original resolution. Although this might not provide exactly the same quality as optimizing at the A-Cam's native resolution, for some applications the speed increase might justify the slight loss of quality.

Figure 5. Gorilla with alpha super-resolution: (a) original image, (b) upsampling using bicubic interpolation, and (c) super-resolution using our edge-preserving method.

Figure 6. Off-center video matting with nonparametric acceleration and super-resolution: (a) original off-center camera image, (b) the alpha matte computed at 320 × 240 resolution using our mixed optimization method, (c) the 640 × 480 super-resolution alpha matte, and (d) super-resolution alpha-premultiplied foreground image composited with a blue background.

Figure 7. Nonparametric video matting of the area in the green box in Figure 6b: (a) alpha matte computed using full optimization (5.5-minute computation time), (b) alpha matte computed using our mixed optimization method (1.5 minutes), and (c) alpha matte computed using only the nonparametric model (30 seconds).

Discussion and future work
Our novel defocus matting extensions increase speed and resolution and can be used with traditional consumer cameras; however, some limitations remain. Our nonparametric method's two primary limitations are typical of many learning methods.

The first limitation is that errors present in the training set will propagate through to prediction. Currently, we pick our training set by simply using the first five frames in a sequence. By using this scheme, we let the quality of the optimization for these initial frames dictate our training data's quality, but we don't have any way to ensure that these first five frames are actually good examples. Furthermore, we pick entire frames for training. One straightforward modification would be to include training data on a per-pixel basis only where the absolute error from the optimization is low. A further modification would be to train a classifier to automatically label "good" and "bad" results.

The nonparametric method's second limitation is the memory used by the model, which grows as we update it with new frames. Our current Matlab implementation has an upper limit of about 60 added frames. This number varies, however, with the error threshold and the extent of the video's temporal coherence; that is, the more the frames are alike, the smaller the model can be. One solution to this problem is to compress the model. Recently, researchers have used canonical correlation analysis to reduce nonparametric model size. We're interested in exploring this as an alternative to the traditional PCA-based techniques.

A limitation of our super-resolution method is that it can enhance noise in the input low-resolution alpha matte. Occasionally a small, relatively low, errant alpha value becomes more prominent after super-resolution. In practice, a small amount of postprocessing2 will often remove these errors before super-resolution. Decoupling our super-resolution method from the matte-optimization procedure might also introduce errors relative to computing the matte at high resolution with an integrated super-resolution method. Our motivation was speed, and we considered this additional source of error acceptable. However, one future area of work is to merge these processes without the performance hit of full optimization at high resolution. Another approach for improving our method is to combine it with a nonparametric approach similar in principle to our acceleration method. This seems like a promising direction because example-based methods have been used successfully for super-resolution of traditional images.

Our off-center method is currently limited to working with scenes containing predominantly planar foreground and background objects. Although we can handle some amount of depth variation with our alignment, this is, in general, a limitation. One way to address this is by using more sophisticated alignment techniques that leverage the fact that the off-center primary camera and the pinhole camera essentially provide a stereo pair. Also, although the resolution gap is the most obvious distinction between the consumer-level camera and the A-Cam, other camera setup gaps might be considered, such as a gap in the dynamic range of the two cameras.

Table 1. Per-frame computation times.

Data set | Method | Computation of alpha (min) | Super-resolution to 320 × 240 (min) | Super-resolution to 640 × 480 (min) | Total time for 640 × 480 (min)
Blond hair | 160 × 120 (full optimization) | 1 | 0.5 | 4 | 5
Blond hair | 320 × 240 (full optimization) | 8.5 | N/A | 2 | 10.5
Blond hair | 320 × 240 (mixed optimization) | 3.5 | N/A | 2 | 5.5
Blond hair | 320 × 240 (nonparametric only) | 0.5 | N/A | 2 | 2.5
Blond hair | 640 × 480 (full optimization) | 30 | N/A | N/A | 30
Black hair | 320 × 240 (full optimization) | 5.5 | N/A | 1.5 | 7
Black hair | 320 × 240 (mixed optimization) | 1.5 | N/A | 1.5 | 3
Black hair | 320 × 240 (nonparametric only) | 0.5 | N/A | 1.5 | 2

Figure 8. A failure case: (a) pinhole image, (b) foreground-focused image, (c) background-focused image, (d) the foreground color recovered by the optimization, (e) the recovered background, and (f) the computed alpha matte. Because of the large amount of defocus and lack of significant gradients in the background, there is foreground pollution in the background and vice versa, resulting in an overly smooth outline for the matte and incorrect values above the bear's head and on its left shoulder.

While we've addressed several limitations of defocus matting, other inherent limitations remain. The primary limitation is the presence of local minima in the error function minimized by nonlinear optimization. The final results can be sensitive to the initial guess and the choice of the regularization parameters involved in the process. Thus, the process can often terminate in a local minimum. Figure 8 shows an example of this type of failure. In this result, the optimization terminated with an incorrect low-frequency blur of the foreground color polluting the background color. Because of the large amount of defocus and the relatively smooth background, this error really only violates the pinhole constraint, but is low error given the remaining constraints.

Our nonparametric acceleration method provides a better initial guess than the original process of inpainting the unknown areas and estimating the initial alpha, but it doesn't remove the need for some form of initial guess for generating the training frames. Improving the initial guess (for example, by using single-frame methods such as Poisson matting) could be a fruitful direction for future work. ■

Acknowledgments
We thank John Barnwell for his help building the A-Cam and Amit Agrawal for his insight on edge-preserving regularization. We also thank the anonymous reviewers for their comments. We completed the majority of this work while Neel Joshi was an intern at Mitsubishi Electric Research Labs; he was additionally funded by US National Science Foundation grant DGE-0333451.

References
1. N. Joshi, W. Matusik, and S. Avidan, "Natural Video Matting Using Camera Arrays," ACM Trans. Graphics, vol. 25, no. 3, July 2006, pp. 779-786.
2. M. McGuire et al., "Defocus Video Matting," ACM Trans. Graphics, vol. 24, no. 3, Aug. 2005, pp. 567-576.
3. A. Hertzmann et al., "Image Analogies," Proc. Computer Graphics (Siggraph), ACM Press, 2001, pp. 327-340.
4. B.S. Reddy and B.N. Chatterji, "An FFT-Based Technique for Translation, Rotation, and Scale-Invariant Image Registration," IEEE Trans. Image Processing, vol. 5, no. 8, 1996, pp. 1266-1271.
5. P. Kornprobst, R. Deriche, and G. Aubert, "Nonlinear Operators in Image Restoration," Proc. IEEE Computer Soc. Conf. Computer Vision and Pattern Recognition (CVPR), IEEE CS Press, 1997, pp. 325-330.
6. H.S. Sawhney et al., "Hybrid Stereo Camera: An IBR Approach for Synthesis of Very High Resolution Stereoscopic Image Sequences," Proc. 28th Ann. Conf. Computer Graphics and Interactive Techniques (Siggraph), ACM Press, 2001, pp. 451-460.
7. A. Shashua and S. Avidan, "The Rank-4 Constraint in Multiple View Geometry," Proc. European Conf. Computer Vision (ECCV), Springer-Verlag, 1996, pp. 196-206.


Neel Joshi is a PhD student in the Computer Science and Engineering Department at the University of California, San Diego. His research interests include computer graphics and computer vision, specifically, data-driven graphics, image-based rendering, inverse rendering, appearance modeling, and computational photography and video. Joshi has an MS in computer science from Stanford University. Contact him at [email protected].

Wojciech Matusik is a research scientist at Mitsubishi Electric Research Labs. His primary research lies in computer graphics, data-driven modeling, computational photography, and new display technologies. Matusik has a PhD in electrical engineering and computer science from the Massachusetts Institute of Technology. Contact him at [email protected].

Shai Avidan is a research scientist at Mitsubishi Electric Research Labs. His research interests are in the computer vision field with occasional detours into computer graphics and machine learning. Avidan has a PhD in computer science from the Hebrew University, Jerusalem. Contact him at [email protected].

Hanspeter Pfister is a senior research scientist at Mitsubishi Electric Research Labs. His research is at the intersection of visualization, computer graphics, and computer vision. Pfister has a PhD in computer science from the State University of New York at Stony Brook. He is chair of the IEEE Visualization and Graphics Technical Committee (VGTC). Contact him at [email protected].

William T. Freeman is a professor of electrical engineering and computer science at the Massachusetts Institute of Technology's Artificial Intelligence Lab. His research interests include machine learning applied to computer vision and computer graphics, Bayesian models of visual perception, and interactive applications of computer vision. Freeman has a PhD from the Massachusetts Institute of Technology, where he studied computer vision. Contact him at [email protected].
