
Reconstructing Dense Light Field From Array of Multifocus Images for Novel View Synthesis

Akira Kubota, Member, IEEE, Kiyoharu Aizawa, Member, IEEE, and Tsuhan Chen, Senior Member, IEEE

Abstract—This paper presents a novel method for synthesizing a novel view from two sets of differently focused images taken by an aperture camera array for a scene consisting of two approximately constant depths. The proposed method consists of two steps. The first step is a view interpolation to reconstruct an all-in-focus dense light field of the scene. The second step is to synthesize a novel view by a light-field rendering technique from the reconstructed dense light field. The view interpolation in the first step can be achieved simply by linear filters that are designed to shift different object regions separately, without region segmentation. The proposed method can effectively create a dense array of pin-hole cameras (i.e., all-in-focus images), so that the novel view can be synthesized with better quality.

Index Terms—Blur, image-based rendering (IBR), light-field rendering (LFR), spatial invariant filter, view interpolation.

I. INTRODUCTION

MOST view synthesis methods using multiview images involve the problem of estimating scene geometry [1]. A number of methods have been investigated for solving this problem (e.g., stereo matching [2]); however, it is still difficult to obtain accurate geometries for arbitrary real scenes. Using an inaccurate geometry for the view synthesis would induce visible artifacts in the synthesized novel image. An alternative to such a geometry-based approach is image-based rendering (IBR) [3], [4], which has been studied for view synthesis. IBR does not need to estimate the scene geometry and enables synthesis of photo-realistic novel images, independent of the scene complexity. The idea of IBR is to sample light rays in the scene by capturing multiple images densely enough to create novel views without aliasing artifacts through resampling of the sampled light rays [5]–[7]. The sampling density (i.e., camera spacing density) required for nonaliasing resampling is impractically high, however. To reduce the required sampling density, some geometric information is needed [8]. One problem is that, in practice, it is difficult to realize both of these approaches,

Manuscript received November 3, 2005; revised February 12, 2006. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Benoit Macq.

A. Kubota is with the Department of Information Processing, Tokyo Institute of Technology, Yokohama 226-8502, Japan (e-mail: [email protected]).

K. Aizawa is with the Department of Electrical and Electronic Engineering, University of Tokyo, Tokyo 113-8656, Japan (e-mail: [email protected]).

T. Chen is with the Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213-3890 USA (e-mail: [email protected]).

Color versions of Figs. 1, 2, 4–8, and 10–14 are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIP.2006.884938

that is, accurately obtaining the scene geometry and densely capturing light rays, without special equipment, such as a laser scanner and a plenoptic camera [9]. Therefore, a more practical and effective solution for solving this problem is required.

In most recently presented IBR methods, to solve this problem, research has mainly concentrated on how to accurately obtain the geometric information to reduce the required sampling density. One approach is to find feature correspondences between reference images. This is a conventional approach, but it has recently been improved for application to IBR. With this approach, high-confidence feature points that are required to provide a synthesized view of sufficient quality can be detected and matched. Aliaga et al. [10] presented a robust feature detection and tracking method in which potential features are redundantly labeled in every image and only high-confidence features are used for matching. Siu et al. [11] proposed an image registration method between a sparse set of three reference images, allowing feature matching in a large search area. They introduced an iterative matching scheme based on the confidence level of the extracted feature points in terms of topological constraints that hold between triangles composed of three feature points in the reference images.

Another approach is to estimate a view-dependent depth map at a novel viewpoint based on the color consistency of corresponding light rays or pixel values between the reference images, adopting a concept used in volumetric techniques [12]–[14], such as space sweeping for scene reconstruction. In this approach, the color consistency is measured at every pixel or selected pixels with respect to different hypothetical depths in the scene, and the depth value is estimated to be the depth that gives the highest consistency. The average of the color values with the highest consistency is then rendered at the pixel of the novel image. This approach is very suitable for real-time processing [15]–[18].

In this paper, we present a novel IBR approach that differs from the conventional ones. The concept of our approach is illustrated schematically in Fig. 1. We deal with a scene consisting of foreground and background objects at two approximately constant depths, and we capture two sets of images with different focuses, as reference images, using a 1-D array of real aperture cameras. Unlike the conventional methods using pin-hole cameras, our proposed method uses aperture cameras to capture differently focused images with each camera: one focused on the foreground and the other focused on the background. Our proposed method consists of two steps. In the first step, we interpolate intermediate images that are focused on both objects at densely sampled positions between all of the capturing cameras. In the second step, we synthesize a novel



Fig. 1. Concept of our approach. Assuming a scene containing two objects at different depths, we capture near- and far-focused images at each camera position of a 1-D aperture camera array. In the first step, we densely interpolate all-in-focus intermediate images between the cameras by linear filtering of the captured images. In the second step, we apply LFR to synthesizing novel views using the dense light-field data obtained in the first step.

view by light-field rendering (LFR) [6] using the intermediate images (i.e., the dense light-field data) obtained in the first step. The view synthesis in the second step can be easily achieved with adequate quality only if, in the first step, the light-field data are obtained with sufficient accuracy and density to achieve nonaliasing LFR.

This paper mainly addresses the problem of view interpolation between the capturing cameras in the first step and presents an efficient view interpolation method using spatially invariant linear filters. The reconstruction filters we present in this paper make it possible to shift each object region according to the parallax required for the view interpolation, without region segmentation. For the simple scene we assume here, the conventional vision-based methods such as stereo matching [2] or depth from focus/defocus [19]–[21] can estimate the scene geometry, but they depend on the scene complexity and have a high computational load. In contrast, our approach needs only filtering and is much simpler than such vision-based approaches. In addition, it is independent of the scene complexity as long as the scene consists of two approximately constant depths.

II. RECONSTRUCTING DENSE LIGHT FIELD

A. Problem Description

In the first step of our method, we interpolate intermediate images densely between every camera pair. The view interpolation problem we deal with here is illustrated in Fig. 2. Our goal is to generate an all-in-focus intermediate image that would be captured at an arbitrary position (at the virtual camera in Fig. 2) between the two adjacent cameras nearest to the view position.

Fig. 2. View interpolation problem in the first step of our approach. We interpolate an all-in-focus intermediate image that would be seen from a virtual camera position between the two nearest cameras. We use four reference images for this interpolation: the near-focused and far-focused images captured at each of the two cameras.

We use the four reference images captured with these two cameras: the image at the first camera focused on the foreground; the image at the first camera focused on the background; the image at the second camera focused on the foreground; and the image at the second camera focused on the background (see the example in Fig. 6). The view position of the intermediate image is represented as an intermediate position between the cameras, parameterized by a value that ranges from 0 to 1, with a fixed distance between the adjacent cameras. We assume that the focal lengths of all cameras, including the virtual camera, are the same.

B. Imaging Model

We model the four reference images and the desired intermediate images by a linear combination of foreground and background textures.

Consider the model of the two differently focused images at a given camera. First, we introduce the foreground texture and the background texture that are visible from that camera; we define them in the same way as those in [22]

(1)

(2)

where the ideal all-in-focus image is the image that is supposed to be captured with the camera, the depth map at the camera position denotes the depth value corresponding to each pixel coordinate, and the depths of the foreground and the background objects are two constants. Note that these textures and the depth map are unknown. The two differently focused images can be modeled by the following linear combination of the defined textures:

(3)

where the blurring kernel is the point spread function (PSF), applied by 2-D convolution. We assume that the PSF can be modeled as a Gaussian function

(4)


where the amount of blur is related to the corresponding standard deviation of the Gaussian function [23]. In (3), the same PSF can be used for both defocus regions because the amounts of blur in both regions become the same after correction of the image magnification due to the different focus settings (for details, see [24]). The PSF does not depend on the camera position and can be commonly used for all the reference images, since we assume that the depths of the two objects are constant for all camera positions. In addition, since the amount of blur is estimated in the preprocessing step using our previously presented method [24], the PSF is given.
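The two-layer model of (1)-(4) can be illustrated with a short sketch. It is a minimal illustration under assumed names (the texture arrays, and a Gaussian standard deviation sigma standing in for the amount of blur), using a library Gaussian filter in place of the PSF convolution; it is not the paper's implementation.

import numpy as np
from scipy.ndimage import gaussian_filter

def split_textures(all_in_focus, depth_map, z_foreground):
    """Split an ideal all-in-focus image into foreground and background
    textures as in (1)-(2): each texture keeps the pixels whose depth
    matches its layer and is zero elsewhere."""
    fg_mask = (depth_map == z_foreground)
    tex_fg = np.where(fg_mask, all_in_focus, 0.0)
    tex_bg = np.where(~fg_mask, all_in_focus, 0.0)
    return tex_fg, tex_bg

def render_two_focus(tex_fg, tex_bg, sigma):
    """Model the near- and far-focused images as in (3): the out-of-focus
    layer is blurred by a Gaussian PSF (4); sigma is an assumed
    parameterization of the amount of blur."""
    g_near = tex_fg + gaussian_filter(tex_bg, sigma)   # foreground in focus
    g_far = gaussian_filter(tex_fg, sigma) + tex_bg    # background in focus
    return g_near, g_far

A single sigma blurs both layers here, matching the statement above that one PSF can be shared for both defocus regions after magnification correction.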

The linear imaging model in (3) is not correct for the occluding boundaries, as reported in [25]. However, it has the advantage that it enables us to use convolution operations, which can be represented by product operations in the frequency domain, resulting in a simpler imaging model [see (8)]. Moreover, satisfactory results are obtained with this model, as shown by our experimental results in Section III. The limitations of this model and their effect on the quality of the intermediate images will be tested on real acquired images in Section IV.

For modeling the all-in-focus intermediate image, we have to consider two things: how to model the parallax and how to model the occluded background texture. To model the parallax, we use a combination of textures that are appropriately shifted according to the intermediate position parameter. When the textures visible from the first of the two cameras are used, the intermediate image is modeled by

(5)

where the two disparities are those of the foreground and the background objects between the adjacent cameras. These disparities can be estimated in the preprocessing step. The shift amounts we have to apply to the foreground and the background textures are the respective disparities scaled by the intermediate position parameter.

Similarly, when the textures visible from the second camera are used, the intermediate image at the same position is modeled by

(6)

where the shift amounts to be applied to the foreground and the background textures are defined analogously with respect to the second camera.

To fill in the occluded background in either one of the two images modeled by (5) and (6), we simply take their weighted average, with weights determined by the intermediate position parameter, and finally model the desired intermediate image as

(7)
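A spatial-domain sketch of the parallax and blending model in (5)-(7) follows. The shift amounts (each disparity scaled by the interpolation parameter for the view built from the first camera, and by the complementary parameter with opposite sign for the other) and the blending weights are assumptions consistent with the description above, not the paper's exact expressions.

import numpy as np

def shift_x(img, dx):
    """Shift an image horizontally by a (rounded) number of pixels;
    a toy stand-in for the subpixel shifts implied by the disparities."""
    return np.roll(img, int(round(dx)), axis=1)

def intermediate_from_textures(tex_fg, tex_bg, d_fg, d_bg, delta, from_first=True):
    """Model (5) or (6): shift each layer by its disparity scaled by the
    signed distance from the camera that supplied the textures (assumed)."""
    s = delta if from_first else -(1.0 - delta)
    return shift_x(tex_fg, s * d_fg) + shift_x(tex_bg, s * d_bg)

def blend_intermediate(f_from_first, f_from_second, delta):
    """Model (7): weighted average that lets background occluded in one
    view be filled by texture visible in the other (assumed weights)."""
    return (1.0 - delta) * f_from_first + delta * f_from_second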

C. View Interpolation With Linear Filters in the Frequency Domain

Our goal is to generate the intermediate image in (7) from the four reference images modeled in (3). Note that the PSF and the two disparities are given, but the textures are unknown. In this section, we derive reconstruction filters that can generate the desired image directly from the reference images without estimating the textures (i.e., without region segmentation).

First, consider the problem of generating the image in (5) from the two differently focused images in (3). The same problem was already dealt with in our previous paper [22], in which we presented an iterative reconstruction in the spatial domain. In the present paper, we present a much more efficient method using filters in the frequency domain. We take the Fourier transform of (3) and (5) to obtain those imaging models in the frequency domain, as follows:

(8)

and

(9)

where the two frequency variables denote the horizontal and vertical frequencies, respectively. We use capital-letter functions to denote the Fourier transforms of the corresponding small-letter functions. These imaging models in the frequency domain are simpler than their spatial-domain counterparts because the convolution and shifting operations are transformed into product operations. By eliminating the two unknown texture spectra from (8) and (9), we derive the following sum-of-products formula:

(10)

where the two coefficient functions can be considered as the frequency characteristics of linear filters that are applied to the near-focused and far-focused images in the frequency domain, respectively. Their forms are as follows:

(11)
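To make the elimination step explicit, the algebra can be sketched as follows under an assumed notation (G_n, G_f for the spectra of the near- and far-focused images in (8); F_f, F_b for the unknown texture spectra; H for the Gaussian transfer function; u for the horizontal frequency; Delta d_f, Delta d_b for the shifts in (9)). This is a reconstruction consistent with the text, not necessarily the paper's exact expressions.

\begin{align*}
G_n = F_f + H F_b, \quad G_f = H F_f + F_b
  \;\Longrightarrow\;
  F_f = \frac{G_n - H G_f}{1 - H^2}, \quad
  F_b = \frac{G_f - H G_n}{1 - H^2},\\
F_a = e^{-j 2\pi u \Delta d_f} F_f + e^{-j 2\pi u \Delta d_b} F_b
    = \frac{e^{-j 2\pi u \Delta d_f} - H\, e^{-j 2\pi u \Delta d_b}}{1 - H^2}\, G_n
    + \frac{e^{-j 2\pi u \Delta d_b} - H\, e^{-j 2\pi u \Delta d_f}}{1 - H^2}\, G_f .
\end{align*}

The two fractions multiplying G_n and G_f play the role of the filters in (11); their common denominator 1 - H^2 vanishes at DC, which is the instability discussed next.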

These filter values cannot be determined stably at the zero frequency (i.e., the DC component), since the denominator equals 0 and both limit values of (11) diverge at the DC. This divergence causes visual artifacts in the interpolated image when there is noise or modeling error in the low-frequency components of the reference images. To avoid this problem, regularization is needed based on a constraint or prior information about the intermediate image. In this paper, we propose a frequency-dependent shifting method that is designed to have shift amounts that gradually decrease to zero at DC, as shown in Fig. 3. We do not interpolate the DC component of the intermediate image. This is reasonable given that the low-frequency components, including DC, do not cause many visual artifacts in the image, even if they are not shifted appropriately.

Let frequency-dependent shift amounts be introduced for the foreground and the background shifts, respectively. Using a cosine function, we formulate the foreground shift amount as follows:

(12)

where a threshold frequency defines the range up to which the shift amount changes frequency dependently. We formulate the background shift amount similarly to


Fig. 3. Frequency-dependent shift amount. The shift amount is designed to gradually increase from zero at DC to the full shift Δd at a preset frequency threshold. For frequencies larger than the threshold, it has a constant value of Δd.

the above equation, but using the background disparity instead of the foreground disparity. By using these frequency-dependent shift amounts, we can make the reconstruction filters stable and design them as follows:

(13)
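One plausible realization of the frequency-dependent shift of Fig. 3 is a raised-cosine ramp. The exact function the paper uses is not reproduced here, so the form below (and the name freq_threshold) is an assumption that merely matches the described behavior: zero shift at DC, the full shift beyond the threshold frequency.

import numpy as np

def frequency_dependent_shift(u, v, full_shift, freq_threshold):
    """Shift amount that rises smoothly from 0 at DC to full_shift at the
    radial frequency freq_threshold and stays constant above it (an assumed
    raised-cosine ramp consistent with Fig. 3).  Near DC the ramp is
    quadratic in the radial frequency, which keeps the stabilized filters
    finite, consistent with the 0.5 limit stated in the text."""
    r = np.sqrt(u ** 2 + v ** 2)
    t = np.minimum(r, freq_threshold) / freq_threshold
    return full_shift * 0.5 * (1.0 - np.cos(np.pi * t))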

Note that both limit values of (13) converge to 0.5 at the DC.

Second, consider the problem of generating the Fourier transform of (6) using the two reference images captured at the other camera. Similarly to the first case, we can derive the following formula:

(14)

with the stable reconstruction filters, which are designed to be

(15)

using the corresponding frequency-dependent shift amounts. In this case, the foreground shift amount is formulated to be

(16)and is formulated in the same manner. Note that both limitvalues of (15) also converge to 0.5 at the DC.

Finally, by substituting the resultant (10) and (14), using the stable reconstruction filters in (13) and (15), into the Fourier transform of (7), we derive the filtering method for generating the desired intermediate image in the frequency domain

(17)

Fig. 4. Block diagram of the proposed view interpolation method using filters.

By taking the inverse Fourier transform of (17), we can obtain the all-in-focus intermediate image. This suggests that interpolation of the desired intermediate image can be achieved simply by linear filtering of the reference images. The reconstruction filters we designed in (13) and (15) consist of the PSF and shifting operators (i.e., exponential functions) that are spatially invariant; therefore, view interpolation without region segmentation is possible. We can form these filters because the amount of blur and the disparities are known or can be estimated.

Fig. 4 shows a block diagram of the proposed view interpolation method based on (17). We use filtering in the frequency domain for faster calculation. First, we take the Fourier transform (FT) of the four reference images and multiply them by the reconstruction filters in (13) and (15). We then sum up all the images after the multiplication and finally take the inverse FT of the summed image. It should be noted that region segmentation is not performed in this diagram.
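As a concrete but simplified reading of Fig. 4, the sketch below filters the four reference images of one color channel in the frequency domain and sums the results. It uses the unstabilized filter form sketched after (11), with constant shifts and a small constant added to the denominator in place of the frequency-dependent stabilization of (13) and (15), a Gaussian transfer function for the PSF, and assumed blending weights; the argument names are illustrative, so this is an illustration of the data flow rather than the exact filters.

import numpy as np

def gaussian_transfer(u, v, sigma):
    """Fourier transform of a unit-area Gaussian PSF with std. dev. sigma."""
    return np.exp(-2.0 * (np.pi ** 2) * (sigma ** 2) * (u ** 2 + v ** 2))

def interpolate_view(g_near_0, g_far_0, g_near_1, g_far_1,
                     d_fg, d_bg, sigma, delta, eps=1e-3):
    """Assumed frequency-domain pipeline after Fig. 4: filter the four
    reference images and sum.  eps stands in for the stabilization."""
    h, w = g_near_0.shape
    u = np.fft.fftfreq(w)[None, :]      # horizontal frequency (cycles/pixel)
    v = np.fft.fftfreq(h)[:, None]      # vertical frequency
    H = gaussian_transfer(u, v, sigma)
    denom = 1.0 - H ** 2 + eps

    def pair_filters(shift_fg, shift_bg):
        e_fg = np.exp(-2j * np.pi * u * shift_fg)
        e_bg = np.exp(-2j * np.pi * u * shift_bg)
        k_near = (e_fg - H * e_bg) / denom   # applied to the near-focused image
        k_far = (e_bg - H * e_fg) / denom    # applied to the far-focused image
        return k_near, k_far

    # View built from camera 0 textures, shifted forward by delta * disparity.
    k_n0, k_f0 = pair_filters(delta * d_fg, delta * d_bg)
    F_a = k_n0 * np.fft.fft2(g_near_0) + k_f0 * np.fft.fft2(g_far_0)
    # View built from camera 1 textures, shifted back by (1 - delta) * disparity.
    k_n1, k_f1 = pair_filters(-(1.0 - delta) * d_fg, -(1.0 - delta) * d_bg)
    F_b = k_n1 * np.fft.fft2(g_near_1) + k_f1 * np.fft.fft2(g_far_1)

    # Blend as in (7) (assumed weights) and return to the spatial domain.
    F = (1.0 - delta) * F_a + delta * F_b
    return np.real(np.fft.ifft2(F))

With the disparities estimated in Section III and the interpolation parameter set to 0.5, a routine of this kind produces a center view of the sort compared in Section IV.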

Using the proposed view interpolation method, we can generate an all-in-focus intermediate image at any position between two adjacent cameras. Therefore, we can virtually create images identical to those that would be captured with densely arranged pin-hole cameras. The created set of images is considered as the dense light field, from which novel views from arbitrary positions can be rendered with sufficient quality by LFR. This will be demonstrated in Section III.

III. EXPERIMENT

A. Experimental Setup

Our experimental setup is shown in Fig. 5. We assume X (horizontal) and Z (depth) coordinates as a 2-D world coordinate system. In our experiment, we captured ten sets of two differently focused images (20 reference images in total) at ten different positions on the horizontal X axis with the F-number fixed at 2.4. When capturing the reference images, we used a horizontal motion stage to move a single digital camera (Nikon D1) in the X-axis direction at equal intervals of 8 mm from 0 to 72 mm. Let us refer to the camera at each position by its index number. The test scene we set up in this experiment consisted of foreground objects (balls and a cup containing a pencil) and


Fig. 5. Configuration of scene and camera array used in the experiment.

background objects (textbooks), with the two object groups located at approximately constant depths on the Z axis.

B. Preprocessing Steps

We needed three preprocessing steps: 1) image registration; 2) estimation of the blur parameter; and 3) estimation of the foreground and background disparities. Image registration is needed to correct the difference in displacements and magnification between the two differently focused images acquired at every camera position. The magnification difference is caused by the difference in focus between the images. We used the image registration method presented in [24], which is based on a hierarchical matching technique. Parameter estimation in steps 2) and 3) can be performed via camera calibration using captured images of a test chart. Once these parameters are estimated, we do not need to perform the estimation again and can use the same parameters for other scenes so long as they consist of objects at the same two depths. In this experiment, we estimated these parameters by the following image processing using the captured reference images of the test scene. This is because we used a single camera, not a camera array, and it was difficult to correctly set each camera position to the same position used during calibration.

For the blur estimation based on the captured images, we also used our previously presented method [24], which estimates the amount of blur in the foreground and background regions independently. The basic idea used in the method is as follows. A blurred version of the near-focused image was created by applying the PSF with a candidate blur amount and compared with the far-focused image to evaluate the similarity (here we used the absolute difference) between them at each pixel. After evaluating the similarity for various blur amounts, the blur amount that gave the maximum similarity at a pixel was estimated to be the foreground blur. In principle, such a pixel belongs to the foreground region. Finally, the estimated blur amounts were averaged in this region. The same idea was applied to estimating the background blur. Note that this blur estimation could not precisely segment each region but could effectively estimate the blur only in the regions with texture patterns, excluding uniform regions and object boundary regions (see [24]).
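The matching idea just described can be sketched as follows; the function name, the candidate grid, and the reliability threshold are illustrative choices, not the procedure of [24].

import numpy as np
from scipy.ndimage import gaussian_filter

def estimate_layer_blur(g_near, g_far, candidate_blurs, max_error=2.0):
    """Blur the near-focused image with each candidate amount, compare it
    with the far-focused image by absolute difference, keep the best
    candidate at each pixel, and average over pixels that match well
    (textured regions).  max_error assumes 8-bit intensities."""
    errors = np.stack([np.abs(gaussian_filter(g_near, b) - g_far)
                       for b in candidate_blurs])    # (n_candidates, H, W)
    best_index = np.argmin(errors, axis=0)           # best candidate per pixel
    best_blur = np.asarray(candidate_blurs, dtype=float)[best_index]
    reliable = errors.min(axis=0) < max_error
    return best_blur[reliable].mean()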

TABLE I
ESTIMATED AMOUNT OF BLUR AND DISPARITIES

Estimated blur amounts at every camera position, together with their averages (ave.) and standard deviations (std.), are shown in Table I. The two estimated values were not exactly the same in most cases, and the estimated values showed some variation (a std. of 0.29 pixels). This is because of estimation errors and depth variation on the surface of the foreground objects. Nevertheless, we could use the average of all the blur estimates, namely 3.6 pixels, as the representative blur amount for forming the PSF. This approximation was possible because the proposed view interpolation method is robust to blur estimation errors, which we confirmed experimentally, as described in Section IV-B.

Disparity estimation was performed using template matching with subpixel accuracy. In this experiment, we manually specified a block region to be used as a template. When estimating the disparity of the foreground, we specified a region of a face pattern drawn at the center of the cup in one near-focused image, and we found the horizontal position of the template that yielded the best approximation of the corresponding region in all other near-focused images. When estimating the disparity of the background, we specified the region containing the word “SIGNAL” on the textbook and carried out template matching with all other far-focused images. The estimated foreground and background disparities between adjacent cameras are shown in the last two columns of Table I. Since the maximum deviation was within 0.5 pixels for both estimated values for all camera pairs, we determined the final estimates to be the averaged values of 18.5 and 11.4 pixels, respectively.
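Template matching with subpixel refinement might look like the sketch below; the parabola fit around the best integer offset is a common refinement and is an assumption here, not necessarily the exact procedure used.

import numpy as np

def match_template_x(template, strip):
    """Slide a template horizontally over an image strip of the same height
    and return the horizontal offset of the best match, refined to subpixel
    accuracy by fitting a parabola through the three neighboring errors."""
    th, tw = template.shape
    strip = strip[:th, :]
    n = strip.shape[1] - tw + 1
    ssd = np.array([np.sum((strip[:, x:x + tw] - template) ** 2)
                    for x in range(n)])              # sum of squared differences
    x_best = int(np.argmin(ssd))
    if 0 < x_best < n - 1:
        a, b, c = ssd[x_best - 1], ssd[x_best], ssd[x_best + 1]
        denom = a - 2.0 * b + c
        if denom > 0:
            x_best = x_best + 0.5 * (a - c) / denom  # subpixel correction
    return x_best

The per-interval disparity can then be obtained, for example, by dividing the offset found in a distant image by the number of camera intervals separating it from the template image.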

C. Reconstructed Dense Intermediate Images

We constructed the reconstruction filters using the estimated parameters of blur amount and disparities described in the previous subsection. Based on the proposed filtering method in Fig. 4, we interpolated nine all-in-focus intermediate images between every pair of adjacent cameras, for a total of 91 images, including the all-in-focus images at the same positions as the cameras (that is, at the interpolation parameter values of 0 and 1). The interpolation parameter was varied from 0 to 1 in equal increments of 0.1 for each interpolation. The threshold frequency was set to 0.01 for all experiments; this value was empirically determined. The same reconstruction


Fig. 6. Examples of captured real images. (a) A near-focused image at X = 16 mm. (b) A far-focused image at X = 16 mm. (c) A near-focused image at X = 24 mm. (d) A far-focused image at X = 24 mm.

process was carried out in each color channel of the intermediate images.

Examples of captured images are shown in Fig. 6: Fig. 6(a) and (b) are, respectively, the near-focused and far-focused images captured with the camera at X = 16 mm, and Fig. 6(c) and (d) are those captured with the camera at X = 24 mm. These images are 24-bit color images of 280 × 256 pixels. We observed different disparities in the regions of the foreground and background objects; these were measured to be 18.9 pixels and 11.8 pixels, respectively, as shown in Table I. Although the distance between the cameras was set to 8 mm, this was not small enough to interpolate the intermediate images with adequate quality using LFR. This is because the difference between the disparities was about eight pixels, which is larger than the ideal maximum difference of two pixels that is allowed for nonaliasing LFR [8].

Fig. 7 shows all-in-focus intermediate images interpolated by our proposed method from the reference images in Fig. 6 for interpolation parameter values of 0, 0.2, 0.4, 0.6, 0.8, and 1, which correspond to the camera positions of 16, 17.6, 19.2, 20.8, 22.4, and 24 mm, respectively. In all interpolated images, it can be seen that both foreground and background regions appear in focus and are properly shifted. Expanded regions (64 × 64 pixels in size) including occluded boundaries in these images are shown in the right column of Fig. 7. At the occluded boundaries, although the imaging model we used is not correct, no significant artifacts are visible. Instead of artifacts, background textures (letters) that should have been hidden are partially visible because the object regions are shifted and blended in our method. This transparency of the boundaries does not produce noticeable artifacts in the final image obtained in the novel view synthesis, seen in Fig. 10. Filling in the hidden background precisely is generally a difficult problem even for state-of-the-art vision-based approaches. Estimation errors in the segmentation

Fig. 7. Intermediate images interpolated by the proposed method from the four reference images in Fig. 6.

cause unnaturally distorted or broken boundaries, which would be highly visible in a novel image sequence created by successively changing the novel viewpoint. Our proposed


Fig. 8. Example of epipolar plane images (EPIs) of the reference images and the densely interpolated images by the proposed method. Here, white regions in (a) and (b) indicate regions of value zero. The proposed method can generate a dense EPI in (c) from the sparse EPIs in (a) and (b) by linear filtering. (a) EPI of the near-focused reference images. (b) EPI of the far-focused reference images. (c) EPI of the interpolated images.

method and the conventional vision-based approaches both need a smoothing process on the boundaries to prevent visible artifacts. In our approach, blending the textures serves this purpose.

Fig. 8(a) and (b) show epipolar plane images (EPIs) constructed from the captured near-focused and far-focused reference images, respectively. Each horizontal line corresponds to a scan-line image (a fixed row is chosen) of each reference image, and its vertical coordinate indicates the corresponding camera position. These EPIs are very sparse, and each EPI has only ten scan lines, according to the number of cameras. An EPI of the intermediate images interpolated by the proposed method is shown in Fig. 8(c), where each horizontal line in the EPI corresponds to a scan-line image of each interpolated image. The interpolated EPI is much denser (91 horizontal scan lines) compared with the EPIs of the reference images, and it contains straight, sharp line textures. This indicates that the all-in-focus intermediate images were generated accurately by the proposed method. The slope of the lines indicates the depth of the objects. The striped regions with the larger slope correspond to the foreground regions, and the striped regions with the smaller slope, which are partially occluded by the foreground regions, correspond to the background regions. Some conventional view-interpolation methods adopting interpolation of EPIs [26]–[28] attempt to detect stripe regions with the same slope. In contrast, our method does not need such region detection but only spatially invariant filtering, under the specific condition that two differently focused images are captured at each camera position for a scene with two depths.
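An EPI of the kind shown in Fig. 8 is simply the same scan line taken from every image in the stack and arranged by camera position; a minimal sketch:

import numpy as np

def build_epi(image_stack, row):
    """Stack the chosen scan line (row index) of every image, ordered by
    camera position, into an epipolar plane image (EPI).  For the 91
    interpolated views this yields an EPI with 91 lines."""
    return np.stack([img[row, :] for img in image_stack], axis=0)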

D. Synthesizing a Novel View by LFR

This section describes the second step of our approach, i.e., novel view synthesis by LFR using the interpolated intermediate images obtained in the first step. We also show some experimental results.

LFR is performed based on pixel matching between the obtained intermediate images, assuming that the scene lies on a plane (see Fig. 9) [9]. Let us represent the interpolated images by their positions along the camera axis. For instance,

Fig. 9. LFR for synthesizing the novel image.

an image at a fractional index position corresponds to the view interpolated between two adjacent cameras. Let a novel image be specified by its viewpoint and the assumed plane. Each pixel value of the novel image is considered as an intensity value of the light ray passing through the novel viewpoint and the pixel position. We find the intersection (indicated in Fig. 9) of the novel light ray with the assumed plane and project it onto every imaging plane of the interpolated images to obtain the corresponding pixel positions, that is, the corresponding light rays, from which we can determine the two nearest light rays. The intensity value of the novel light ray is finally synthesized as a weighted average of the intensity values of the two nearest light rays based on the following linear interpolation:

(18)

where the interpolation weight is determined by the coordinate of the novel light ray on the camera axis, as shown in Fig. 9. The optimal depth of the plane [8] is


Fig. 10. Examples of the novel views synthesized by LFR from the densely interpolated images. (X, Z) are the coordinates of the novel viewpoint along the horizontal and depth axes. In (a) and (b), a zooming effect is demonstrated by changing only the depth position of the viewpoint. In (c) and (d), a parallax effect is demonstrated by changing only the horizontal position of the viewpoint. (a) (X, Z) = (20, −300); (b) (X, Z) = (20, 300); (c) (X, Z) = (20, 100); (d) (X, Z) = (70, 100).

determined so that the novel image has the least artifacts due to pixel mismatching over the entire image and is given by

(19)
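A simplified sketch of the rendering step follows. The bilinear weighting between the two nearest rays in (18) and the harmonic-mean form of the constant focal-plane depth, which is the choice suggested by the plenoptic-sampling analysis of [8], are reconstructions consistent with the text rather than a verbatim copy of (18) and (19); the function and argument names are illustrative.

import numpy as np

def optimal_plane_depth(z_near_layer, z_far_layer):
    """Constant focal-plane depth between the two scene layers (assumed
    harmonic-mean form following the plenoptic-sampling analysis of [8])."""
    return 2.0 / (1.0 / z_near_layer + 1.0 / z_far_layer)

def render_ray(plane_samples, camera_positions, x_on_camera_axis):
    """(18), simplified: linearly interpolate between the two views whose
    camera positions bracket the novel ray.  plane_samples[i] holds the
    pixel value that view i contributes for this ray (i.e., the value at
    the projection of the focal-plane intersection into view i)."""
    camera_positions = np.asarray(camera_positions, dtype=float)
    j = int(np.clip(np.searchsorted(camera_positions, x_on_camera_axis),
                    1, len(camera_positions) - 1))
    x0, x1 = camera_positions[j - 1], camera_positions[j]
    w = (x_on_camera_axis - x0) / (x1 - x0)
    return (1.0 - w) * plane_samples[j - 1] + w * plane_samples[j]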

Fig. 10 shows examples of synthesized novel view images from various viewpoints. The coordinate pair (X, Z) denotes the novel viewpoint. In order to clearly see the perspective effects of the novel view images, we changed the viewpoint along the horizontal and depth axes independently. In Fig. 10(a) and (b), a zooming (close-up) effect is demonstrated by changing only the depth position of the viewpoint while the horizontal position was fixed at 20 mm. We can see that the image in Fig. 10(b) is not merely a magnified version of the image in Fig. 10(a); the foreground objects (the cup and pencil) were magnified more than the background object in Fig. 10(b). Fig. 10(c) and (d) demonstrate the parallax effect by changing only the horizontal position of the viewpoint. Different shifting (displacement) effects were observed on the objects, and the hidden background texture in one image was visible in the other image.

IV. DISCUSSION

We have shown that we can effectively interpolate all-in-focus intermediate images between two neighboring cameras simply by filtering the reference images captured with these cameras with different focuses. In this section, we experimentally evaluate the performance of our view interpolation method in terms of its accuracy and its robustness against errors in estimating the blur amount.

A. Performance Evaluation

We tested the accuracy of the proposed method for the case of interpolating an image at the center of two camera positions. At each of the two camera positions, near- and far-focused images were captured with an F-number of 2.4 for the same scene used in the experiments. To measure the accuracy of the interpolated image, a ground truth image is required. In this test, we captured an image at the center of the camera positions with a small aperture (F-number: 13) as the ground truth image so that the image was focused on both regions at different depths.

Fig. 11 shows the ground truth image and interpolated images obtained by the proposed method with the interpolation parameter set to 0.5 for different distances between the cameras, namely, 4, 8, 12, and 16 mm. For comparison, we also generated the center image by LFR based on the focal plane at the optimal depth of 1230 mm calculated by (19). The reference images used for LFR were the all-in-focus images generated at the two camera positions by the proposed method with the parameter set to 0 and 1. A comparison between the results in Fig. 11 shows the advantage of our method over LFR. In the images obtained by our method, both object regions are in focus and properly shifted without noticeable artifacts. In contrast, both regions in the images interpolated by LFR suffer from blur or ghosting artifacts. These artifacts are caused by incorrect pixel correspondences due to the large distance between the cameras. Our method can prevent such incorrect pixel correspondences by properly shifting each object region. In addition, this shifting operation can be achieved simply by spatially invariant filtering of the reference images and does not require region segmentation.

Mean-square errors (MSEs) in all color channels were computed as a quality measure between the interpolated images and the ground truth image, as shown in Fig. 12. MSEs in the LFR method were larger than those in the proposed method. The quality of images interpolated by our method was sufficiently good for all cases, whereas that of the images interpolated by LFR was much lower, becoming worse as the distance between the cameras increased.
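The quality measure is a per-channel mean-square error against the ground-truth image; for completeness, a minimal version:

import numpy as np

def per_channel_mse(estimate, ground_truth):
    """Mean-square error computed independently in each color channel
    (images are assumed to be H x W x C arrays)."""
    diff = estimate.astype(np.float64) - ground_truth.astype(np.float64)
    return (diff ** 2).mean(axis=(0, 1))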

In the interpolated images, when the distance was 16 mm (bottom left in Fig. 11), occluded boundaries (the pencil in the foreground and letters in the background) appeared transparent due to shifting and blending of different textures by our method. Although our proposed method could not prevent these transparency effects, the blending of shifted textures in our approach had the effect of canceling out errors in the two intermediate images modeled in (5) and (6) before blending. These images are shown in Fig. 13. There are noticeable artifacts in color value because the reconstruction filters for generating these images had much larger values at the lower frequencies and amplified the errors there. However, the finally interpolated image at the bottom left of Fig. 11, which is a blend (average) of the intermediate images, has less visible errors, showing the error-canceling effect.

B. Effect of Blur Estimation Error

In this section, we examine the robustness of our proposed view interpolation method against estimation error of the blur


Fig. 11. Comparison between the proposed method and the LFR method. (a) The ground truth image, which is the image actually captured at the center of the two cameras with a small aperture (F-number: 13). (b) Interpolated images by (left) the proposed method and (right) the LFR method at the center of the two cameras for various distances between the cameras, namely (from top to bottom) 4, 8, 12, and 16 mm.

amount for the same scene. Setting various amounts of blur from 1.0 to 6.0 pixels, we interpolated intermediate images at the center of two cameras placed 8 mm apart and measured

Fig. 12. Performance evaluation of the proposed method and the conventional LFR for center view interpolation. MSEs were evaluated between the interpolated view images and the actually captured all-in-focus image (the ground truth image) at the center position.

Fig. 13. The two generated intermediate images at the center position (parameter value 0.5) that are averaged to produce the final interpolated image shown at the bottom left of Fig. 11.

MSEs between the interpolated images and the ground truth image. The measured MSEs of the three color channels are shown in Fig. 14. The MSEs in the red and green channels were minimized at a blur amount that is not the same as the estimated blur amount of 3.6 pixels described in Section III-B. This error is, however, tolerable and does not degrade the quality of the intermediate image, as shown in Section IV-A. This is because the MSE in the red channel is smaller than 30 over the wide range of blur amounts from 3.2 to 5.3 pixels, where the MSEs in the other color channels are lower than 23. Therefore, our method is robust against blur estimation errors. This result also shows that our method allows the scene to have some degree of depth variation so long as the amount of blur caused in the region is within that range. The reason why the MSE in the blue channel becomes lower as the blur amount increases is the low blue component in the images.

V. CONCLUSION

This paper has presented a novel two-step approach for IBR. Unlike the conventional IBR methods, the presented method uses aperture cameras to capture reference images with different focuses at different camera positions for a scene consisting of two depth layers. The first step is a view interpolation for densely generating all-in-focus intermediate images


Fig. 14. Error evaluation of the proposed method for the center view interpolation under various blur amounts. MSEs were evaluated between the interpolated view image and the ground truth image at the center position.

between the camera positions. The obtained set of intermediate images can be used as the dense light-field data for high-quality view synthesis via LFR in the second step. We showed that the view interpolation can be achieved effectively by spatially invariant filtering of the reference images, without requiring region segmentation. The presented view interpolation method works well even for the case of a camera spacing sparser than that required for nonaliasing LFR.

REFERENCES

[1] M. Oliveira, “Image-based modeling and rendering techniques: A survey,” RITA—Rev. Inf. Teor. Apl., vol. IX, pp. 37–66, 2002.

[2] U. R. Dhond and J. K. Aggarwal, “Structure from stereo: A review,” IEEE Trans. Syst., Man, Cybern., vol. 19, no. 6, pp. 1489–1510, Jun. 1989.

[3] C. Zhang and T. Chen, “A survey on image-based rendering—representation, sampling and compression,” EURASIP Signal Process.: Image Commun., vol. 19, pp. 1–28, 2004.

[4] H.-Y. Shum, S. B. Kang, and S.-C. Chan, “Survey of image-based representations and compression techniques,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 11, pp. 1020–1037, Nov. 2003.

[5] E. H. Adelson and J. R. Bergen, “The plenoptic function and the elements of early vision,” in Computat. Models of Visual Processing, M. S. Landy and J. A. Movshon, Eds. Cambridge, MA: MIT Press, 1991, pp. 3–20.

[6] M. Levoy and P. Hanrahan, “Light field rendering,” in Proc. ACM SIGGRAPH, 1996, pp. 31–42.

[7] H.-Y. Shum and L.-W. He, “Rendering with concentric mosaics,” in Proc. ACM SIGGRAPH, 1999, pp. 299–306.

[8] J.-X. Chai, X. Tong, S.-C. Chan, and H.-Y. Shum, “Plenoptic sampling,” in Proc. ACM SIGGRAPH, 2000, pp. 307–318.

[9] R. Ng, M. Levoy, M. Bredif, G. Duval, M. Horowitz, and P. Hanrahan, “Light field photography with a hand-held plenoptic camera,” Tech. Rep. CSTR 2005-02, Comput. Sci. Dept., Stanford Univ., Stanford, CA, 2005.

[10] D. G. Aliaga, T. Funkhouser, D. Yanovsky, and I. Carlbom, “Sea of images,” IEEE Comput. Graph. Appl., vol. 23, no. 6, pp. 22–30, Jun. 2003.

[11] A. M. K. Siu and E. W. H. Lau, “Image registration for image-based rendering,” IEEE Trans. Image Process., vol. 14, no. 2, pp. 241–252, Feb. 2005.

[12] R. T. Collins, “Space-sweep approach to true multi-image matching,” in Proc. CVPR, 1996, pp. 358–363.

[13] S. M. Seitz and C. R. Dyer, “Photorealistic scene reconstruction by voxel coloring,” in Proc. CVPR, 1997, pp. 1067–1073.

[14] K. Kutulakos and S. M. Seitz, “A theory of shape by space carving,” Int. J. Comput. Vis., vol. 38, no. 3, pp. 199–218, Jul. 2000.

[15] R. Yang, G. Welch, and G. Bishop, “Real-time consensus-based scene reconstruction using commodity graphics hardware,” in Proc. Pacific Conf. Computer Graphics and Applications, 2002, pp. 225–235.

[16] I. Geys, T. P. Koninckx, and L. V. Gool, “Fast interpolated cameras by combining a GPU based plane sweep with a max-flow regularisation algorithm,” in Proc. Int. Symp. 3D Data Processing, Visualization and Transmission, 2004, pp. 534–541.

[17] K. Takahashi, A. Kubota, and T. Naemura, “A focus measure for light field rendering,” in Proc. Int. Conf. Image Processing, 2004, pp. 2475–2478.

[18] C. Zhang and T. Chen, “A self-reconfigurable camera array,” in Proc. Eurographics Symp. Rendering, 2004, pp. 243–254.

[19] J. Ens and P. Lawrence, “An investigation of methods for determining depth from focus,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 15, no. 5, pp. 97–107, May 1993.

[20] S. K. Nayar and Y. Nakagawa, “Shape from focus: An effective approach for rough surfaces,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 16, no. 8, pp. 824–831, Aug. 1994.

[21] N. Rajagopalan, S. Chaudhuri, and U. Mudenagudi, “Depth estimation and image restoration using defocused stereo pairs,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 11, pp. 1521–1525, Nov. 2004.

[22] K. Aizawa, K. Kodama, and A. Kubota, “Producing object based special effects by fusing multiple differently focused images,” IEEE Trans. Circuits Syst. Video Technol., vol. 10, no. 2, pp. 323–330, Feb. 2000.

[23] M. Subbarao, T.-C. Wei, and G. Surya, “Focused image recovery from two defocused images recorded with different camera settings,” IEEE Trans. Image Process., vol. 4, no. 12, pp. 1613–1627, Dec. 1995.

[24] A. Kubota and K. Aizawa, “Reconstructing arbitrarily focused images from two differently focused images using linear filters,” IEEE Trans. Image Process., vol. 14, no. 11, pp. 1848–1859, Nov. 2005.

[25] N. Asada, H. Fujiwara, and T. Matsuyama, “Seeing behind the scene: Analysis of photometric properties of occluding edges by the reversed projection blurring model,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 2, pp. 155–167, Feb. 1998.

[26] R. Hsu, K. Kodama, and K. Harashima, “View interpolation using epipolar plane images,” in Proc. IEEE Int. Conf. Image Processing, 1994, pp. 745–749.

[27] A. Katayama, K. Tanaka, T. Oshino, and H. Tamura, “A viewpoint dependent stereoscopic display using interpolation of multi-viewpoint images,” in Proc. SPIE Int. Conf. Stereoscopic Displays and Virtual Reality Systems II, 1995, vol. 2409, pp. 11–20.

[28] M. Droese, T. Fujii, and M. Tanimoto, “Ray-space interpolation constraining smooth disparities based on loopy belief propagation,” in Proc. Int. Workshop Systems, Signals, Image Processing, 2004, pp. 247–250.

[29] A. Isaksen, L. McMillan, and S. J. Gortler, “Dynamically reparameterized light fields,” MIT-LCS-TR-778, 1999.

Akira Kubota (S’01–M’04) received the B.E. degree in electrical engineering from Oita University, Oita, Japan, in 1997, and the M.E. and Dr.E. degrees in electrical engineering from the University of Tokyo, Tokyo, Japan, in 1999 and 2002, respectively.

He is currently an Assistant Professor with the Department of Information Processing at the Tokyo Institute of Technology, Yokohama, Japan. He was a Postdoctoral Research Associate at the High-Tech Research Center of Kanagawa University, Japan, from August 2004 to March 2005. He was also a

Research Fellow with the Japan Society for the Promotion of Science from April 2002 to July 2004. From September 2003 to July 2004, he visited the Advanced Multimedia Processing Laboratory, Carnegie Mellon University, Pittsburgh, PA, as a Research Associate. His research interests include image representation, visual reconstruction, and image-based rendering.


Kiyoharu Aizawa (S’83–M’88) received the B.Eng., M.Eng., and Dr.Eng. degrees in electrical engineering from the University of Tokyo, Tokyo, Japan, in 1983, 1985, and 1988, respectively.

He is currently a Professor with the Department of Electrical Engineering and the Department of Frontier Informatics, University of Tokyo. He was a Visiting Assistant Professor at the University of Illinois at Urbana-Champaign, Urbana, from 1990 to 1992.

His current research interests are in image coding, image processing, image representation, computational image sensors, multimedia capture, and retrieval of life log data.

Dr. Aizawa is a member of IEICE, ITE, and ACM. He received the 1987 Young Engineer Award, the 1990 and 1998 Best Paper Awards, the 1991 Achievement Award, the 1998 Fujio Frontier Award from ITE Japan, the 1999 Electronics Society Award from the IEICE Japan, and the IBM Japan Science Prize 2002. He serves as Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY and is on the editorial boards of the Journal of Visual Communications and the IEEE Signal Processing Magazine. He has served for many national and international conferences, including IEEE ICIP, and he was the General Chair of the 1999 SPIE VCIP. He is a member of the Technical Committee on Image and Multidimensional Signal Processing and the Technical Committee on Visual Communication of the SP and CAS Societies, respectively.

Tsuhan Chen (S’90–M’93–SM’04) received the B.S. degree in electrical engineering from the National Taiwan University, Taipei, Taiwan, R.O.C., in 1987, and the M.S. and Ph.D. degrees in electrical engineering from the California Institute of Technology, Pasadena, in 1990 and 1993, respectively.

He has been with the Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, since October 1997, where he is currently a Professor, and where he directs the Advanced Multimedia Processing Laboratory. From

August 1993 to October 1997, he was with the Visual Communications Research Department, AT&T Bell Laboratories, Holmdel, NJ, and later at AT&T Labs-Research, Red Bank, NJ, as a Senior Technical Staff Member and then a Principal Technical Staff Member. He has co-edited a book titled Multimedia Systems, Standards, and Networks. He has published more than 100 technical papers and holds 15 U.S. patents. His research interests include multimedia signal processing and communication, implementation of multimedia systems, multimodal biometrics, audio-visual interaction, pattern recognition, computer vision and computer graphics, and bioinformatics.

Dr. Chen helped create the Technical Committee on Multimedia Signal Processing, as the founding Chair, and the Multimedia Signal Processing Workshop, both in the IEEE Signal Processing Society. His endeavor later evolved into his founding of the IEEE TRANSACTIONS ON MULTIMEDIA and the IEEE International Conference on Multimedia and Expo, both joining the efforts of multiple IEEE societies. He was appointed the Editor-in-Chief of the IEEE TRANSACTIONS ON MULTIMEDIA for 2002 to 2004. Before serving as the Editor-in-Chief of the IEEE TRANSACTIONS ON MULTIMEDIA, he also served on the Editorial Board of the IEEE Signal Processing Magazine and as an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, the IEEE TRANSACTIONS ON IMAGE PROCESSING, the IEEE TRANSACTIONS ON SIGNAL PROCESSING, and the IEEE TRANSACTIONS ON MULTIMEDIA. He received the Benjamin Richard Teare Teaching Award in recognition of excellence in engineering education from Carnegie Mellon University. He received the Charles Wilts Prize for outstanding independent research in electrical engineering leading to a Ph.D. degree at the California Institute of Technology. He was a recipient of the National Science Foundation CAREER Award, titled “Multimodal and Multimedia Signal Processing,” from 2000 to 2003. He is a member of the Phi Tau Phi Scholastic Honor Society.