
Enhancing and Experiencing Spacetime Resolution with Videos and Stills

Ankit Gupta, University of Washington

Pravin Bhat, University of Washington

Mira Dontcheva, Adobe Systems

Oliver Deussen, University of Konstanz

Brian Curless, University of Washington

Michael Cohen, Microsoft Research

Abstract

We present solutions for enhancing the spatial and/or temporal resolution of videos. Our algorithm targets the emerging consumer-level hybrid cameras that can simultaneously capture video and high-resolution stills. Our technique produces a high spacetime resolution video using the high-resolution stills for rendering and the low-resolution video to guide the reconstruction and the rendering process. Our framework integrates and extends two existing algorithms, namely a high-quality optical flow algorithm and a high-quality image-based-rendering algorithm. The framework enables a variety of applications that were previously unavailable to the amateur user, such as the ability to (1) automatically create videos with high spatiotemporal resolution, and (2) shift a high-resolution still to nearby points in time to better capture a missed event.

1. Introduction

Still cameras are capturing increasingly high-resolution images. In contrast, video resolution has not increased at nearly the same rate. This disparity in the two mediums is not surprising, as capturing high-resolution images at a high frame rate is a difficult and expensive hardware problem. Video produces enormous amounts of data, which must be captured quickly using smaller exposure times.

While newer digital SLR cameras can also capture high quality video, these cameras are still priced at the higher end of consumer level cameras and leave a significant resolution gap between photographs and videos. For example, the Nikon D90 can capture 12 MP images at 4 fps and 720p HD video (0.9 MP per frame) at 24 fps. The problem of capturing videos with high spatiotemporal resolution is further compounded by the constant push in the consumer market for miniaturization and integration of cameras with other products (e.g., cell phones, PDAs). Hence, high spatial resolution imagery is often incompatible with high frame rate imagery, especially in the case of consumer level cameras, due to bandwidth and storage constraints.

In the face of these realities, we investigate software solutions that can increase the spatial and/or temporal resolution of imagery recorded by hybrid cameras capable of capturing a combination of low-resolution video at medium frame rates (15-30 fps) and high-resolution stills at low frame rates (1-5 fps). Such hybrid cameras have been previously proposed [10] and several have been built as research prototypes [18, 1, 27, 16]. Commercial hybrid cameras are currently available (e.g., Sony HDR-HC7, Canon HV10, and Canon MVX330), and while these cameras have some limitations,1 newer models hold substantial promise; e.g., Fujifilm has announced the development of the Finepix 3D System,2 which has two synchronized lenses and sensors capable of capturing stills and video simultaneously.

We propose a framework for combining the output of hybrid cameras. Our framework combines and extends two existing algorithms, namely a high-quality optical flow algorithm [22] and a high-quality image-based-rendering algorithm [2], to enable a variety of applications that were previously unavailable to the amateur user, including:

• automatically producing high spatiotemporal resolution videos using low-resolution, medium-frame-rate videos and intermittent high-resolution stills,

• time-shifting high-resolution stills to nearby points in time to better capture a missed event.

We also describe a simple two-step flow algorithm that improves the quality of long-range optical flow in our setting (i.e., low-resolution video plus a few high-resolution stills). We demonstrate results using a simulated hybrid camera – simulated by downsampling existing video spatially and separately sub-sampling it temporally – and using our own prototype still+video camera.

1. The number of stills that these cameras can capture during a video session is currently limited to a maximum of three at one fps. In addition, we found the high-resolution stills produced by these cameras not to be significantly higher in quality than the video frames.

2. http://www.dpreview.com/news/0809/08092209fujifilm3D.asp



[Figure 1: system diagram. System inputs: the low-resolution video and high-resolution frames. Correspondence Estimation produces forward-flow and reverse-flow warped videos; Image Based Rendering (graph-cut reconstruction followed by spacetime fusion) outputs the enhanced video.]

Figure 1. Our system consists of two main components. We use a two-step optical flow process to compute correspondences between pixels in a low-resolution frame and nearby high-resolution stills. We employ the image-based rendering algorithm of Bhat et al. [2] to render the final result.

Results of a user study on time-shifting suggest that people would be quite interested in using this technology to capture better imagery. In the remaining sections, we describe related work (Section 2), present our framework and algorithms (Section 3), explore applications (Section 4), and conclude with a summary and discussion (Section 5).

2. Related work

The problem of enhancing the resolution of images and videos has received much attention. In this section we review some of the previous approaches to this problem.

Spatial and temporal super-resolution using multiple low-resolution inputs. These techniques perform spatial super-resolution by first aligning multiple low-resolution images of a scene at sub-pixel accuracy to a reference view [13, 25, 11]. The aligned images can then be used to reconstruct the reference view at a higher spatial resolution. The same idea can be applied in the temporal domain by using sub-frame-rate shifts across multiple low temporal resolution videos of a scene to perform temporal super-resolution [26, 31, 32].

The techniques in this class rely on the assumption that the low-resolution inputs are undersampled and contain aliasing. However, most cameras bandlimit the high frequencies in a scene to minimize such aliasing, which severely limits the amount of resolution enhancement that can be achieved using such techniques. Lin et al. [17] have studied the fundamental limits of these techniques and found their magnification factor to be at most 1.6 in practical conditions. Also, their reliance on static scenes or multiple video cameras currently limits their practicality. We overcome these limitations by using data from a hybrid camera.

Temporal resolution enhancement. For temporal resolution enhancement, most techniques [8, 7, 33, 9] compute motion correspondences using optical flow and perform a weighted average of motion-compensated frames assuming linear motion between correspondences. However, since the intermediate frame is generated as a weighted average of the two warped images, any errors in the correspondences can result in artifacts such as ghosting, blurring and flickering in the final result. We use a similar technique for generating correspondences, but employ a graph-cut compositing and a spacetime gradient-domain compositing process to reduce these artifacts.

Learning-based techniques. These regression-based techniques [12] learn a mapping between a patch of upsampled low-resolution pixels and the corresponding high-resolution pixel at the center of the patch to create a dictionary of patch pairs. The high-resolution image is synthesized using the patch of low-resolution pixels surrounding every pixel in the input to infer the corresponding high-resolution pixel from the dictionary. Often, the synthesis also incorporates a smoothness prior to avoid artifacts that commonly arise from relying on regression alone. Bishop et al. [3] proposed a similar technique for enhancing videos, in which patch pairs from the previously enhanced video frame are also added to the dictionary, which helps temporal coherence. While using dictionaries of examples is a general technique that can "hallucinate" high frequencies, it can also lead to artifacts that are largely avoidable with hybrid cameras, where the corresponding high-frequency information is often captured in a different frame.

Combining stills with videos. Our technique can best be described as reconstructing the high-resolution spacetime video using a few high-resolution images for rendering and a low-resolution video to guide the reconstruction and the rendering. Bhat et al. [2] and Schubert et al. [24] proposed a similar approach to enhance low-resolution videos of a static scene by using multi-view stereo to compute correspondences between the low-resolution video and the high-resolution images. In contrast, our method uses optical flow to compute correspondences and can therefore handle dynamic scenes as well. Also, Schubert et al. [24] did not enforce any priors for temporal coherence across the generated video.

Sawhney et al. [23] use a stereo camera which captures two synchronized videos from different viewpoints, one video at the desired high resolution and another at a low resolution. They use stereo-based image-based rendering techniques to increase the spatial resolution of the second sequence by combining the two streams. Compared to our method, this method is more restrictive as it needs to capture one video at high spacetime resolution and it requires multiple synchronized cameras.

MPEG video coding works on a similar domain by storing keyframes at full resolution and using block-based motion vectors to predict the intermediate frames. Our system produces results significantly better than simply using an MPEG-based frame prediction scheme since we use per-pixel flow estimation and a high quality rendering algorithm that suppresses artifacts in regions with faulty flow estimates.

The method proposed by Watanabe et al. [30] works on similar input data as ours (i.e., a low-resolution video with a few intermittent high-resolution frames). Each high-resolution frame is used to propagate the high frequency information to the low-resolution frames using a DCT fusion step. Nagahara et al. [19] also take a similar approach but use feature tracking instead of motion compensation. These methods generate each video frame independently and therefore are prone to temporal incoherence artifacts. See Section 4.1.1 for a comparison to our method.

3. System overview

Figure 1 gives a visual description of our system for performing spatial and/or temporal resolution enhancement of videos using high-resolution stills when available.

3.1. Spatial resolution enhancement

The input consists of a stream of low-resolution frames with intermittent high-resolution stills. We upsample the low-resolution frames using bicubic interpolation to match the size of the high-resolution stills and denote them by f_i. For each f_i, the nearest two high-resolution stills are denoted as S_left and S_right.
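
As a concrete illustration, the upsampling step amounts to a single bicubic resize. The helper below is a minimal sketch; the function name and the use of OpenCV are our own choices, not part of the original system:

```python
import cv2

def upsample_to_still(frame, still):
    """Bicubic upsampling of a low-resolution frame to the resolution of the
    high-resolution stills; the result plays the role of f_i."""
    h, w = still.shape[:2]
    return cv2.resize(frame, (w, h), interpolation=cv2.INTER_CUBIC)
```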

Computing motion correspondences. The system estimates motion between every f_i and the corresponding S_left and S_right. Unfortunately, computing correspondences between temporally distant images of a dynamic scene is a hard problem. Most optical flow algorithms can compute correspondences for motion involving only tens of pixels. In our case the system needs to compute correspondences between a high-resolution still and a low-resolution frame that might contain objects displaced over hundreds of pixels.

One approach is to compute optical flow directly from the high-resolution stills, S_left or S_right, to the upsampled frames f_i. This approach, however, produces errors because of the long range motion and the differences in image resolution. The flow quality can be improved by first filtering S_left and S_right to match the low resolution of the video frames. We will denote these filtered images by f_left and f_right respectively. This improves matching, but the flow algorithm is still affected by errors from the long range motion. Instead of computing long range flow, one could compute pairwise optical flow between consecutive video frames and sum the flow vectors to estimate correspondences between distant frames. This approach performs better, but flow errors between consecutive frames tend to accumulate.

Our approach is to use a two-step optical flow process. First, we approximate the long range motion by summing the forward and backward optical flow between adjacent low-resolution frames. This chains together the flow from f_left forward to each frame until f_right, and similarly from f_right backward in time. Then, we use these summed flows to initialize a second optical flow computation from f_left to f_i and from f_right to f_i. The summed motion estimation serves as initialization to bring long range motion within the operating range of the optical flow algorithm and reduces the errors accumulated from the pairwise sums. See Figure 4 for a comparison between our method for computing correspondences and the alternatives mentioned above.
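
The sketch below illustrates the two-step idea under simplifying assumptions: OpenCV's Farneback flow stands in for the flow algorithm we actually use, the helper names are hypothetical, and the frames are single-channel images already filtered/upsampled to a common resolution.

```python
import cv2
import numpy as np

def pairwise_flow(a, b, init=None):
    # Dense flow from a to b; Farneback is only a stand-in for the flow of [22].
    flags = cv2.OPTFLOW_USE_INITIAL_FLOW if init is not None else 0
    return cv2.calcOpticalFlowFarneback(a, b, init, 0.5, 4, 21, 5, 7, 1.5, flags)

def chain(flow_ab, flow_bc):
    # Compose a->b with b->c: sample b->c where each pixel of a lands in b.
    h, w = flow_ab.shape[:2]
    gx, gy = np.meshgrid(np.arange(w, dtype=np.float32),
                         np.arange(h, dtype=np.float32))
    mx, my = gx + flow_ab[..., 0], gy + flow_ab[..., 1]
    return flow_ab + cv2.remap(flow_bc, mx, my, cv2.INTER_LINEAR)

def two_step_flow(frames):
    """frames[0] is f_left (or f_right for the backward pass) and
    frames[-1] is the upsampled target frame f_i."""
    # Step 1: sum pairwise flows along the frame sequence.
    acc = pairwise_flow(frames[0], frames[1])
    for a, b in zip(frames[1:-1], frames[2:]):
        acc = chain(acc, pairwise_flow(a, b))
    # Step 2: one long-range flow computation, seeded with the summed flow.
    return pairwise_flow(frames[0], frames[-1], acc.astype(np.float32))
```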

In our two-step process, we employ the optical flow algorithm of Sand and Teller [22]. Their algorithm combines the variational approach of Brox et al. [5] with Xiao et al.'s [34] method for occlusion handling. We used Sand's original implementation of the algorithm and its default parameter settings for all of our experiments.

Graph-cut compositing. Once the system has computed correspondences from S_left to f_i and S_right to f_i, it warps the high-resolution stills to bring them into alignment with f_i, thus producing two warped images, w_left and w_right. Then it reconstructs a high-resolution version of f_i using patches from w_left and w_right. The reconstruction is computed using a multi-label graph-cut optimization with a metric energy function [4]. Each pixel in f_i is given a label from three candidates: w_left, w_right, and f_i. We use the standard energy function used for graph-cut compositing, with a data cost that is specialized for our problem and the smoothness cost proposed by Kwatra et al. [15]. Kwatra's smoothness cost encourages the reconstruction to use large coherent regions that transition seamlessly from one patch to another.


Our data cost encourages the reconstruction to prefer labels that are likely to produce a high-resolution reconstruction while trying to avoid artifacts caused by errors in the correspondences.

The formal definition of our data cost function D for computing the cost of assigning a given label l to a pixel p is as follows:

$$
D(p, l) =
\begin{cases}
c & \text{if } l = f_i \\
\infty & \text{if } w_l(p) \text{ is undefined} \\
D_c(p, l) + D_f(p, l) + D_d(l, f_i) & \text{otherwise}
\end{cases}
$$

$$
\begin{aligned}
D_c(p, l) &= \lVert w_l(p) - f_i(p) \rVert \\
D_f(p, l) &= 1 - \operatorname{motion\_confidence}(w_l(p)) \\
D_d(l, f_i) &= \frac{\lvert \operatorname{frame\_index}(w_l) - i \rvert}{\lvert \operatorname{frame\_index}(w_{\text{right}}) - \operatorname{frame\_index}(w_{\text{left}}) \rvert}
\end{aligned}
$$

Here, c is the fixed cost for assigning a pixel to the low-resolution option (i.e., f_i); w_l is the warped image corresponding to the label l; D_c encourages color consistency between a pixel and its label; D_f factors in the confidence of the motion vector that was used to generate the pixel w_l(p); and D_d favors labels that are closer in temporal distance to the current frame number i. All examples in this paper were generated by setting c to 0.3. The pixel channel values are in the range [0..1], and the confidence values (also in the range [0..1]) for the motion vectors are generated by Sand's optical flow algorithm in the process of computing the correspondences. The confidence value for a motion vector to a pixel can also be understood as a probabilistic occlusion map, where a low confidence value means that there is a high probability that the pixel is occluded. We use a threshold of 0.1 on this map to determine occluded pixels, and w_l is considered undefined at those locations.
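
A per-pixel version of this data cost might look like the following sketch; the dictionary keys and helper names are ours, and only the quantities named above (the warped stills, the flow confidences, the frame indices, and the constant c = 0.3) are assumed.

```python
import numpy as np

C_LOW_RES = 0.3    # fixed cost c for keeping the upsampled low-resolution pixel
OCC_THRESH = 0.1   # below this confidence, w_l(p) is treated as undefined

def data_cost(label, p, f_i, warped, conf, still_idx, i, i_left, i_right):
    """D(p, l) for label in {'left', 'right', 'low_res'}.
    Pixel colors and flow confidences are assumed to lie in [0, 1]."""
    if label == 'low_res':
        return C_LOW_RES
    if conf[label][p] < OCC_THRESH:           # probable occlusion
        return np.inf
    d_c = np.linalg.norm(warped[label][p] - f_i[p])          # color consistency
    d_f = 1.0 - conf[label][p]                                # motion confidence
    d_d = abs(still_idx[label] - i) / abs(i_right - i_left)   # temporal distance
    return d_c + d_f + d_d
```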

Spacetime fusion. When each individual frame in the video has been reconstructed using the graph-cut compositing step described above, the resulting video has high spatial resolution but it suffers from the types of artifacts common to videos reconstructed using pixel patches – that is, the spatial and temporal seams between the patches tend to be visible in the result. These spatial seams can often be mitigated using the 2D gradient domain compositing technique described by Perez et al. [20]. However, the temporal seams that arise due to errors in the motion vectors and exposure/lighting differences in the high-resolution stills can be difficult to eliminate.

We use the spacetime fusion algorithm proposed by Bhat et al. [2], which is a 3D gradient domain compositing technique that can significantly reduce or eliminate both spatial and temporal seams. Spacetime fusion takes as input the spatial gradients of the high-resolution reconstruction and the temporal gradients of the low-resolution video (computed along flow lines) and tries to preserve them simultaneously. Thus, the temporal coherence captured in the low-resolution video is reproduced in the final high-resolution result, while preserving the high spatial resolution information as well as possible. We can assign relative weights to the spatial and temporal gradients. Using only spatial gradients leads to high spatial but poor temporal quality. Using only temporal gradients leads to too much blurring in the spatial domain. In all of our experiments for spatiotemporal resolution enhancement, we set the temporal gradient constraint weight to 0.85 and thus the spatial gradient constraint weight is 0.15. The reader is referred to Bhat's spacetime fusion paper [2] for further details.
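
Schematically, and in notation of our own choosing (H is the output video, R the graph-cut reconstruction, L the upsampled low-resolution video, and u_t the flow from frame t to t+1), the fusion step can be viewed as a weighted least-squares problem of roughly this form; see [2] for the actual formulation:

$$
\min_{H}\;
\lambda_s \sum_{p,\,t} \bigl\lVert \nabla_{xy} H(p,t) - \nabla_{xy} R(p,t) \bigr\rVert^2
\;+\;
\lambda_t \sum_{p,\,t} \Bigl( \bigl[ H(p+u_t(p),\,t{+}1) - H(p,t) \bigr]
- \bigl[ L(p+u_t(p),\,t{+}1) - L(p,t) \bigr] \Bigr)^2,
\qquad \lambda_t = 0.85,\;\; \lambda_s = 0.15 .
$$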

3.2. Temporal resolution enhancement

The temporal resolution of any video can be increased given good flow estimation between frames. To increase the temporal resolution of a video by some factor, we insert the appropriate number of intermediate frames between existing frames. To estimate flow vectors between neighboring frames (we'll call these neighbors "boundary frames" for the rest of this section) and the new intermediate frame, we assume that the motion varies linearly between the frames. The system simply divides the flow across the boundary frames evenly between new intermediate frames (e.g., with three intermediate frames, the flow to the first intermediate frame is 1/4 of the flow between boundary frames, 1/2 to the second and 3/4 to the third intermediate frame). Then, the two boundary frames are warped to the appropriate point in time and composited using the graph-cut compositing step described in Section 3.1 to construct the intermediate frame. Note that in this compositing step, the data cost D is defined as the sum of D_f and D_d only. D_c is not used since it depends on the original low-resolution frame, which does not exist.

Occlusions in the scene cause holes in the reconstruction. Previously, we used the low-resolution frames, f_i, to fill these holes. Now, we use spacetime fusion to remove these holes (and other artifacts) by assuming all spatial gradients inside holes are null, which allows surrounding information to be smoothly interpolated within holes. For the spacetime fusion step, the system needs temporal gradients between the new intermediate frames. Just as we assumed motion varies linearly between video frames to compute the intermediate flow vectors, we assume temporal color changes vary linearly along flow lines. To estimate the temporal gradient between two intermediate frames we divide the temporal gradient between the boundary frames by one more than the number of new intermediate frames.
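
The linear subdivision of flow and temporal gradients is simple enough to state directly; the sketch below (function name ours) shows the fractions used when inserting k intermediate frames:

```python
def subdivide_flow_and_gradient(flow, temp_grad, k):
    """Insert k intermediate frames between two boundary frames.
    flow: dense flow between the boundary frames (motion assumed linear).
    temp_grad: temporal gradient between the boundary frames along flow lines.
    Returns the per-intermediate-frame flow and the per-interval temporal gradient."""
    # e.g., k = 3 gives flow fractions 1/4, 1/2, 3/4.
    flows = [flow * (j / (k + 1)) for j in range(1, k + 1)]
    # The boundary-to-boundary gradient is split evenly across the k + 1 intervals.
    grads = [temp_grad / (k + 1)] * (k + 1)
    return flows, grads
```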


4. Applications and analysis

In this section we explore several consumer-level applications that could benefit from the ability to combine low-resolution video with high-resolution stills to produce videos with high spatiotemporal resolution.

4.1. Spatial resolution enhancement

We first explore enhancing the spatial resolution of videos containing complex scenes, non-rigid deformations, and dynamic illumination effects. As mentioned in the introduction, commercially available hybrid cameras do not yet capture stills at the spatial and temporal resolution required by our system. To evaluate our approach we use two types of datasets – simulated and real.

We simulated the output of a hybrid camera by downsampling high-resolution videos and using high-resolution frames from the original video at integer intervals. Figure 2 shows results on three such datasets. Please see the supplementary video for video results.

We also created a few more datasets by simultaneously capturing a scene using a camcorder and a digital SLR placed in close proximity. The camcorder captured the scene at 0.3 megapixel resolution at 30 fps, while the SLR captured the scene at six megapixel resolution at three fps. For our experiments, we need to align these two streams in color, space and time. First, we match the color statistics of the two streams in LAB color space [21]. Then we compensate for differences in camera positions and fields of view by computing a homography between a photograph and a corresponding video frame; we apply this homography to all frames in order to spatially align them with the photographs. Finally, for temporal alignment of the data streams, we compute SIFT features for the photographs and frames and formulate a simple dynamic programming problem that matches the photographs with the frames using the SIFT features while maintaining their temporal ordering. Figure 3 shows results on these datasets. We provide the corresponding video results in the supplementary material.
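
As an illustration of the color and spatial alignment steps (the temporal dynamic-programming step is omitted), the sketch below uses OpenCV; the function names are hypothetical and the exact procedure in our pipeline may differ.

```python
import cv2
import numpy as np

def match_color_lab(frame_bgr, ref_bgr):
    """Transfer per-channel LAB mean/std of the reference still onto a video frame
    (in the spirit of the color transfer of Reinhard et al. [21])."""
    f = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    r = cv2.cvtColor(ref_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    f = (f - f.mean((0, 1))) / (f.std((0, 1)) + 1e-6) * r.std((0, 1)) + r.mean((0, 1))
    return cv2.cvtColor(np.clip(f, 0, 255).astype(np.uint8), cv2.COLOR_LAB2BGR)

def homography_to_still(frame_bgr, still_bgr):
    """Estimate a homography from SIFT matches between one frame and one still;
    the same homography is then applied to every frame of the video."""
    gray_f = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    gray_s = cv2.cvtColor(still_bgr, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create()
    kf, df = sift.detectAndCompute(gray_f, None)
    ks, ds = sift.detectAndCompute(gray_s, None)
    matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(df, ds)
    src = np.float32([kf[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([ks[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    return H

def warp_frame(frame_bgr, H, still_shape):
    # Warp a frame into the still's coordinate frame.
    return cv2.warpPerspective(frame_bgr, H, (still_shape[1], still_shape[0]))
```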

4.1.1 Qualitative and quantitative analysis

In this subsection we provide some qualitative and quantitative analysis of our method for spatial resolution enhancement.

As discussed in Section 3.1 there are a number of techniques for computing the motion correspondences between the boundary and intermediate frames. In Figure 4 we compare these techniques using a challenging scenario that includes water spray and fast motion.

Figure 2. The left column shows the low-resolution input video frame. The right column shows our result. From top to bottom, the three datasets use a downsampling factor of 12 with a high-resolution sampling rate of 3 fps, a factor of 8 at 6 fps, and a factor of 8 at 3 fps. We suggest zooming in to see improvements at the actual scale of the images.

Figure 3. Spatial resolution enhancement results for hybrid data captured using a two-camera setup. The left column shows a result frame for each of the two data sets; (A) shows zoomed-in parts from the low-resolution input frame, and (B) shows the corresponding parts of the result frame. Zoom in to see the resolution difference more clearly.


Figure 4. Qualitative comparisons of our system to alternate approaches (panels A-E; each panel is shown alongside the low-resolution frame and a result frame). (A) Computing flow directly between high-resolution boundary stills and low-resolution intermediate frames produces ghosting and blurring artifacts due to the resolution difference and long range motion. (B) Computing flow between low-resolution versions of the boundary stills and low-resolution intermediate frames results in similar artifacts due to long range motion. (C) Summed-up pairwise optical flow to estimate the motion between distant frames performs better but still suffers from motion trails. (D) and (E) use our two-step optical flow approach but different rendering styles. (D) Taking a weighted average of the warped boundary frames results in tearing artifacts. (E) Our rendering approach produces a video that is visually consistent and has relatively few artifacts. We suggest zooming in to see improvement at the actual scale of the images. Note that (A)-(C) and (E) use our graph-cut + spacetime fusion rendering, and (D) and (E) use our two-step optical flow process.

Figure 5. (A) Variation of PSNR with respect to the spatial downsampling factor of the low-resolution frames (high-resolution sampling rate fixed at 3 fps). (B) Variation of PSNR with respect to the sampling interval of the high-resolution keyframes, in frames (downsampling factor fixed at 12). Each plot compares warp + graph-cut + gradient fusion (our method), warp + graph-cut, warp + average, and the low-resolution video.

Figure 6. Comparison with the DCT-based frequency fusion method of Watanabe et al. [30] (left: our result; right: Watanabe et al.'s result). One can clearly see the ringing and blocking artifacts in regions where motion correspondences are incorrect. Zoom in to see the differences more clearly.

We improve the spatial resolution by a factor of 12 in each dimension and show a qualitative comparison between our two-step optical flow approach (Figure 4E) and three alternatives (Figure 4A-C). We also compare our rendering technique (Figure 4E) to a naive morphing-based composite (Figure 4D). Our approach, while not perfect, has fewer artifacts than any of the alternatives. The artifacts appear in regions of the video where optical flow fails. Optical flow assumes brightness constancy, and errors can arise when this assumption is violated. A notable example is the singer's face in Figure 2, where the flickering flame illuminates her face; the resulting flow errors produce a smooth warping artifact in the rendered result. Occlusions and large motion also cause errors in optical flow. In these regions our algorithm copies information from the low-resolution frame during the graph-cut compositing. Any residual error is distributed across the video using spacetime fusion in order to further reduce the artifacts.

Figure 5 shows a quantitative analysis of our spatially enhanced results by measuring the overall peak signal-to-noise ratio (PSNR) of the result video with respect to the original high-resolution video. PSNR is widely used as a compression quality metric for images and videos. In this analysis we use the dolphin sequence shown in Figure 4. We explore PSNR variation with respect to the downsampling factor (Figure 5A) and the sampling interval of the high-resolution stills (Figure 5B) used to simulate the hybrid camera data. Figure 5A shows that as the resolution of the input video decreases, the PSNR also decreases. This correlation is not surprising, as the resolution of the input video will invariably affect the quality of the motion correspondences. Figure 5B shows that as the sampling interval of the high-resolution stills increases, i.e., there are fewer high-resolution stills, the PSNR also decreases. These figures also compare the performance of warping the boundary frames and then (1) performing a weighted blend of the warped frames, (2) creating a graph-cut composite, or (3) creating a graph-cut composite followed by gradient domain integration (our method). The figures show that our method outperforms the other methods.
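
For reference, the PSNR used in this analysis is the standard definition; a minimal version (assuming frames normalized to [0, 1]) is:

```python
import numpy as np

def psnr(result, reference, peak=1.0):
    """Peak signal-to-noise ratio, in dB, between a result frame (or video)
    and the ground-truth high-resolution reference."""
    diff = np.asarray(result, np.float64) - np.asarray(reference, np.float64)
    mse = np.mean(diff ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```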


We now compare our system to Watanabe et al.'s system [30] for increasing the spatial resolution of a low-resolution stream using periodic high-resolution frames through frequency spectrum fusion. We implemented their approach and show the comparison in Figure 6 and in the supplementary video. Their method processes each intermediate frame separately and does not ensure temporal coherence. By comparison, the spacetime fusion step in our framework adds coherence and mitigates artifacts due to bad motion correspondences. Our graph-cut based image compositing uses the flow confidence for assigning appropriate labels to pixels. Watanabe's frequency spectrum fusion technique fuses the high frequency content everywhere, leading to artifacts where motion correspondences are bad. Further, their use of block DCTs introduces ringing and blocking artifacts.

4.2. Time shift imaging

Although advances in digital cameras have made it easier to capture aesthetic images, photographers still need to know when to press the button. Capturing the right instant is often elusive, especially when the subject is an exuberant child, a fast-action sporting event, or an unpredictable animal. Shutter delay in cameras only exacerbates the problem. As a result, users often capture many photographs or use the motor drive setting of their cameras when capturing dynamic events.

Given the input from a hybrid camera, our system can help alleviate this timing problem. We assume the hybrid camera is continuously capturing a low-resolution video and periodically capturing high-resolution stills. When the user "takes a picture," the camera stores the video interval between the last and next periodic high-resolution stills and also three high-resolution stills: the two periodic captures and the one triggered by the user, which falls at some arbitrary moment between the periodic stills. Using our spatial enhancement approach, we can propagate the high-resolution information from the high-resolution stills to the surrounding low-resolution frames, thus producing a very short high-resolution video around the high-resolution still captured by the user. This high-resolution image collection enables users to choose a different high-resolution frame than the one originally captured. We envision that the ability to shift a still backward or forward in time will make it easier to capture that "blowing out the candles" moment, or that "perfect jumpshot." Figure 7 shows a result for this application.

To assess the utility of time-shifting, we performed a user study. In the study, 17 users were shown 22 videos and, for each video, were asked to indicate with a keystroke when they would like to take a snapshot of the action.

Figure 7. Images (A) and (B) show high-resolution frames captured by the hybrid camera. Image (C) shows an intermediate low-resolution video frame. Note that the eyes of the baby are closed in (A) and the dog's snout is far away in (B). A more photographic moment occurs in (C), where the eyes are open and the snout is close to the face. Our system generates a high spatial resolution frame for (C), shown in (D), by flowing and compositing high-resolution information from (A) and (B).

Afterwards, they were given the opportunity to replay each video with a time slider to indicate which frame they had intended to capture. On average, our participants intended to select a frame different from their snapshots 85.7% of the time. Further, informal comments from these users revealed that they are looking forward to a time-shifting feature in their camera to help them capture the moment of their choice.

4.3. Temporal resolution enhancement

As mentioned in Section 3.2, by assuming linear motion between consecutive video frames, our system can insert an arbitrary number of intermediate frames to enhance the temporal resolution of a video. We show some results in the supplementary video. Most previous methods for temporal resolution enhancement [28] take this same approach. Our method differs in the rendering phase.

Existing techniques use weighted averaging of warped frames to hide artifacts resulting from bad correspondences. In comparison, our rendering method (i.e., graph-cut compositing plus spacetime fusion) focuses on preserving the overall shape of objects and hence leads to stroboscopic motion in regions where the motion correspondences are extremely bad. This can be observed for the motion of the player's hand in the cricket clip (in the supplementary video). Therefore, the artifacts of our technique are different from those of previous techniques in regions where flow fails. Our technique may be preferred for enhancing videos that involve objects like human faces, where distortion and tearing artifacts would be jarring.


On the other hand, the distortion and tearing artifacts common in previous methods look like motion blur for certain objects, which in turn makes their results look temporally smoother for regions with bad flow. Like most previous methods, our method does not remove motion blur present in the input video for fast moving objects, which can seem perceptually odd when these fast moving objects are seen in slow motion.

We can also combine the spatial and temporal steps of our approach to produce spatiotemporally enhanced videos. We show several examples of temporal and spatiotemporal resolution enhancement in the supplementary video. The ability to insert intermediate frames also allows users to speed up or slow down different parts of a video using a simple curve-based interface; a demo of this is also shown in the supplementary video.

4.4. Computational cost

We compute pairwise optical flow for a 640x480 frame in 3 minutes with a 3.4 GHz processor and 4 GB RAM. Doing forward and backward computation in each of the two steps requires 4 optical flow computations per frame. In practice, we compute optical flows in parallel on multiple machines. The graph-cut compositing for each frame takes around 30 seconds. The spacetime fusion is performed on the whole video using a simple conjugate gradient solver, and the average time per frame is also around 30 seconds. We expect that this performance could be dramatically improved with multi-grid solvers and GPU acceleration [6, 14, 29].
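
To give a sense of the kind of linear system involved, the sketch below solves a small 2D gradient-domain system with SciPy's conjugate gradient solver; it is only illustrative of the solver choice, not the actual spacetime fusion system (which is 3D and includes flow-dependent temporal terms).

```python
import numpy as np
from scipy.sparse import diags, identity, kron
from scipy.sparse.linalg import cg

def laplacian_1d(n):
    # 1D Laplacian with implicit Dirichlet boundaries; symmetric positive definite.
    return diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n))

def gradient_domain_solve(divergence, maxiter=2000):
    """Solve L x = b on an h-by-w grid, where b is the divergence of the
    target gradient field; the fusion step solves a (much larger) analogue."""
    h, w = divergence.shape
    L = kron(identity(h), laplacian_1d(w)) + kron(laplacian_1d(h), identity(w))
    x, _ = cg(L.tocsr(), divergence.ravel(), maxiter=maxiter)
    return x.reshape(h, w)
```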

5. Conclusion

We have demonstrated the power of combining an optical flow algorithm with an image-based-rendering algorithm to achieve numerous enhancements to the spatial and temporal resolution of consumer-level imagery. We hope that our work will encourage camera manufacturers to provide more choices of hybrid cameras that capture videos and stills simultaneously. The combination of good hybrid cameras and software solutions like ours can further bridge the quality gap between amateur and professional imagery.

Currently the capabilities of our framework depend on the quality of motion correspondences produced by the optical flow algorithm. Unlike previous work, our method fails gracefully in regions where flow fails, by defaulting to the low-resolution video in those regions. One improvement would be to fill in detail with a learning-based super-resolution approach [12], using the intermittent high-resolution stills as training data. Improving the flow algorithm would also help, of course; as motion correspondence algorithms improve, we will be able to apply our framework to a broader set of scenarios, such as videos with larger and faster motion. We also envision using our framework to increase the temporal resolution of high-resolution stills captured in motor-drive on an SLR. Additionally, the capability to generate high spacetime resolution videos from hybrid input could possibly be used as a form of video compression (i.e., storing just the hybrid video and performing decompression by synthesis only when needed).

Acknowledgments

We thank Peter Sand for providing the implementation of his optical flow algorithm. We would also like to thank Eli Shechtman and the reviewers for their insightful comments. This work was supported by NVIDIA and Microsoft Research fellowships and by funding from the University of Washington Animation Research Labs, Microsoft, Adobe, and Pixar.

References

[1] M. Ben-Ezra and S. K. Nayar. Motion-based motion deblurring. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6):689–698, 2004.

[2] P. Bhat, C. L. Zitnick, N. Snavely, A. Agarwala, M. Agrawala, B. Curless, M. Cohen, and S. B. Kang. Using photographs to enhance videos of a static scene. In EGSR '07: Proc. of the Eurographics Symposium on Rendering, pages 327–338. Eurographics, June 2007.

[3] C. Bishop, A. Blake, and B. Marthi. Super-resolution enhancement of video. In Proc. of the 9th Conference on Artificial Intelligence and Statistics (AISTATS), 2003.

[4] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11):1222–1239, 2001.

[5] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert. High accuracy optical flow estimation based on a theory for warping. In ECCV '04: Proc. of the 8th European Conference on Computer Vision, volume 4, pages 25–36, 2004.

[6] A. Bruhn, J. Weickert, C. Feddern, T. Kohlberger, and C. Schnorr. Variational optical flow computation in real time. IEEE Transactions on Image Processing, 14(5):608–615, May 2005.

[7] C. Cafforio, F. Rocca, and S. Tubaro. Motion compensated image interpolation. IEEE Trans. on Communications, 38(2):215–222, Feb. 1990.

[8] R. Castagno, P. Haavisto, and G. Ramponi. A method for motion-adaptive frame rate up-conversion. IEEE Trans. Circuits and Systems for Video Technology, 6(5):436–446, October 1996.

[9] B. Choi, S. Lee, and S. Ko. New frame rate up-conversion using bi-directional motion estimation. IEEE Trans. on Consumer Electronics, 46(3):603–609, 2000.

[10] M. F. Cohen and R. Szeliski. The moment camera. IEEE Computer, 39(8):40–45, 2006.

[11] M. Elad and A. Feuer. Restoration of a single superresolution image from several blurred, noisy, and undersampled measured images. IEEE Transactions on Image Processing, 6(12):1646–1658, Dec. 1997.

[12] W. T. Freeman, T. R. Jones, and E. C. Pasztor. Example-based super-resolution. IEEE Computer Graphics and Applications, 22(2):56–65, 2002.

[13] M. Irani and S. Peleg. Improving resolution by image registration. CVGIP: Graphical Models and Image Processing, 53(3):231–239, 1991.

[14] M. Kazhdan and H. Hoppe. Streaming multigrid for gradient-domain operations on large images. ACM Transactions on Graphics (SIGGRAPH '08), 27(3):1–10, 2008.

[15] V. Kwatra, A. Schodl, I. Essa, G. Turk, and A. Bobick. Graphcut textures: Image and video synthesis using graph cuts. ACM Transactions on Graphics (SIGGRAPH '03), 22(3):277–286, July 2003.

[16] F. Li, J. Yu, and J. Chai. A hybrid camera for motion deblurring and depth map super-resolution. In CVPR '08: Proc. of the 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2008.

[17] Z. Lin and H.-Y. Shum. Fundamental limits of reconstruction-based superresolution algorithms under local translation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(1):83–97, 2004.

[18] H. Nagahara, A. Hoshikawa, T. Shigemoto, Y. Iwai, M. Yachida, and H. Tanaka. Dual-sensor camera for acquiring image sequences with different spatio-temporal resolution. In AVSBS '05: Advanced Video and Signal Based Surveillance, pages 450–455, 2005.

[19] H. Nagahara, T. Matsunobu, Y. Iwai, M. Yachida, and T. Suzuki. High-resolution video generation using morphing. In ICPR '06: Proc. of the 18th International Conference on Pattern Recognition, pages 338–341, 2006. IEEE Computer Society.

[20] P. Perez, M. Gangnet, and A. Blake. Poisson image editing. ACM Transactions on Graphics (SIGGRAPH '03), pages 313–318, 2003.

[21] E. Reinhard, M. Ashikhmin, B. Gooch, and P. Shirley. Color transfer between images. IEEE Computer Graphics and Applications, 21(5):34–41, 2001.

[22] P. Sand and S. Teller. Particle video: Long-range motion estimation using point trajectories. In CVPR '06: Proc. of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 2195–2202, 2006. IEEE Computer Society.

[23] H. S. Sawhney, Y. Guo, K. Hanna, R. Kumar, S. Adkins, and S. Zhou. Hybrid stereo camera: An IBR approach for synthesis of very high resolution stereoscopic image sequences. In SIGGRAPH '01: ACM Transactions on Graphics, pages 451–460, 2001. ACM.

[24] A. Schubert and K. Mikolajczyk. Combining high-resolution images with low-quality videos. In Proc. of the British Machine Vision Conference, 2008.

[25] R. Schultz and R. Stevenson. Extraction of high-resolution frames from video sequences. IEEE Transactions on Image Processing, 5(6):996–1011, June 1996.

[26] E. Shechtman, Y. Caspi, and M. Irani. Space-time super-resolution. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(4):531–545, 2005.

[27] Y. Tai, H. Du, M. Brown, and S. Lin. Image/video deblurring using a hybrid camera. In CVPR '08: Proc. of the 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2008.

[28] D. Vatolin and S. Grishin. N-times video frame-rate up-conversion algorithm based on pixel motion compensation with occlusions processing. In Graphicon: International Conference on Computer Graphics and Vision, pages 112–119, 2006.

[29] V. Vineet and P. Narayanan. CUDA cuts: Fast graph cuts on the GPU. In Computer Vision and Pattern Recognition Workshops, pages 1–8, 2008.

[30] K. Watanabe, Y. Iwai, H. Nagahara, M. Yachida, and T. Suzuki. Video synthesis with high spatio-temporal resolution using motion compensation and spectral fusion. IEICE Transactions on Information and Systems, E89-D(7):2186–2196, 2006.

[31] Y. Wexler, E. Shechtman, and M. Irani. Space-time completion of video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(3):463–476, 2007.

[32] B. Wilburn, N. Joshi, V. Vaish, E.-V. Talvala, E. Antunez, A. Barth, A. Adams, M. Horowitz, and M. Levoy. High performance imaging using large camera arrays. ACM Transactions on Graphics (SIGGRAPH '05), 24(3):765–776, 2005.

[33] W. F. Wong and E. Goto. Fast evaluation of the elementary functions in single precision. IEEE Trans. on Computers, 44(3):453–457, 1995.

[34] J. Xiao, H. Cheng, H. S. Sawhney, C. Rao, and M. A. Isnardi. Bilateral filtering-based optical flow estimation with occlusion detection. In ECCV '06: Proc. of the European Conference on Computer Vision, pages 211–224, 2006.