
Real-Time 2D to 3D Conversion from H.264 Video Compression
Nir Shabat, Gil Shabat and Amir Averbuch

Abstract—We present a real-time procedure for automatic conversion of 2D H.264 video into 3D stereo. The proposed algorithm works with H.264 compressed videos by taking advantage of advanced motion compensation (MC) features such as multiple references and quarter-pixel accuracy. The algorithm is based on incorporating our Depth-From-Motion-Compensation (DFMC) algorithm into the H.264 decoding process. The DFMC algorithm uses the MC data in the compressed video to estimate a depth map for each video frame. These depth maps are then used to generate a stereo video in real time via Depth-Image-Based Rendering (DIBR).

Index Terms—depth estimation, 3d conversion, motion compensation, MPEG, H.264, real-time


1 INTRODUCTION

In recent years, we have witnessed a sharp increase in the demand for and availability of 3D capable devices such as TV sets and cellular phones. This rise is somewhat held back by the fact that the vast majority of video content available today is 2D video, and that 3D video recording in new productions is both expensive and complicated. In order to bridge the gap between the availability of and the demand for 3D content, different stereo conversion techniques are being used for converting 2D video into 3D.

Most stereo conversion algorithms first estimate a depth map for each image in the video sequence, and then use depth-image-based rendering (DIBR) techniques to generate images for each eye. These algorithms exploit different depth cues in the image sequence to infer the depth of different areas in these images. When working with a single 2D image or a 2D video, where the view is monocular, the depth cues are sometimes referred to as monocular depth cues. Some examples of monocular depth cues, which are used for depth estimation and perception, are structure [1], object familiarity [2], [3], [4], lighting [5], [6] and motion [7], [8], [9], [10], [11] (i.e. in video).

An important aspect of 2D to 3D stereo conversion algorithms lies in their performance and in their capability to perform the conversion in real-time. Using semi-automatic or computationally-intensive techniques in the process of depth map estimation can result in a high-quality 3D video; however, these techniques are not suitable for use with either live broadcast or streaming content. It is evident that stereo conversion techniques for 3D capable TVs and cell phones must perform the stereo conversion in real-time.

N. Shabat, Google Tel Aviv, Electra Tower, 29th Floor, 98 Yigal Alon, Tel-Aviv, 6789141, Israel
A. Averbuch, G. Shabat, School of Computer Science, Tel Aviv University, Tel Aviv 69987, Israel

In this paper, we present a new algorithm for stereo conversion of 2D video that is based on the H.264 video compression standard. We call this algorithm Depth From Motion Compensation (DFMC - see [12]). Similarly to many stereo conversion algorithms, the focus of our DFMC algorithm is on estimating a depth map to be subsequently used for stereo view synthesis.

The depth map estimation process of our algorithm is based on the Motion Compensation (MC) data, which is part of every H.264-based compressed video. The MC data, which is extracted during the video decoding process, is used to calculate the relative projected velocity of the elements in the scene. This velocity is used to estimate the relative depths of pixel patches in each picture of the video scene sequence. The depth estimation process is performed alongside the standard H.264 decoding process for increased efficiency, where the initial estimation of a patch depth is performed immediately after its decoding is completed.

In order to achieve spatial and temporal smoothness, we apply a joint bilateral filter on the estimated depth maps and then calculate a weighted average on the filtered result. The averaged depth maps are then used in a DIBR procedure to synthesize the resulting stereo images. We show how our stereo conversion procedure can be integrated into the normal flow of the H.264 decoding process. Figure 1.1 depicts the data flow diagram of a stereo conversion system that incorporates our DFMC algorithm.

We implemented our DFMC based stereo conversion procedure within the open source FFmpeg [13] H.264 decoder. We then applied it to several video sequences including nature scenes (“National Geographic - Amazing Flight Over the Grand Canyon”) and movie trailers (“Batman: The Dark Knight” and “X-Men: Days of Future Past”).

This paper has the following structure. Section 2 discusses related work on 2D to 3D conversion, focusing on depth-map-based procedures for stereo conversion.


Fig. 1.1: Data flow diagram for the 2D to 3D stereo conversion procedure resulting from the incorporation of our DFMC algorithm into the H.264 decoding process. The blue elements are part of the H.264 standard decoding process, while the red elements are added to support the stereo conversion.

Section 3 describes the necessary background for 2D to 3D stereo conversion such as DIBR, motion parallax and the H.264 video standard. Section 4 presents a detailed description of the DFMC algorithm. Finally, Section 5 presents results (performance and quality) from the application of the algorithm to various video content.

2 RELATED WORK

The area of 2D to 3D conversion (which bears many similarities to the area of 3D reconstruction) has been extensively studied over the years [14], [15], [16], [17]. The main task in the majority of 2D to 3D conversion techniques is estimating the depth map for a given 2D image (or multiple depth maps for an image sequence). The depth map is generated by exploiting different depth cues in the 2D content. This depth map is then used to synthesize the views from the left and right eyes (or alternatively, to synthesize the view from one eye, using the source image as the view from the other). View synthesis is performed by employing depth-image-based rendering (DIBR) techniques.

There are many approaches for estimating the depth map of an image. Semi-automatic methods require the user to provide partial depth or disparity information, usually by marking scribbles on a few key frames [18], [19], [20]. This partial depth information is then propagated to the entire video scene by different methods. One approach uses a shifted bilateral filter, a form of bilateral filtering that takes motion into consideration [19]. A different path is taken in [18], where an SVM classifier (trained on the partial data) is employed to define constraints for an optimization problem. This optimization problem is then solved as a sparse linear system. Tracking is used in [21] to propagate objects’ depth from key frames into other frames.

Another approach uses the structural and the geometric cues of an image to infer its depth map. The absolute mean depth of a scene is estimated in [1] by using features that are based on local and global spectral signatures in the image. In [22], the image is first classified into one of three structural categories (indoor, outdoor with geometric elements and outdoor without geometric elements), followed by the detection of vanishing points, before the actual depth is assigned. A two-phase approach is used in [23] where in the first phase different parts of the image are labeled semantically (with labels such as ground, sky, building, etc.). In the second phase, the depth of each pixel (or super-pixel) is estimated in relation to the label given to it in the first phase.

Recent methods approach the depth estimation problem from a machine learning point of view. A Markov Random Field (MRF) is used in [24], [25], [26], [27], [28] to model the relationship between the depth of an image patch and the depth of its neighboring patches. The MRF is trained on a database of images and their corresponding ground-truth depth maps. A data-driven approach is taken in [29], where a database of image-plus-depth pairs is employed. The k-nearest neighbors (k-NN) of the query image are found in the database. The Euclidean norm of the difference between histograms of oriented gradients (HOG) [30] is used as the distance measure. Then, the depth maps of these k neighbors are fused by a median filter into a single depth map. The resulting depth map is then smoothed by the application of a bilateral filter. A similar approach is described in [31], where the k-nearest candidate images are selected based on a combination of GIST [32] features and features derived from a calculated optical flow. The SIFT flows [33], which are computed from the input image to each of the k image candidates, are used in warping each of the candidates’ depth maps. Finally, an objective function is defined for fusing the warped depth maps. This function is minimized using an iteratively reweighted least squares (IRLS) procedure. A supervised learning approach, which uses the random forest method, is applied in [34] to depth map estimation. The model is trained on feature vectors, which are based on a variety of depth cues, in order to get good results for unconstrained input videos.

Motion is often used when inferring the depth of an image sequence. A popular approach is to use the technique of structure from motion (SFM) [35], [36]. In SFM, a correspondence between subsequent video frames is


established. Then, it is used to compute the camera motion and the 3D coordinates of the matched image points. An SFM procedure is used in [7] to estimate the depth of feature points, resulting in a sparse depth map. To achieve a dense depth map, a Delaunay triangulation is generated for each input image. Then, depth is assigned to the pixels of each triangle according to the depth of the feature points covered by the triangle. SFM techniques are used in [37], [38] to estimate the camera parameters of each frame in the video. These camera parameters are then used in an energy function to enforce photo-consistency. Subsequently, loopy belief propagation [39] is employed in minimizing the energy function, thus generating the initial depth map estimation. This initial estimation is then improved iteratively by minimizing another energy function using an optimization framework called bundle optimization. The colors of the input image are used in [9], [10] to enhance the motion estimated depth maps. A two-phase approach is employed by [9]. First, the pixels of the image are clustered based on their color using k-means; then, the clustering is refined using motion and edge information, where the pixels are classified into near and far categories. A coarse depth map is estimated in [10] by employing a block-matching algorithm for calculating the motion vectors (MVs) between an image and its reference image. The scaled size of each calculated MV is used as the depth of its block in the coarse depth map, which is later refined by utilizing image color.

In order to achieve real-time performance, recent methods [11], [40], [41], [42], [43] utilize the MC data in videos compressed by either the MPEG-2 or the H.264/MPEG-4 AVC standards [44]. The scaled size of each MV is used as the depth of its associated macro block (MB) in [40] as an initial depth map, which is then smoothed using a nearest-neighbor method. The horizontal component of each MV is used in [41] as an initial depth estimation, to which several corrections are then applied based on the camera’s motion, displacement errors and object borders. The size of MVs is similarly used in [11] for a coarse depth map estimation, which is then refined by using a Gaussian mixture model (GMM) [45] to detect the moving objects. This motion based depth map is then fused with a geometric based depth map to produce the final result.

Our suggested procedure builds upon the idea of using the motion information stored by motion-compensation-based video formats. We use the MC data of a frame not only to estimate depth within the frame, but also to estimate the depth in the referenced frames. The use of this “backward” prediction allows for a better depth estimation, since it allows depth estimation for intra-predicted pixel partitions. Our procedure also takes advantage of advanced H.264 features such as quarter-pixel precision of MVs and motion predicted partitions as small as 4×4 pixels. We describe a complete framework for incorporating our stereo conversion procedure into an H.264 decoder that supports real-time 2D to 3D stereo conversion.

Fig. 3.1: An example of a color image (top) and its corresponding depth map (bottom). Areas in the color image, which are closer to the camera, appear brighter in its depth map.

3 STEREO CONVERSION

3.1 3D Stereo Video

We usually view the world from two viewing angles due to the separation of the eyes. These two different viewing angles result in slightly different images seen by the left and right eye. This difference in images, and specifically the difference in the locations of objects between them, is called binocular disparity. Binocular disparity is used by the brain to extract depth information from these two images. The depth perceived by the brain due to this difference is called stereopsis. The vast majority of 3D techniques exploit stereopsis for creating the illusion of depth on 2D displays (such as TV sets) by presenting a different image to each eye. The use of stereopsis for this purpose is called stereoscopy. In order to create a perceived depth illusion for a 2D monocular image using stereopsis, one needs to generate two different images from the 2D image, which represent the two different viewing angles of the eyes.

3.2 Depth Maps

A depth map (or depth image) of an image is another image in which the intensity (or color) of each pixel represents the distance (or the depth) from a viewpoint of the corresponding pixel in the first image (see Fig. 3.1). Depth maps can be acquired directly by using tools such as a laser scanner, or be generated synthetically from


Fig. 3.2: The disparity of a point P between images taken from a camera with focal length f at points L, C and R.

ordinary 2D content (from one or more viewpoints). Synthetic depth maps may either be produced in a manual process, or by an automatic (or semi-automatic) procedure. Depth maps have numerous applications, one of which is rendering a scene from different view points in a process which is called Depth-Image-Based Rendering (DIBR), as explained in Section 3.3.

3.3 Depth-Image-Based Rendering (DIBR)

DIBR techniques use an image depth map to synthesize different images, pertaining to different viewing angles of a scene, from a single image of that scene. In 2D to 3D stereo conversion algorithms, these techniques are used to generate images for the left and right eyes from a monocular image and its depth map. Conceptually, DIBR is a two step process. Initially, the image is projected into 3D coordinates using the information of its depth map. Then, the 3D image is reprojected onto the 2D plane of a virtual camera. This process of projecting and reprojecting the image is called image warping. For the case of stereoscopic image creation, in which the virtual cameras are simple translations of the actual camera, image warping takes a simpler form.

Let P be a 3D point with coordinates (Px, Py, Pz), then we have (see Fig. 3.2):

Xl = f · (Px − L) / Pz,   Xc = f · Px / Pz,   Xr = f · (Px − R) / Pz

where f is the focal length of the cameras, L and R are the horizontal coordinates of the left and right eyes’ “virtual” cameras, respectively, and Xc, Xl, Xr are the horizontal distances between the cameras’ centers and the projection of the point P.

Denoting |L| = |R| = tc/2, where tc is the interaxial distance (the horizontal distance between the two view ports), the disparity for the left and right images becomes

∆s = ± (tc · f) / (2 · Pz).   (3.1)

Given the focal length f, the interaxial distance tc and a dense depth map (giving Pz at each pixel), we can use Eq. 3.1 to calculate the disparity of each pixel between the input image and the images for the left and right eyes needed for stereopsis. In the case of a viewer watching a video on a 2D display, f is essentially the viewer’s distance from the screen. tc is usually chosen as the distance in humans between the eyes, which on average is 64mm.
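As an illustration of Eq. 3.1 only, the following sketch computes per-pixel disparities from a dense depth map. The function name and the parameter defaults (a focal length expressed in pixels and a 64mm interaxial distance) are assumptions made for the example, not values taken from the paper.

```python
import numpy as np

def disparity_from_depth(depth, f=1200.0, tc=0.064):
    """Per-pixel disparity via Eq. 3.1: delta_s = +/- (tc * f) / (2 * Pz).

    depth : 2D array of distances Pz (same length unit as tc, e.g. meters)
    f     : focal length / viewing distance expressed in pixels (assumed value)
    tc    : interaxial distance; 64mm is the average human eye separation
    """
    pz = np.maximum(depth, 1e-6)          # guard against division by zero
    half_shift = (tc * f) / (2.0 * pz)    # magnitude of the horizontal shift
    return half_shift, -half_shift        # left-eye (+) and right-eye (-) disparities
```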

3.4 Motion Parallax

Parallax is defined as the displacement or difference in the apparent position of an object viewed along two different lines of sight. Objects, which are closer to the viewer, have larger parallax than objects which are further away. This fact lets us use parallax in order to estimate objects’ distance.

Motion parallax is the parallax created due to the movement of the viewer. Since objects, which are closer to the viewer, have larger parallax during the movement of the viewer, they appear to move faster than objects which are more distant (see Fig. 3.3).

In a monocular image sequence (e.g. video), this parallax can be used as a depth cue for estimating the relative pixels’ depth. Consider a camera moving horizontally from a point C to a point R in Fig. 3.2. The disparity of point P between the two images, taken by the camera at points C and R, is given by Eq. 3.1. According to Eq. 3.1, bigger values of Pz will decrease the disparity ∆s and vice versa. This means that we can use Eq. 3.1 to infer the relative pixels’ depths from their disparity between the two images.

In order to use the disparity of a pixel between two images (taken during camera movement, i.e. from two different viewing angles), we first need to match the pixel between those two images. The resulting parallax can then be used to estimate the relative depth of that pixel using Eq. 3.1.

3.5 Motion Compensation-Based Video

Modern video compression formats use an algorithmic technique called Motion Compensation to utilize the temporal redundancy in videos. These formats can describe (or predict) video frames in terms of other video frames, which are called reference frames, or just references (in this work we use the term frame loosely, referring to both full frames and frame fields). The references of a frame can either precede it in time, or belong to a future time (presentation-wise). The description of video frames in terms of other frames is called inter frame prediction, or inter-prediction for short, and these frames are called inter predicted frames, or inter-frames. In contrast, frames, which are not described in terms of


Fig. 3.3: Images taken at constant time intervals from a camera moving at constant speed from right to left. The green box, which is closer to the camera, appears to be moving faster than the distant red box.

other frames, and for which decoding can be performed by using only information contained within them (i.e. without any reference frames), are called intra predicted frames, or intra-frames for short. A series of subsequent frames, which can be decoded independently of the rest of the video, is called a Group-of-Pictures (GOP).

In order to describe a frame in terms of other frames, the frame is split into pixel blocks called Macro Blocks (MB). These MBs can be further split into smaller blocks called MB Partitions. Each MB or MB partition of an inter-frame can be associated with (one or more) vectors called Motion Vectors (MVs). Each MV points to a location in a reference frame which is used to describe the MB (or MB partition) it is associated with. This allows the video encoder to only store the difference (or the error) between the inter-predicted MB (or MB partition) and its references, instead of the entire MB (or MB partition).
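To make the terminology concrete, here is a minimal, hypothetical representation of an inter-predicted MB partition and its motion vectors; the type and field names are illustrative only and do not correspond to any particular decoder’s internal structures.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MotionVector:
    mv_x: float    # horizontal component (H.264 stores these in quarter-pel units)
    mv_y: float    # vertical component
    ref_idx: int   # which reference frame in the DPB this MV points into

@dataclass
class MBPartition:
    x: int                                    # top-left corner inside the frame
    y: int
    w: int                                    # 16x16 down to 4x4 samples in H.264
    h: int
    mvs: List[MotionVector] = field(default_factory=list)  # empty => intra-predicted
```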

Video compression standards employing MC techniques usually use three types of frames: I-frames, P-frames and B-frames, according to the prediction type used to encode the frame. I-frames are intra frames; P-frames are inter frames, which can reference other frames that precede them in time (i.e. backward references); and B-frames are inter frames, which can reference frames that either precede them or follow them in time (i.e. backward and forward references).

In order for a decoder to decode an inter-frame i, it needs to have access to all the frames referenced by i (in their decoded form). To achieve this, all frames referenced by the inter-frame i must arrive at the decoder prior to the arrival of i itself. The order by which the frames arrive at the decoder is called Decoding Order (see Fig. 3.5). The decoder places decoded frames in a buffer called the Decoded Picture Buffer (DPB). Reference frames are kept in the DPB by the decoder for as long as required in favor of inter prediction. The order by which the frames should be displayed (i.e. presented) is called Presentation Order, and it is usually different from the decoding order (see Fig. 3.4). In the MPEG2 and H.264 video standards, each video frame has two 90KHz time stamps associated with it: one time stamp that indicates the time when the frame should be decoded, called the Decoding Time Stamp (DTS), and one which indicates the time when the frame should be displayed, called the Presentation Time Stamp (PTS).

In order for encoders of motion-compensation-based video standards to perform motion estimation by finding redundant image information, Block-Matching Algorithms (BMA) are employed. BMAs are used to find a match for a block of pixels (MB or MB partition) of a frame i in another frame j. The match is chosen based on some error metric, such as the mean squared error (MSE) or the sum of absolute differences (SAD). BMAs are computationally intensive algorithms, especially when performed on videos with a high resolution (e.g. HD videos) or frame rate.
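For illustration, here is a naive full-pel SAD block-matching search of the kind an encoder might run. It is not part of the proposed decoder-side method (which reuses the MVs already present in the bitstream), and the search window size is an arbitrary assumption.

```python
import numpy as np

def best_match_sad(block, ref_frame, x, y, search=8):
    """Exhaustive SAD block matching inside a small square search window.

    block     : pixels of the current MB/partition (2D array, e.g. the luma plane)
    ref_frame : reference frame (2D array)
    (x, y)    : top-left position of the block in the current frame
    search    : half-width of the search window (assumed value)
    """
    bh, bw = block.shape
    H, W = ref_frame.shape
    best_sad, best_mv = np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = x + dx, y + dy
            if rx < 0 or ry < 0 or rx + bw > W or ry + bh > H:
                continue  # candidate block falls outside the reference frame
            cand = ref_frame[ry:ry + bh, rx:rx + bw].astype(np.int32)
            sad = np.abs(cand - block.astype(np.int32)).sum()
            if sad < best_sad:
                best_sad, best_mv = sad, (dx, dy)
    return best_mv, best_sad
```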


Fig. 3.4: A GOP of 13 frames in presentation order (I B B B P B B B P B B B P). The letter below each box denotes the frame’s type and the number within each box denotes the presentation order count (POC). The arrows denote references, where the arrow head points to the referencing frame and the tail to the referenced frame.

Fig. 3.5: The same GOP as in Fig. 3.4, but in decoding order (frames 1, 5, 2, 3, 4, 9, 6, 7, 8, 13, 10, 11, 12). Note that in decoding order, frames only reference frames which precede them.

3.6 The H.264/MPEG-4 AVC Video Compression Standard

H.264/MPEG-4 AVC is an advanced block-oriented motion-compensation-based video compression format [44], and the successor of the popular MPEG-2 format. H.264/MPEG-4 AVC offers several advancements over older video compression formats. Specifically, in regard to motion compensation (MC), it offers the following new features:
• Allowing up to 16 reference frames per encoded frame
• Allowing B-frames to be used as reference frames
• Variable block sizes, with blocks as big as 16 × 16 and as small as 4 × 4 samples
• The ability to use multiple MVs per MB
• Quarter-pixel precision for motion estimation
H.264 adds a type of intra-coded frame called an Instantaneous Decoding Refresh (IDR) frame. IDR frames are found at the beginning of each (closed) GOP. IDR frames signal that none of the pictures preceding them is needed (i.e. used as reference) for decoding subsequent pictures in the stream. These new enhancements of H.264 allow our proposed DFMC algorithm to perform the depth map estimation in a more accurate and comprehensive way than was possible with previous standards.

4 THE DEPTH FROM MOTION COMPENSATION (DFMC) ALGORITHM

Our algorithm primarily uses motion to estimate a depth map for each video frame. The basic idea behind using motion to estimate depth is based on the fact that the projected object velocity on the screen is correlated with its relative distance from the camera. This is due to the fact that distant objects have a small parallax in comparison to closer ones, see Fig. 3.3. The main objective of the algorithm is to produce dense depth maps which are suitable for 2D to 3D stereo conversion (e.g. via DIBR) in a computationally efficient way.

Since motion estimation is a computationally intensive process, we use the motion-compensation data that is extracted from the compressed video in the H.264 decoding stage. This makes it possible for the DFMC algorithm to be implemented on the decoder side, and enables it to perform 2D to 3D video conversion in real-time. Moreover, our algorithm works with H.264 compressed videos, while taking advantage of its advanced MC features such as multiple references and quarter-pixel accuracy (see Section 3.6). Algorithm 4.1 describes the steps of our DFMC system and how it is incorporated into H.264.

4.1 Initial (Forward) Depth Estimation

Initial, also called forward, depth map estimation is achieved by assigning a depth value to each inter-predicted MB partition based on its associated MVs. Forward depth estimation for a MB partition is performed during the decoding process of the partition (see line 7 in Algorithm 4.1), since at this point we have immediate access to the partition’s MVs, size and location. Similar to what was done in the related work by [11], [40], [41], [42], [43], we use the magnitude of the MVs to estimate the relative depth of an image pixel block. However, instead of directly taking the scaled size of each MV as the depth of a block, we normalize it by the difference in presentation time between the current frame and the frame referenced by that MV. We apply this normalization to account for the greater relative disparity of a pixel block whose MV references a frame that is further away in presentation time, in comparison to other pixel blocks. This normalization is applied under


Fig. 4.1: A sequence of three consecutive video frames from a National Geographic video titled “Amazing Flight Over The Grand Canyon” (top row), with their respective depth maps as produced by the first step of the DFMC algorithm (middle and bottom rows). The images in the second (middle) row are the frames’ depth maps without time normalization. The images in the third (bottom) row are the time-normalized depth maps.

Fig. 4.2: A video frame from a National Geographic video of a flight over The Grand Canyon and its respective depth maps. The second image (middle) is the frame’s depth map without time normalization. The third image (right) is the frame’s time-normalized depth map.

the assumption that the camera’s displacement between two consecutive video frames, which belong to the same video scene, is approximately constant. Figure 4.1 shows the resulting depth maps of 3 consecutive video frames with and without time normalization. One can see that the time-normalized depth maps are much more temporally consistent than their unnormalized counterparts. Figure 4.2 shows that the time-normalized depth map is more spatially consistent than its unnormalized version.

Denoting by s(P) the time-normalized size of the MVs associated with the pixel block (MB or MB partition) of pixel P, then

s(P) = √(mvx² + mvy²) / mv∆t

where mvx and mvy are the horizontal and vertical components, respectively, of the MVs associated with the block of pixel P, and mv∆t is the difference in presentation time between the current frame and the reference frame, calculated using the frames’ PTS.
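A minimal sketch of the time normalization above; it assumes the MV components are already expressed in pixels and that the two PTS values share a common time base.

```python
def time_normalized_mv_size(mv_x, mv_y, pts_curr, pts_ref):
    """s(P): MV magnitude divided by the presentation-time gap to the reference frame."""
    mv_dt = abs(pts_curr - pts_ref)               # PTS difference (e.g. 90 kHz ticks)
    return (mv_x ** 2 + mv_y ** 2) ** 0.5 / mv_dt
```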


Algorithm 4.1: 2D to 3D conversion process with DFMC
Input: Next video frame F of video stream
Output: Side-by-side 3D version of input frame

1: D ← Empty depth map
2: for M ← macroblocks of F do
3:   if M is inter-predicted then
4:     for P ← pixel partitions of M do
5:       V ← MV of partition P
6:       RD ← depth maps of reference frames
7:       Update depth of P in D based on V
8:       Update depth of P references in RD based on V
9:     end for
10:  end if
11:  Decode M
12:  Add F to DPB
13:  Add D to DMB
14:  N ← next frame to display
15:  if N is an IDR frame then
16:    while DDB is not empty do
17:      F ← next frame in DDB
18:      D ← next DM in DMB
19:      Apply spatial hole filling, cross bilateral filtering and temporal smoothing to D
20:      L ← DIBR of left-eye image from F and D
21:      Output side-by-side video frame using L and F
22:    end while
23:  end if
24:  Add N to DDB
25: end for

To account for times when the camera is at a standstill, we define a lower bound Tmin. When the value of s(P) is below Tmin, the depth of pixel P is discarded. In order to disregard MVs which refer to partitions not belonging to the same object, we also define an upper bound Tmax, where Tmax > Tmin. When the value of s(P) is above Tmax, we discard the estimation for pixel P.

Since H.264 supports weighted prediction (where a block is predicted using more than one reference block), our algorithm uses the MV which references the frame that is closer in presentation time (according to the PTS) for the depth estimation of the block.

The initial depth estimation for a pixel P, where s(P) is within the bounds Tmin and Tmax, is given by

Pz = depth(P, c) = max(round(s(P) / c), 1.0)   (4.1)

where c > 0 is a scaling factor, which is chosen empirically for the entire shot, or changed dynamically during the estimation process. In our experiments, we chose c dynamically for each GOP, so that the maximum estimated depth in that GOP is 1, i.e.,

c = max{F∈G, P∈F} s(P)

where F ∈ G denotes a frame F of a GOP G and P ∈ F denotes a pixel P of frame F. Note that in order to dynamically calculate c for each GOP, we need to delay the output of our decoder by one GOP. In our experiments, we chose Tmin = c/10 and Tmax = 30 as the lower and upper bounds, respectively.
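The following sketch combines the thresholding and the per-GOP scaling described above. It deliberately simplifies Eq. 4.1: the value is only scaled by c and clipped to (0, 1], which is one plausible reading given that c is the maximum s(P) in the GOP; the exact rounding of Eq. 4.1 is left out.

```python
def initial_block_depth(s_p, c, t_min=None, t_max=30.0):
    """Forward depth estimate for one pixel block, or None when it is discarded.

    s_p   : time-normalized MV size s(P) of the block
    c     : per-GOP scaling factor, here assumed to be the maximum s(P) in the GOP
    t_min : standstill threshold; the paper uses c / 10
    t_max : outlier threshold for MVs that cross object boundaries
    """
    if t_min is None:
        t_min = c / 10.0
    if s_p < t_min or s_p > t_max:
        return None                 # discarded: filled later from the previous frame
    return min(s_p / c, 1.0)        # simplified stand-in for Eq. 4.1
```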

4.2 Backward Depth Estimation

The initial depth map, estimated in the first step (see line 7 in Algorithm 4.1), provides a very partial depth estimation, since it only estimates the depth of pixels belonging to inter-predicted blocks. It does not provide an estimation for pixels belonging to intra-predicted blocks, since these do not have any MV associated with them. For example, I frames are strictly intra-coded, so every pixel in an I frame belongs to an intra-predicted MB or MB partition that has no associated MV. Since we want to use the precalculated MC data encoded in the video as much as possible when estimating pixels’ depth, our algorithm also uses the MVs in a “backward” manner, by using their size to assign depth to pixels belonging to the referenced blocks (see Fig. 4.3). This step is also performed during the decoding process of the MB partition (see line 7 in Algorithm 4.1), since at this point we can also quickly access the reference’s depth map. The same pixel can be referenced several times, either by belonging to several referenced blocks or by belonging to a block being referenced multiple times. For such pixels, we use the MV that belongs to the closest (presentation-time-wise) referencing frame for that pixel. We then estimate its depth by using Eq. 4.1. This allows our DFMC Algorithm 4.1 to estimate the depth of many pixels that are not inter-predicted. Figure 4.4 shows that using backward estimation results in higher quality depth maps, which are more complete and less noisy, than the depth maps generated by using only forward estimation.
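A sketch of the backward update for one inter-predicted block: the block’s depth value is written into the depth map of the frame its MV references. It assumes the MV is already in full-pel units and omits the tie-breaking rule (closest referencing frame by PTS) for pixels referenced more than once.

```python
def backward_depth_update(ref_depth, mv_x, mv_y, x, y, w, h, depth_value):
    """Write depth_value into the referenced block of the reference frame's depth map.

    ref_depth    : depth map of the referenced frame (2D array, updated in place)
    (mv_x, mv_y) : motion vector of the current block, in pixels
    (x, y, w, h) : position and size of the current inter-predicted partition
    depth_value  : depth estimated for the current block (Eq. 4.1)
    """
    rx, ry = int(round(x + mv_x)), int(round(y + mv_y))   # block the MV points to
    H, W = ref_depth.shape
    x0, y0 = max(rx, 0), max(ry, 0)
    x1, y1 = min(rx + w, W), min(ry + h, H)
    if x1 > x0 and y1 > y0:
        ref_depth[y0:y1, x0:x1] = depth_value
```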

In order to incorporate this backward depth estimation, the decoder stores the partial depth map of each decoded frame in the DPB for as long as that frame is still marked as a reference. Only when a frame is no longer used as a reference does the decoder output its depth map to the Depth Map Buffer (DMB). Since a frame can be fully decoded prior to the completion of its depth map estimation, we maintain a buffer of decoded frames called the Depth Delay Buffer (DDB). The DMB and DDB buffers are used to synchronize the depth estimation process with the decoding process. After the completion of a frame decoding process, we check if the next frame to display (according to the DDB) has a corresponding depth map in the DMB. If it does, the DIBR process uses the frame and its depth map to synthesize the stereo images (see Algorithm 4.1).

Note that in order to perform this backward estimation, the decoder must delay its output by at most one


Fig. 4.3: The depth of pixels belonging to the I frame in the figure (dashed rectangles) is estimated using the MVs of the referencing P and B frames. The depth of intra-predicted pixels of the P frame in the figure is predicted using the MVs of the referencing B frames.

Fig. 4.4: Three frames from the National Geographic video titled “Amazing Flight Over The Grand Canyon”. The first (leftmost) image in each row is the source frame, the second (middle) image is the frame’s depth map using only forward estimation, and the third (rightmost) image in each row is the frame’s depth map estimated using both forward and backward estimation. Note that the frame in the first (top) row is an I frame, therefore its forward estimated depth map is empty.

GOP, in order to keep updating the depth map of the first frame in the GOP as new frames arrive at the DPB. In most cases, when the video is a broadcast stream, this does not pose a problem, since broadcast standards dictate that IDR pictures cannot be too far apart. For example, the ATSC Standard [46] requires AVC transport streams to contain at least one IDR frame per second of video.

4.3 Depth Hole Filling

In the previous steps (forward and backward depth estimation), we discarded the estimated depth for pixels belonging to MB partitions with MV sizes lower than some predefined bound Tmin. In this step (see line 18 in Algorithm 4.1), we set the depth of these pixels as being equal to the depth of the same pixels in the preceding frame in presentation order. The assumption here is that when the camera or the objects, which these pixels belong to, are not moving, their depth should be the same as in the preceding frame. This step is performed


Fig. 4.5: Three frames from a National Geographic video of a flight over The Grand Canyon. The first (leftmost) image in each row is the source frame, the second (middle) image is the frame’s depth map without the hole-filling step, and the third (rightmost) image in each row is the frame’s depth map after the application of the hole-filling step.

right before the spatial and temporal smoothing. Figure 4.5 provides three examples of depth maps before and after the hole filling step. The hole-filled depth maps are less noisy than the original depth maps.
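A minimal sketch of the hole-filling rule, assuming that discarded or never-estimated pixels are marked with a sentinel value in the current depth map.

```python
import numpy as np

def fill_depth_holes(curr_depth, prev_depth, hole_value=0.0):
    """Copy the previous (presentation-order) frame's depth wherever the current
    estimate was discarded, e.g. because s(P) fell below Tmin."""
    out = curr_depth.copy()
    holes = curr_depth == hole_value
    out[holes] = prev_depth[holes]
    return out
```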

4.4 Temporal and Spatial Smoothness

It is important for a video scene’s depth to have both spatial and temporal consistency, otherwise the resulting 3D video will create discomfort for the viewer. Processing the depth maps for spatial and temporal consistency constitutes the final step prior to DIBR (see line 19 in Algorithm 4.1). To achieve spatial smoothness, our algorithm applies cross bilateral filtering [47] to the depth map, with a Gaussian based weight distribution over the RGB image. Formally,

D^s(P) = (1 / W(P)) · Σ_{Q∈N(P)} G_{σs}(‖P − Q‖) · G_{σr}(|I(P) − I(Q)|) · D(Q)

where

W(P) = Σ_{Q∈N(P)} G_{σs}(‖P − Q‖) · G_{σr}(|I(P) − I(Q)|),

and D^s(P) is the spatially smoothed depth value of pixel P. D and I denote the depth map and the RGB image, respectively, N(P) denotes the neighbors of pixel P, and G_{σs} and G_{σr} denote Gaussian distributions with an STD of σs and σr, respectively. The distance in values between two image pixels is defined as the Euclidean difference in RGB coordinates. Formally,

|I(P) − I(Q)| = √((Pr − Qr)² + (Pg − Qg)² + (Pb − Qb)²).

In contrast to Gaussian filtering, bilateral filtering is edge preserving. Consequently, it will not smooth depth values across edges in the image. Moreover, we take advantage of the color information by cross filtering the depth map with the input image. In our experiments, we used a filter kernel of size 5 × 5, σs = 3 and σr = 0.15.
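Below is a naive reference implementation of the cross (joint) bilateral filter defined above, using the paper’s 5×5 kernel, σs = 3 and σr = 0.15. It assumes the RGB image is scaled to [0, 1] so that σr = 0.15 is meaningful, and it is written for clarity rather than speed.

```python
import numpy as np

def cross_bilateral_filter(depth, rgb, radius=2, sigma_s=3.0, sigma_r=0.15):
    """Cross bilateral filtering of a depth map D guided by its RGB image I.

    depth  : (H, W) depth map
    rgb    : (H, W, 3) image, assumed in [0, 1]
    radius : 2 gives the 5x5 kernel used in the paper
    """
    H, W = depth.shape
    out = np.empty_like(depth, dtype=np.float64)
    for y in range(H):
        for x in range(W):
            acc, weight = 0.0, 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    qy, qx = y + dy, x + dx
                    if not (0 <= qy < H and 0 <= qx < W):
                        continue
                    g_s = np.exp(-(dx * dx + dy * dy) / (2.0 * sigma_s ** 2))
                    d_rgb = np.linalg.norm(rgb[y, x] - rgb[qy, qx])  # Euclidean RGB distance
                    g_r = np.exp(-(d_rgb * d_rgb) / (2.0 * sigma_r ** 2))
                    acc += g_s * g_r * depth[qy, qx]
                    weight += g_s * g_r
            out[y, x] = acc / weight if weight > 0 else depth[y, x]
    return out
```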

For temporal smoothness, a weighted average scheme was used. The temporally smoothed depth map of the i-th frame, D^{s,t}_i, is computed by

D^{s,t}_i = (2 / (M(M + 1))) · (M · D^s_i + Σ_{j=1}^{M−1} (M − j) · D^s_{i−j})

where D^{s,t}_i is the temporally-spatially smoothed depth map of frame i, D^s_i is the spatially smoothed depth map of frame i and M is the number of frames to use for temporal smoothing. We used M = 3 in our experiments.
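A small sketch of the weighted temporal average, with the same weights M, M−1, ..., 1 over the last M spatially smoothed depth maps.

```python
def temporal_smooth(spatial_depths, M=3):
    """Weighted average of the spatially smoothed depth maps of the last M frames.

    spatial_depths : list of at least M depth maps in presentation order,
                     with the current frame i as the last element
    """
    recent = spatial_depths[-M:][::-1]        # frame i first, then i-1, ..., i-(M-1)
    weights = range(M, 0, -1)                 # M, M-1, ..., 1
    norm = 2.0 / (M * (M + 1))                # the weights sum to M(M+1)/2
    return norm * sum(w * d for w, d in zip(weights, recent))
```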

4.5 Stereo Image Synthesis

Instead of using the source image and its depth map to synthesize both the left eye image and the right eye


image, we use the source image as the right eye image, and only synthesize an image for the left eye. One benefit of this approach is that generating one synthetic image requires fewer computations than generating two synthetic images. Moreover, it turns out that approximately two-thirds of the population is right-eye dominant [48], [49], [50], [51], meaning they have a tendency to prefer visual input from the right eye, which is another benefit of using a non-synthetic image as the right eye image.

To synthesize the left eye image from the source image and its depth map, we employ a DIBR scheme. The value of a pixel P in the generated image is set to the value of a pixel Q in the source image that resides in the horizontally shifted coordinates of P. The number of pixels to shift is a function of the depth of pixel P (based on Eq. 3.1). We denote by x⋆ the horizontal coordinate of the pixel which is the source pixel of (x, y) in the synthesized image. Then:

x⋆ = max(round(x − k · D^{s,t}(x, y)), 0)

where k > 0 is a positive constant that depends on the viewer’s distance from the display, the interaxial distance and the scene content, and is chosen empirically. The value of a pixel at coordinates (x, y) in the left eye image is thus given by L(x, y) = I(x⋆, y).
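A straightforward sketch of the left-eye synthesis: each output pixel is fetched from a horizontally shifted source column, with the shift proportional to the smoothed depth. The clamping to the image borders and the value of k are assumptions made for the example.

```python
import numpy as np

def synthesize_left_eye(image, depth, k=10.0):
    """Left-eye view from the source image (used as-is for the right eye) and D^{s,t}.

    image : (H, W, 3) source frame
    depth : (H, W) temporally and spatially smoothed depth map, larger = closer
    k     : empirical shift scale (depends on viewing distance, interaxial
            distance and scene content)
    """
    H, W = depth.shape
    left = np.empty_like(image)
    for y in range(H):
        for x in range(W):
            xs = int(round(x - k * depth[y, x]))   # source column x* for pixel (x, y)
            xs = min(max(xs, 0), W - 1)            # clamp to the image borders
            left[y, x] = image[y, xs]
    return left

# A side-by-side 3D frame is then the horizontal concatenation of `left` and `image`.
```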

5 EXPERIMENTAL RESULTS

We implemented our DFMC algorithm and our DIBR procedure as part of the FFmpeg H.264 video decoder. FFmpeg [13] is a popular open source project which produces libraries for the decoding and encoding of multimedia data. We then applied our stereo conversion via the modified decoder to a variety of H.264 broadcast videos. Using a standard MacBook Pro with a 2 GHz quad-core Intel i7 processor and 16 GB of RAM, we were able to perform the stereo conversion of each video stream in real-time. Figures 5.1 to 5.3 show a collection of video frames along with their corresponding depth maps as generated by the DFMC algorithm.

5.1 More Videos to Watch

Side-by-Side 3D video files generated by Algorithm 4.1 can be downloaded from our website at http://sites.google.com/site/nirnirs/. Watching these videos requires a device which supports the display of SbS (Side-by-Side) video, such as a 3D television (3DTV). Some devices require that the viewer wear special glasses (e.g. shuttered or polarized glasses) when watching SbS videos.

CONCLUSION

We described a low complexity motion based algorithm for estimating depth maps for video frames. We also described a low complexity DIBR procedure for the synthesis of stereo images from these depth maps. We presented a complete framework for incorporating the depth estimation algorithm and the DIBR procedure into the standard H.264 decoding process. Finally, we showed that the resulting stereo conversion process is suitable for real-time 2D to 3D conversion of H.264-based broadcast content.

ACKNOWLEDGMENT

This research was partially supported by the Israeli Ministry of Science & Technology (Grants No. 3-9096, 3-10898) and by the US - Israel Binational Science Foundation (BSF 2012282).

REFERENCES

[1] A. Torralba and A. Oliva, “Depth estimation from image structure,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 24, no. 9, pp. 1226–1238, 2002.

[2] C. B. Hochberg and J. E. Hochberg, “Familiar size and the perception of depth,” The Journal of Psychology, vol. 34, no. 1, pp. 107–114, 1952.

[3] W. C. Gogel, “The effect of object familiarity on the perception of size and distance,” The Quarterly Journal of Experimental Psychology, vol. 21, no. 3, pp. 239–247, 1969.

[4] W. C. Gogel and J. A. Da Silva, “Familiar size and the theory of off-sized perceptions,” Perception & Psychophysics, vol. 41, no. 4, pp. 318–328, 1987.

[5] M. S. Langer and H. H. Bulthoff, “Depth discrimination from shading under diffuse lighting,” Perception, vol. 29, no. 6, pp. 649–660, 1999.

[6] H. H. Bulthoff and H. A. Mallot, “Integration of depth modules: stereo and shading,” JOSA A, vol. 5, no. 10, pp. 1749–1758, 1988.

[7] P. Li, D. Farin, R. K. Gunnewiek, and P. H. de With, “On creating depth maps from monoscopic video using structure from motion,” Proceedings of 27th Symposium on Information Theory in the Benelux, pp. 508–515, 2006.

[8] W. Liu, Y. Wu, F. Guo, and Z. Hu, “An efficient approach for 2d to 3d video conversion based on structure from motion,” The Visual Computer, pp. 1–14, 2013.

[9] Y.-L. Chang, C.-Y. Fang, L.-F. Ding, S.-Y. Chen, and L.-G. Chen, “Depth map generation for 2d-to-3d conversion by short-term motion assisted color segmentation,” in Multimedia and Expo, 2007 IEEE International Conference on. IEEE, 2007, pp. 1958–1961.

[10] L.-M. Po, X. Xu, Y. Zhu, S. Zhang, K.-W. Cheung, and C.-W. Ting, “Automatic 2d-to-3d video conversion technique based on depth-from-motion and color segmentation,” in Signal Processing (ICSP), 2010 IEEE 10th International Conference on. IEEE, 2010, pp. 1000–1003.

[11] X. Huang, L. Wang, J. Huang, D. Li, and M. Zhang, “A depth extraction method based on motion and geometry for 2d to 3d conversion,” in Intelligent Information Technology Application, 2009. IITA 2009. Third International Symposium on, vol. 3. IEEE, 2009, pp. 294–298.

[12] N. Shabat, “Real-time 2d to 3d from h.264 video compression,” M.Sc. thesis, Tel-Aviv University, Israel, 2014.

[13] F. Bellard, M. Niedermayer et al., “Ffmpeg,” http://ffmpeg.org, 2014.

[14] Q. Wei, “Converting 2d to 3d: A survey,” in International Conference, Page (s), vol. 7, 2005, p. 14.

[15] W. J. Tam and L. Zhang, “3d-tv content generation: 2d-to-3d conversion,” in Multimedia and Expo, 2006 IEEE International Conference on. IEEE, 2006, pp. 1869–1872.

[16] S. Mohan and L. M. Mani, “Construction of 3d models from single view images: a survey based on various approaches,” in Emerging Trends in Electrical and Computer Technology (ICETECT), 2011 International Conference on. IEEE, 2011, pp. 557–562.

[17] L. Zhang, C. Vazquez, and S. Knorr, “3d-tv content creation: automatic 2d-to-3d video conversion,” Broadcasting, IEEE Transactions on, vol. 57, no. 2, pp. 372–383, 2011.

[18] M. Guttmann, L. Wolf, and D. Cohen-Or, “Semi-automatic stereo extraction from video footage,” in Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 2009, pp. 136–142.


Fig. 5.1: Example result from the movie “X-Men: Days of Future Past”: (a) source frame, (b) estimated depth map, (c) the frame’s anaglyph.

[19] X. Cao, Z. Li, and Q. Dai, “Semi-automatic 2d-to-3d conversion using disparity propagation,” Broadcasting, IEEE Transactions on, vol. 57, no. 2, pp. 491–499, 2011.

[20] O. Wang, M. Lang, M. Frei, A. Hornung, A. Smolic, and M. Gross, “Stereobrush: interactive 2d to 3d conversion using discontinuous warps,” in Proceedings of the Eighth Eurographics Symposium on Sketch-Based Interfaces and Modeling. ACM, 2011, pp. 47–54.

[21] C. Wu, G. Er, X. Xie, T. Li, X. Cao, and Q. Dai, “A novel method for semi-automatic 2d to 3d video conversion,” in 3DTV Conference: The True Vision-Capture, Transmission and Display of 3D Video, 2008. IEEE, 2008, pp. 65–68.

[22] S. Battiato, S. Curti, M. La Cascia, M. Tortora, and E. Scordato, “Depth map generation by image classification,” Proceedings of SPIE, vol. 5302, pp. 95–104, 2004.

[23] B. Liu, S. Gould, and D. Koller, “Single image depth estimation from predicted semantic labels,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 1253–1260.

[24] A. Saxena, S. H. Chung, and A. Ng, “Learning depth from single monocular images,” Neural Information Processing Systems (NIPS), vol. 18, p. 1161, 2005.

[25] A. Saxena, J. Schulte, and A. Y. Ng, “Depth estimation using monocular and stereo cues,” IJCAI, vol. 7, 2007.

[26] A. Saxena, M. Sun, and A. Y. Ng, “Make3d: Depth perception from a single still image,” in Proceedings of the 23rd national conference on Artificial intelligence (AAAI), vol. 3, 2008, pp. 1571–1576.

[27] ——, “Make3d: Learning 3d scene structure from a single still image,” Pattern Analysis and Machine Intelligence (PAMI), IEEE Transactions on, vol. 31, no. 5, pp. 824–840, 2009.

[28] A. Saxena, S. H. Chung, and A. Y. Ng, “3-d depth reconstruction from a single still image,” International Journal of Computer Vision, vol. 76, no. 1, pp. 53–69, 2008.

[29] J. Konrad, M. Wang, and P. Ishwar, “2d-to-3d image conversion by learning depth from examples,” in Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on. IEEE, 2012, pp. 16–22.

[30] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1. IEEE, 2005, pp. 886–893.

[31] K. Karsch, C. Liu, and S. B. Kang, “Depth extraction from video using non-parametric sampling,” in Computer Vision–ECCV 2012. Springer, 2012, pp. 775–788.

[32] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” International Journal of Computer Vision, vol. 42, no. 3, pp. 145–175, 2001.

[33] C. Liu, J. Yuen, and A. Torralba, “Sift flow: Dense correspondence across scenes and its applications,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 33, no. 5, pp. 978–994, 2011.

[34] M. T. Pourazad, P. Nasiopoulos, and A. Bashashati, “Random forests-based 2d-to-3d video conversion,” in Electronics, Circuits, and Systems (ICECS), 2010 17th IEEE International Conference on. IEEE, 2010, pp. 150–153.

[35] D. A. Forsyth and J. Ponce, Computer Vision: A Modern Approach. Prentice Hall Professional Technical Reference, 2002.

[36] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.

[37] G. Zhang, J. Jia, T.-T. Wong, and H. Bao, “Recovering consistent video depth maps via bundle optimization,” in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008, pp. 1–8.

[38] ——, “Consistent depth maps recovery from a video sequence,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 31, no. 6, pp. 974–988, 2009.


Fig. 5.2: Sample result from the movie “Batman: The Dark Knight”: (a) source frame, (b) estimated depth map, (c) the frame’s anaglyph.

[39] J. Pearl, “Reverend bayes on inference engines: A distributed hierarchical approach,” in AAAI, 1982, pp. 133–136.

[40] I. Ideses, L. P. Yaroslavsky, and B. Fishbain, “Real-time 2d to 3d video conversion,” Journal of Real-Time Image Processing, vol. 2, no. 1, pp. 3–9, 2007.

[41] M. T. Pourazad, P. Nasiopoulos, and R. K. Ward, “An h.264-based scheme for 2d to 3d video conversion,” Consumer Electronics, IEEE Transactions on, vol. 55, no. 2, pp. 742–748, 2009.

[42] M. Pourazad, P. Nasiopoulos, and R. Ward, “Conversion of h.264-encoded 2d video to 3d format,” in Consumer Electronics (ICCE), 2010 Digest of Technical Papers International Conference on. IEEE, 2010, pp. 63–64.

[43] T.-I. Lin, C.-Y. Yeh, W.-C. Chen, and W.-N. Lie, “A 2d to 3d conversion scheme based on depth cues analysis for mpeg videos,” in Multimedia and Expo (ICME), 2010 IEEE International Conference on. IEEE, 2010, pp. 1141–1145.

[44] I. E. Richardson, H.264 and MPEG-4 Video Compression: Video Coding for Next-Generation Multimedia. John Wiley & Sons, 2004.

[45] B. G. Lindsay, “Mixture models: theory, geometry and applications,” in NSF-CBMS Regional Conference Series in Probability and Statistics. JSTOR, 1995, pp. i–163.

[46] A. T. S. Committee et al., “ATSC digital television standard,” Document A/72 Part 2, Feb, 2014.

[47] C. Tomasi and R. Manduchi, “Bilateral filtering for gray and color images,” in Computer Vision, 1998. Sixth International Conference on. IEEE, 1998, pp. 839–846.

[48] I. Eser, D. S. Durrie, F. Schwendeman, and J. E. Stahl, “Association between ocular dominance and refraction,” Journal of Refractive Surgery (Thorofare, NJ: 1995), vol. 24, no. 7, pp. 685–689, 2008.

[49] W. H. Ehrenstein, B. E. Arnold-Schulz-Gahmen, and W. Jaschinski, “Eye preference within the context of binocular functions,” Graefe’s Archive for Clinical and Experimental Ophthalmology, vol. 243, no. 9, pp. 926–932, 2005.

[50] M. Reiss and G. Reiss, “Ocular dominance: some family data,” Laterality: Asymmetries of Body, Brain and Cognition, vol. 2, no. 1, pp. 7–16, 1997.

[51] B. Chaurasia and B. Mathur, “Eyedness,” Cells Tissues Organs, vol. 96, no. 2, pp. 301–305, 1976.

Nir Shabat works as a Software Engineer at Google. Nir was born in Tel Aviv, Israel. He received both his B.Sc. and M.Sc. degrees in Computer Science with honors from Tel Aviv University, in 2003 and 2014, respectively. His research interests include machine learning, big data processing and computer vision.


Fig. 5.3: Example result from the movie “Batman: The Dark Knight”: (a) source frame, (b) estimated depth map, (c) the frame’s anaglyph.

Gil Shabat is a postdoctoral fellow in the Department of Computer Science at Tel Aviv University. Gil was born in Tel Aviv, Israel. He received his B.Sc, M.Sc and Ph.D degrees in Electrical Engineering, all from Tel Aviv University, in 2002, 2008 and 2015, respectively. Prior to his PhD research, Gil worked at Applied Materials as a computer vision algorithm researcher. His research interests include low rank approximations, fast randomized algorithms, signal/image processing and machine learning.

Amir Averbuch was born in Tel Aviv, Israel. He received the B.Sc and M.Sc degrees in Mathematics from the Hebrew University in Jerusalem, Israel in 1971 and 1975, respectively. He received the Ph.D degree in Computer Science from Columbia University, New York, in 1983. During 1966-1970 and 1973-1976 he served in the Israeli Defense Forces. In 1976-1986 he was a Research Staff Member at the IBM T.J. Watson Research Center, Yorktown Heights, NY, Department of Computer Science. In 1987, he joined the School of Computer Science, Tel Aviv University, where he is now a Professor of Computer Science. His research interests include applied harmonic analysis, big data processing and analysis, wavelets, signal/image processing, numerical computation and scientific computing (fast algorithms).