© 2008 Matthieu Maitre

“TRAVELLING WITHOUT MOVING”: A STUDY ON THE RECONSTRUCTION, COMPRESSION, AND RENDERING OF 3D ENVIRONMENTS FOR TELEPRESENCE

BY

MATTHIEU MAITRE

Diplome d’ingenieur, Ecole Nationale Superieure de Telecommunications, 2002
M.S., University of Illinois at Urbana-Champaign, 2002

DISSERTATION

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical and Computer Engineering

in the Graduate College of the University of Illinois at Urbana-Champaign, 2008

Urbana, Illinois

Doctoral Committee:

Assistant Professor Minh N. Do, Chair
Yoshihisa Shinagawa, Siemens Medical Solutions USA, Inc.
Professor Douglas L. Jones
Professor Thomas Huang
Professor Narendra Ahuja

ABSTRACT

In this dissertation, we study issues related to free-view 3D-video, and in particular issues of 3D scene reconstruction, compression, and rendering. We present four main contributions. First, we present a novel algorithm which performs surface reconstruction from planar arrays of cameras and generates dense depth maps with multiple values per pixel. Second, we introduce a novel codec for the static depth-image-based representation, which jointly estimates and encodes the unknown depth map from multiple views using a novel rate-distortion optimization scheme. Third, we propose a second novel codec for the static depth-image-based representation, which relies on a shape-adaptive wavelet transform and an explicit representation of the locations of major depth edges to achieve major rate-distortion gains. Finally, we propose a novel algorithm to extract the side information in the context of distributed video coding of 3D scenes.


To my families.


ACKNOWLEDGMENTS

This thesis would not have been possible without the help and inspiration from many people. First and foremost I would like to thank my thesis advisers, Professors Minh N. Do and Yoshihisa Shinagawa, for their guidance and precious advice during the course of this study. I am also grateful to my committee members – Professors Thomas Huang, Narendra Ahuja, and Douglas L. Jones – for the help they gave me in defining the scope of this thesis.

I would like to thank the mentors I had the pleasure to work with during my internships: Christine Guillemot and Luce Morin of the Irisa, and Michelle Yan, Yunqiang Chen, and Tong Fang of Siemens Corporate Research (SCR).

I would like to express my appreciation to Jean Tourret, Robert West, and Professors Yizhou Yu, Bruce Hajek, and Daniel M. Liberzon for the suggestions they offered me. I am grateful to my labmates at the Coordinated Science Laboratory (CSL), the Beckman Institute, SCR, and Irisa for their assistance and friendship, which made my stay in all these places most enjoyable.

I am also indebted to the staff of the Beckman Institute and CSL, and in particular to John M. Hart, Barbara Horner, and Dan R. Jordan, for having taken care of so many material details that made my work much easier. I would also like to express my gratitude to the University of Illinois in general, for providing such a rich and fulfilling research environment.

Although I never had the pleasure to meet them in person, I am grateful to Frank Herbert [1] and Jamiroquai [2] for helping me find a title for this thesis.

Finally, on the personal side, I would like to thank my three families: my wife, for her love and patience; my parents and sister, for their constant encouragement; and my parents-in-law, for all the enjoyable moments I spend with them.


TABLE OF CONTENTS

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . x

CHAPTER 1 INTRODUCTION . . . 1
1.1 Motivation . . . 1
1.2 Problem Statement . . . 3
1.3 Prior Art . . . 5
1.4 Thesis Outline . . . 5

CHAPTER 2 SYMMETRIC STEREO RECONSTRUCTION FROM PLANAR CAMERA ARRAYS . . . 8
2.1 Introduction . . . 8
2.2 Relation to Previous Work . . . 9
2.3 The Rectified Space . . . 11
2.3.1 Overview . . . 11
2.3.2 Rectification homographies . . . 13
2.4 Stereo Reconstruction . . . 15
2.4.1 Overview . . . 15
2.4.2 Geometric cost volume G(m,n) . . . 18
2.4.3 Photometric cost volume P(m,n) . . . 19
2.5 Global Surface Representation . . . 22
2.5.1 Layered depth image . . . 22
2.5.2 Sprites with depth . . . 23
2.6 Experimental Results . . . 25
2.7 Conclusion . . . 30

CHAPTER 3 WAVELET-BASED JOINT ESTIMATION AND ENCODING OF DIBR . . . 31
3.1 Introduction . . . 31
3.2 Problem Formulation . . . 35
3.3 Rate-Distortion Optimization . . . 41
3.3.1 Overview . . . 41
3.3.2 Reference view . . . 42
3.3.3 One-dimensional disparity map . . . 43


3.3.4 Dynamic programming . . . 46
3.3.5 Two-dimensional disparity map . . . 51
3.3.6 Bitrate optimization . . . 53
3.3.7 Quality scalability . . . 54
3.4 Experimental Results . . . 55
3.5 Conclusion . . . 64

CHAPTER 4 JOINT ENCODING OF THE DIBR USING SHAPE-ADAPTIVE WAVELETS . . . 65
4.1 Introduction . . . 65
4.2 Proposed Codec . . . 67
4.3 Shape-Adaptive Wavelet Transform . . . 68
4.4 Lifting Edge Handling . . . 72
4.5 Edge Representation and Coding . . . 73
4.6 Experimental Results . . . 75
4.7 Conclusion . . . 77

CHAPTER 5 3D MODEL-BASED FRAME INTERPOLATION FOR DVC . . . 79
5.1 Introduction . . . 79
5.2 3D Model Construction . . . 82
5.2.1 Overview . . . 82
5.2.2 Notation . . . 83
5.2.3 Camera parameter estimation . . . 84
5.2.4 Correspondence estimation . . . 87
5.2.5 Results . . . 90
5.3 3D Model-Based Interpolation . . . 91
5.3.1 Projection-matrix interpolation . . . 92
5.3.2 Frame interpolation based on epipolar blocks . . . 92
5.3.3 Frame interpolation based on 3D meshes . . . 94
5.3.4 Comparison of the motion models . . . 95
5.4 3D Model-Based Interpolation with Point Tracking . . . 97
5.4.1 Rationale . . . 97
5.4.2 Tracking at the decoder . . . 98
5.4.3 Tracking at the encoder . . . 99
5.5 Experimental Results . . . 100
5.5.1 Frame interpolation without tracking (3D-DVC) . . . 101
5.5.2 Frame interpolation with tracking at the encoder (3D-DVC-TE) . . . 102
5.5.3 Frame interpolation with tracking at the decoder (3D-DVC-TD) . . . 106
5.5.4 Rate-distortion performances . . . 107
5.6 Conclusion . . . 109
5.7 Acknowledgments . . . 110


CHAPTER 6 CONCLUSION . . . . . . . . . . . . . . . . . . . . . 111

APPENDIX A FIXING THE PROJECTIVE BASIS . . . . . . 112

APPENDIX B BUNDLE ADJUSTMENT . . . . . . . . . . . . . 113

APPENDIX C PUBLICATIONS . . . 115
C.1 Journals . . . 115
C.2 Conferences . . . 115
C.3 Research Reports . . . 116

REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

AUTHOR’S BIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . 125


LIST OF TABLES

2.1 Performances on the Middlebury dataset with two cameras (from top to bottom: percentage of erroneous disparities over all areas for the proposed method, percentage for the best method on each image, and ranks of the proposed method). . . . 27

2.2 Percentage of erroneous disparities over all areas on Tsukuba for several multicamera methods. The proposed method achieves competitive error rates and scales with the number of cameras. . . . 27

2.3 Number of disparity values in a standard disparity map and in an LDI, for various numbers of cameras on Tsukuba. Using an LDI and 25 cameras increases the area of reconstructed surfaces by almost 20%. . . . 29

3.1 Analysis and synthesis operators of the Laplace (L) and Sequential (S) transforms (see text for details). . . . 44

5.1 Average PSNR (in dB) of interpolated frames using lossless key frames. . . . 104


LIST OF FIGURES

1.1 Two state-of-the-art telepresence systems. This dissertation introduces methods aimed at enabling realistic, interactive, and large-scale telepresence systems (images reproduced from [3, 4]). . . . 2

1.2 Overview of the proposed 3D-video system. At each client, a 3D scene is recorded by multiple cameras whose data is compressed using images and depth models. This information is sent to the network, along with data from other input devices. The network aggregates across clients and performs physics-based simulation before sending the data back to the clients. Each client then renders the data and displays. Users are able to freely choose the viewpoint. . . . 4

1.3 Research areas related to free-view 3D-video. . . . 5

2.1 A few rays of light in the rectified 3D space: rays passing through the optical centers of camera (0, 0) (a) and camera (1, 0) (b). The rays are aligned with the voxel grid, which simplifies visibility computations. . . . 13

2.2 Rectification of four images from the toy sequence. After rectification, both the rows and the columns of the images are aligned. . . . 16

2.3 Camera MRF associated with a 2 × 4 camera array. Each node represents a camera with an observed image I and a hidden disparity map D. Edges represent fusion functions. . . . 16

2.4 A simple example demonstrating the behavior of the occlusion model. Perfect disparity maps are obtained in two iterations. . . . 20

2.5 Cropped disparity maps computed on Tsukuba with five cameras forming a cross. The proposed photometric cost reduces the disparity errors due to partial occlusions. . . . 20

2.6 The 3-layer LDI obtained on Tsukuba with 25 cameras. By treating all the cameras symmetrically, the proposed algorithm recovers large areas, which may be occluded in the central camera. . . . 22

2.7 Examples of sprites extracted from the LDI of Tsukuba with 25 cameras. Note the absence of occlusion on the cans. . . . 24

2.8 Disparity map obtained from the four rectified images of the toy sequence shown in Figure 2.2. . . . 25


2.9 Disparity maps obtained on the Middlebury dataset with two cameras. The occlusion model leads to sharp and accurate depth discontinuities. . . . 26

2.10 Number of disparity values per pixel on Tsukuba (black: no value, white: 3 values). The area of the reconstructed surfaces increases with the number of cameras. . . . 28

2.11 Cropped textures extracted from the LDIs of Tsukuba. Occlusions shrink when the number of cameras increases. . . . 29

3.1 Overview of the proposed codec: the encoder takes multiple views and jointly estimates and encodes a depth map together with a reference image (the DIBR). The output DIBR can be used to render free viewpoints. . . . 34

3.2 The spatial extent of a ROI (sphere) with one pair of image and depth map, along with seven views (cones). The central dark cone designates the reference view. The planes represent iso-depth surfaces (3D model reproduced with permission from Google 3D Warehouse). . . . 37

3.3 The projection of an iso-depth plane onto two views gives rise to a motion field between the two which is a 2D homography. . . . 38

3.4 An error matrix E from the Tsukuba image set with two optimal paths overlaid, λ = 0 (dashed) and λ = ∞ (solid). Lighter shades of gray indicate larger squared intensity differences. . . . 44

3.5 Dependency graph of a three-level L transform. The coefficients in bold are those included in the wavelet vector d. Gray nodes represent the MSE and rate terms of the RD optimization. The dashed box highlights the two-level L transform associated with (3.22). . . . 47

3.6 Dependency graph of a three-level S transform. The coefficients in bold are those included in the wavelet vector d. Gray nodes represent the MSE and rate terms of the RD optimization. . . . 47

3.7 Two divisions of the frequency plane and the associated graphs of dependencies between the coefficients of the S transform. . . . 52

3.8 The two sets of images used in the experiments. . . . 56

3.9 Disparity map of the Teddy set at four resolution levels, showing the resolution scalability of the wavelet-based representation. . . . 57

3.10 The DIBR of the Teddy set at three RD slopes corresponding to reference-view bitrates of 0.1 bpp, 0.5 bpp, and 1.0 bpp (from left to right). The S and L transforms generate disparity maps that degrade gracefully with the bitrate and contain less spurious noise than quadtrees or blocks. . . . 59


3.11 The DIBR of the Tsukuba set at three RD slopes corresponding to reference-view bitrates of 0.1 bpp, 0.5 bpp, and 1.0 bpp (from left to right). The S and L transforms generate disparity maps that degrade gracefully with the bitrate and contain less spurious noise than quadtrees or blocks. . . . 60

3.12 Views synthesized from the DIBR with a reference view encoded at 0.5 bpp and differences with the original views. At low quantization noise, the errors are mostly due to occlusions and disocclusions. . . . 61

3.13 Rate-distortion performances of the encoders based on wavelets (S and L transforms), quadtrees, and blocks. Wavelets are superior to quadtrees and blocks in the case of larger disparity ranges. . . . 62

3.14 RD loss due to quality-scalable coding. The loss remains limited over the whole range of bitrates. . . . 63

3.15 Fraction of the bitrate allocated to the disparity maps. Except at very low bitrates, the rate ratios are stable with values between 13% and 23%. . . . 63

4.1 Input data of the proposed DIBR codec: shared edges superimposed over a depth map (a) and an image (b). . . . 66

4.2 Overview of the proposed encoder. It relies on a SA-DWT and an edge coder (gray boxes) to reduce data correlations, both within and between the image and the depth map. . . . 68

4.3 Comparison of standard and shape-adaptive DWTs. In the latter case, all but the coarsest high-pass band are zero. . . . 70

4.4 The four lifting steps associated with a 9/7 wavelet, which transform the signal x first into a and then into y. The values x_{2t+2} and a_{2t+2} on the other side of the edge (dashed vertical line) are extrapolated. They have dependencies with the values inside the two dashed triangles. . . . 71

4.5 Example of the dual lattices of samples and edges. Each edge indicates the statistical independence of the two half rows or half columns of samples it separates. . . . 74

4.6 Absolute values of the high-pass coefficients of the depth map using standard and shape-adaptive wavelets. The latter provides a much sparser decomposition. . . . 76

4.7 Reconstruction of the depth map at 0.04 bpp using standard and shape-adaptive wavelets. The latter gives sharp edges free of Gibbs artifacts. . . . 77

4.8 Rate-distortion performances of standard and shape-adaptive wavelets. The latter gives PSNR gains of up to 5.46 dB on the depth map and 0.19 dB on the image. . . . 78

5.1 Outline of the codec without point tracking (3D-DVC). The proposed codec benefits from an improved motion estimation and frame interpolation (gray boxes). . . . 83


5.2 Outline of the 3D model construction. . . . 83

5.3 Correspondences and epipolar geometry between the two first lossless key frames of the sequences street and stairway. Feature points are represented by red dots, motion vectors by magenta lines ending at feature points, and epipolar lines by green lines centered at the feature points. . . . 91

5.4 Trifocal transfer for epipolar block interpolation. . . . 93

5.5 Outline of the frame interpolation based on epipolar blocks. . . . 94

5.6 Outline of the frame interpolation based on 3D meshes. . . . 95

5.7 Norm of the motion vectors between the first two lossless key frames of the stairway sequence for epipolar block matching (a), 3D mesh fitting (b), and classical 2D block matching (c). . . . 96

5.8 Outline of the codec with tracking at the decoder (3D-DVC-TD). . . . 98

5.9 Outline of the codec with tracking at the encoder (3D-DVC-TE). . . . 99

5.10 Correspondences and epipolar geometry between the two first key frames of the sequence statue. Feature points are represented by red dots, motion vectors by magenta lines ending at feature points, and epipolar lines by green lines centered at the feature points. . . . 101

5.11 PSNR of interpolated frames using lossless key frames (from top to bottom: sequences street, stairway, and statue). Missing points correspond to key frames (infinite PSNR). . . . 103

5.12 Correlation noise for GOP 1, frame 5 (center of the GOP) of the stairway sequence, using lossless key frames: 2D-DVC with classical block matching (a), 3D-DVC with mesh model and linear tracks (b), 3D-DVC-TE with mesh model and tracking at the encoder (c), and 3D-DVC-TD with mesh model and tracking at the decoder (d). The correlation noise is the difference between the interpolated frame and the actual WZ frame. . . . 104

5.13 PSNR of key frames and interpolated frames of the street sequence using 3D-DVC-TE with mesh fitting on lossy key frames. Peaks correspond to key frames. . . . 105

5.14 Variation of the subjective quality at QP = 26 between a key frame (frame 1) and an interpolated frame (frame 5). In spite of a PSNR drop of 8.1 dB, both frames have a similar subjective quality. . . . 106

5.15 Rate-distortion curves for H.264/AVC intra, H.264/AVC inter-IPPP with null motion vectors, H.264/AVC inter-IPPP, 2D-DVC I-WZ-I, and the three proposed 3D codecs (top left: street, top right: stairway, bottom: statue). . . . 108


CHAPTER 1

INTRODUCTION

1.1 Motivation

Travel by physical motion was the only way human beings originally had to experience and modify the world that surrounded them. However, physical motion suffers from several issues, the most conspicuous one probably being its slow speed. Even using the fastest planes, intercontinental flights still take several hours. Fatigue is also an issue. Long-haul travelers suffer from jet lag, and drivers lose their attention after just a few hours of driving. Moreover, physical motion has high energy requirements, especially from fossil fuels, which strain environmental and geopolitical equilibria.

Traveling without moving [1, 2] – that is, being able to go to any place instantly – would solve all these issues. Unfortunately, this has so far only been possible in works of fiction [1]. The next best thing is then virtual travel, where a proxy, say electromagnetic waves, moves while we stay in place. This is the fundamental idea behind telecommunication.

Through inventions like the telegraph, the telephone, the television, and the internet, to name a few, our ability to shift stimuli like sound and light from one place to another has been greatly improved. It has reached a point where both places, the local one and the distant one, can look as if they formed only one, like in the teleconference system shown in Figure 1.1(a). We are therefore getting closer to a complete telepresence experience, in which it would not be possible to distinguish the real world from its reproduced version.

(a) HP’s Halo: a highly realistic system with limited interactivity within room-size worlds. (b) Linden Lab’s Second Life: a highly interactive system with limited realism within large-scale worlds.

Figure 1.1: Two state-of-the-art telepresence systems. This dissertation introduces methods aimed at enabling realistic, interactive, and large-scale telepresence systems (images reproduced from [3, 4]).

At least three shortcomings remain in current telepresence technologies. First, they are not able to convey stereopsis, that is, the sensation of depth. Videos are still overwhelmingly limited to two spatial dimensions. Second, they do not offer mobility. Users usually have to view the distant environment from a fixed point of view, that of the camera, and cannot move inside this environment. Third, they provide limited interactivity, only shifting stimuli from one place to another. A full telepresence system would also shift actions, to let users modify the distant environment.

These shortcomings would be simple to solve if the distant environment were a virtual one, like those created for video games or the online 3D world shown in Figure 1.1(b). In virtual environments, the stimuli delivered to the users are rendered from a mathematical representation of these environments. Stereopsis is then simply achieved by rendering from two slightly different points of view, mobility amounts to applying a rigid transformation to the data, and interactivity is obtained by simulating the laws of physics and transforming the data in an appropriate manner.

Virtual environments have an additional advantage over real ones: the laws of physics which govern them can be freely designed. They therefore offer new possibilities, like providing safe learning environments where one can never get hurt, be injured, or die. In such environments, professionals like surgeons, chemists, firefighters, and military personnel can receive training which would not be possible in the real world.

The shortcoming of existing virtual environments lies in their lack of realism: they look too synthetic to make telepresence a believable experience. If we could find a way to integrate the data from telepresence systems inside virtual environments, or at least the most important data, we would obtain a trade-off achieving at the same time realism, stereopsis, mobility, and interactivity.

Developing such technologies would also be beneficial to other applications, including 3D television. This recent televisual technology, which conveys stereopsis to the viewers and may give some degree of freedom in the choice of the point of view, is seen as the next evolution of television after high definition. It has recently received a lot of interest from industry [5, 6], academic institutions [7–9], and standardization organizations [10].

1.2 Problem Statement

In this thesis, we focus on the problem of integrating real objects into virtual worlds, and study in particular its visual aspects. The major issue at hand is the massive amount of data needed to represent the visual characteristics of objects. Fortunately, the space in which visual representations lie, called the plenoptic function [11], has a strong structure that we can take advantage of to obtain manageable representations. We follow a hybrid geometric/photometric approach, which allows the scene to be recorded from a reduced number of cameras and enables compact data representations, at the expense of lesser realism.

Figure 1.2: Overview of the proposed 3D-video system. At each client, a 3D scene is recorded by multiple cameras whose data is compressed using images and depth models. This information is sent to the network, along with data from other input devices. The network aggregates across clients and performs physics-based simulation before sending the data back to the clients. Each client then renders the data and displays. Users are able to freely choose the viewpoint.

The proposed 3D video system includes the different components shown in Figure 1.2: multiple view recording, 3D scene reconstruction, compression for storage/transmission, rendering, and display. In this thesis, we focus on the aspects of 3D reconstruction, compression, and rendering. The main issue here comes from the ill-posed nature of the 3D reconstruction, which makes it difficult to obtain reliable 3D models able to efficiently approximate the plenoptic function.


Figure 1.3: Research areas related to free-view 3D-video.

1.3 Prior Art

As shown in Figure 1.3, free-view 3D-video is at the crossroads of multiple research areas, among which are digital imaging, computer vision, image and video processing, information theory, coding theory, computer graphics, and 3D displays.

There is a considerable amount of prior art in each of these research areas, and books are available on topics covering computer vision [12, 13], 3D reconstruction [14–16], information theory [17], image processing [18, 19], video processing [20, 21], and computer graphics [22, 23]. The prior art in the specific context of free-view 3D-video is much more limited. A comprehensive review is presented in [24].

1.4 Thesis Outline

This thesis describes four main contributions. In Chapter 2, we present a novel stereo algorithm that performs surface reconstruction from planar camera arrays. It incorporates the merits of both generic camera arrays and rectified binocular setups, recovering large surfaces like the former and performing efficient computations like the latter. First, we introduce a rectification algorithm which gives freedom in the design of camera arrays and simplifies photometric and geometric computations. We then define a novel set of data-fusion functions over 4-neighborhoods of cameras, which treat all cameras symmetrically and enable standard binocular stereo algorithms to handle arrays with an arbitrary number of cameras. In particular, we introduce a photometric fusion function that handles partial visibility and extracts depth information along both horizontal and vertical baselines. Finally, we show that layered depth images and sprites with depth can be efficiently extracted from the rectified 3D space. Experimental results on real images confirm the effectiveness of the proposed method, which reconstructs dense surfaces 20% larger than classical stereo methods on Tsukuba.

In Chapter 3, we propose a wavelet-based codec for the static depth-image-based representation (DIBR), which allows viewers to freely choose the viewpoint. The proposed codec jointly estimates and encodes the unknown depth map from multiple views using a novel rate-distortion (RD) optimization scheme. The rate constraint reduces the ambiguity of depth estimation by favoring piecewise-smooth depth maps. The optimization is efficiently solved by a novel dynamic programming along the tree of integer wavelet coefficients. The codec encodes the image and the depth map jointly to decrease their redundancy and to provide an RD-optimized bitrate allocation between the two. The codec also offers scalability both in resolution and in quality. Experiments on real data show the effectiveness of the proposed codec.

In Chapter 4, we present a novel codec of depth-image-based representations for free-viewpoint 3D-TV. The proposed codec relies on a shape-adaptive wavelet transform and an explicit representation of the locations of major depth edges. Unlike classical wavelet transforms, the shape-adaptive transform generates small wavelet coefficients along depth edges, which greatly reduces the data entropy. The codec also shares the edge information between the depth map and the image to reduce their correlation. The wavelet transform is implemented by shape-adaptive lifting, which enables fast computations and perfect reconstruction. Experimental results on real data confirm the superiority of the proposed codec, with PSNR gains of up to 5.46 dB on the depth map and up to 0.19 dB on the image compared to standard wavelet codecs.

Finally, in Chapter 5, we consider the reconstruction, compression, and rendering from a unique camera moving in a static 3D environment. In particular, we address the problem of side information extraction for distributed coding of videos. Two interpolation methods constrained by the scene geometry, i.e., based either on block matching along epipolar lines or on 3D mesh fitting, are first developed. These techniques are based on a sub-pel robust algorithm for matching feature points between key frames. The robust feature point matching technique leads to semidense correspondences between pairs of consecutive key frames. However, the underlying assumption of linear motion leads to misalignments between the side information and the actual Wyner-Ziv frames, which impacts the RD performances of the 3D model-based DVC solution. A feature point tracking technique is then introduced at the decoder to recover the intermediate camera parameters and cope with these misalignment problems. This approach, in which the frames remain encoded separately, leads to significant RD performance gains. The RD performances are then further improved by allowing limited tracking at the encoder. When the tracking is performed at the encoder, tracking statistics also serve as criteria for the encoder to adapt the key frame frequency to the video motion content.


CHAPTER 2

SYMMETRIC STEREO RECONSTRUCTION FROM PLANAR CAMERA ARRAYS

2.1 Introduction

Online metaverses have emerged as a way to bring an immersive and interactive 3D experience to a worldwide audience. However, the fully automatic creation of realistic content for these metaverses is still an open problem. The challenge is to achieve four goals simultaneously. First, the rendering quality must be high for the virtual world to look realistic. Second, the geometric quality must be sufficient to let physics-based simulation provide credible interactions between objects. Third, the computational complexity must be low enough to enable real-time rendering. Finally, the data must admit a compact representation to allow data streaming across networks.

In this chapter, we propose three contributions toward these goals. First, we introduce a special rectified 3D space and an associated rectification algorithm that handles planar arrays of cameras. It gives freedom in the design of camera arrays, so that their fields of view can be adapted to the scene being recorded. At the same time, rectification simplifies the reconstruction problem by making the coordinates of voxels and their pixel projections integers. This removes the need for further data resampling and simplifies changes of coordinate systems and visibility computations.

Second, we present a set of data-fusion functions that enable standard binocular stereo reconstruction [25] to handle arrays with an arbitrary number of cameras. Using one depth map per camera, the algorithm reconstructs large surfaces, up to 20% larger on Tsukuba, and therefore reduces the holes in novel-view synthesis. We introduce two Markov random fields (MRF), a classical one over the arrays of pixels and a novel one over the array of cameras. The latter lets us treat all the cameras symmetrically by defining fusion functions over 4-neighborhoods of cameras.

Finally, we introduce a global fusion algorithm that merges the depth maps into a unique layered depth image (LDI) [26], a rich but compact data representation made of a dense depth map with multiple values per pixel. We also show that the recovered LDI can be segmented fully automatically into sprites with depth [26]. Such sprites are related to geometry images, which can be efficiently rendered and compressed [27].

The remainder of the chapter is organized as follows. Section 2.2 presents the previous work, while Section 2.3 describes the rectified space and the rectification homographies. Section 2.4 follows with the proposed stereo reconstruction algorithm, and Section 2.5 with the creation of a global surface model. Finally, Section 2.6 presents the experimental results.

2.2 Relation to Previous Work

Surface reconstruction methods fall into two categories: those based on large generic camera arrays and those based on small rectified stereo setups, most often binocular, where the optical camera axes are normal to the baseline. The former [28, 29] handle rich depth information and can reconstruct large surfaces. However, the genericity of the camera locations makes visibility computations difficult and voxel projections computationally expensive.

In rectified stereo setups [25, 30, 31], on the other hand, visibility and projections are simple. These setups also allow efficient reconstruction algorithms based on maximum a-posteriori (MAP) inference over MRFs. However, the depth information extracted from the images tends to be quite poor, especially for linear arrays, which only take advantage of textures with significant gradients along their baseline. Moreover, the small number of cameras and the constrained viewing direction strongly limit the volume inside which depth triangulation is possible.

The constraint on the viewing direction can be removed using rectification, which trades view freedom for image distortion. So far, however, rectification has been limited to small stereo setups with two [32, 33] or three [34, 35] cameras. In this chapter, we introduce a special rectified 3D space and show that, when the problem is defined in terms of transformations between 3D spaces instead of alignment of epipolar lines, rectification can be generalized to planar arrays with an arbitrary number of cameras.

Camera arrays have access to much richer information than binocular setups. Quite surprisingly, however, the extra information can prove to be detrimental and actually reduce the quality of reconstructed surfaces [36]. The issue comes from partially visible voxels, whose number increases with the number of cameras. A number of methods tackle this issue [36–38]. However, most of them are asymmetric, choosing one camera as a reference. Cameras far apart tend to have fewer visible surfaces in common, which limits the number of cameras in the array and, as a consequence, the area of reconstructed surfaces. Moreover, many multiview stereo methods disregard the relative locations of the cameras when extracting the depth information from images [28, 39], which reduces the discriminative power of the extracted information.

In the proposed method, we rely on multiple depth maps, one per camera, and treat all the cameras symmetrically. Furthermore, we define a novel MRF over the camera array and take into account the relative locations of the cameras. This way, the proposed method handles arrays with an arbitrary number of cameras and extracts the depth information along both horizontal and vertical baselines.

Surface reconstruction based on multiple depth maps has already been studied in [39–41], but these methods lacked the proposed rectified 3D space, which led to costly operations to compute visibility, enforce intercamera geometric consistency, and merge depth maps.

The proposed extraction of sprites from LDIs is related to depth-map segmentation [42], with the added complexity of multiple depth values per pixel. Moreover, unlike [43], the segmentation is performed automatically and is not limited to planar surfaces.

2.3 The Rectified Space

2.3.1 Overview

We first consider the problem of rectifying the 3D space and the 2D camera images to simplify the stereo reconstruction problem. In the following, points are represented by homogeneous vectors, with $x \triangleq (x, y, 1)^\top$ denoting a point on the 2D image plane and $X \triangleq (x, y, z, 1)^\top$ a point in 3D space. Points are defined up to scale: $x$ and $\lambda x$ are equivalent for any nonnull scalar $\lambda$. This relation is denoted by the symbol '$\sim$'.

Under the pin-hole camera model [33], a 3D point $X$ and its projection $x$ onto an image plane are related by
$$x \sim P X \qquad (2.1)$$
where $P$ is a $3 \times 4$ matrix which can be decomposed as
$$P = K R \begin{pmatrix} I & -c \end{pmatrix} \qquad (2.2)$$
where $I$ is the identity matrix, $R$ the camera rotation matrix, $c$ the optical center, and $K$ the matrix of intrinsic parameters. All these parameters are assumed known.

The optical centers of the cameras are assumed to lie on a planar lattice, that is,
$$c = o + m v_1 + n v_2 \qquad (2.3)$$
where $o$ is the center of the grid, $v_1$ and $v_2$ are two noncollinear vectors, and $m$ and $n$ are two signed integers. The classical stereo pair is a special case of such an array. Since a pair (m, n) uniquely identifies a camera, we use it to index the cameras and denote by $C$ the set of pairs (m, n).

The proposed rectification consists in rotating the cameras and transforming the Euclidean 3D space using homographies. The rectified 3D space is defined as a space where the projection matrices $P^{(m,n)}$ take the special form
$$P^{(m,n)} = \begin{pmatrix} 1 & 0 & -m & 0 \\ 0 & 1 & -n & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}. \qquad (2.4)$$

It follows that, in the rectified space, a 3D point $X = (x, y, d, 1)^\top$ is related to its 2D projection $x^{(m,n)} = (x^{(m,n)}, y^{(m,n)}, 1)^\top$ on the image plane of camera (m, n) by the equations
$$x^{(m,n)} = x - md, \qquad y^{(m,n)} = y - nd. \qquad (2.5)$$

The 2D motion vectors of image points from camera (m, n) to camera (m', n') are equal to $d$ times the baseline $(m - m', n - n')^\top$. Therefore, the third coordinate $d$ of the rectified 3D space is a disparity, while the third coordinate $z$ of the Euclidean space is a depth.

The projection of an integer-valued point $X$ is also an integer. Moreover, the rays of light passing through the optical centers are parallel to one another and fall on integer-valued 3D points, as shown in Figure 2.1, which simplifies visibility computations.

Figure 2.1: A few rays of light in the rectified 3D space: rays passing through the optical centers of camera (0, 0) (a) and camera (1, 0) (b). The rays are aligned with the voxel grid, which simplifies visibility computations.
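To make the rectified projection concrete, the following minimal NumPy sketch (the function and variable names are ours, not the thesis's) builds the projection matrix of (2.4) and checks the pixel relations of (2.5): a point with disparity d seen by camera (m, n) is shifted by d times the baseline.

```python
import numpy as np

def rectified_projection_matrix(m, n):
    """Projection matrix of camera (m, n) in the rectified space, as in (2.4)."""
    return np.array([[1, 0, -m, 0],
                     [0, 1, -n, 0],
                     [0, 0,  0, 1]], dtype=float)

def project(X, m, n):
    """Project a homogeneous rectified 3D point X = (x, y, d, 1) onto camera (m, n)."""
    x = rectified_projection_matrix(m, n) @ X
    return x[:2] / x[2]                       # dehomogenize: (x - m*d, y - n*d)

X = np.array([40.0, 25.0, 3.0, 1.0])          # hypothetical voxel with disparity d = 3
print(project(X, 0, 0))                       # [40. 25.]
print(project(X, 1, 0))                       # [37. 25.] -> motion = d * baseline (1, 0)
```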

2.3.2 Rectification homographies

First, we need to recover the grid parameters $o$, $v_1$, and $v_2$ from the projection matrices $P^{(m,n)}$. From (2.3), we obtain the system of equations
$$\begin{pmatrix} I & mI & nI \end{pmatrix} \begin{pmatrix} o \\ v_1 \\ v_2 \end{pmatrix} = c^{(m,n)}, \qquad \forall (m, n) \in C. \qquad (2.6)$$

In the general case, this system is over-constrained and the vectors are obtained by least mean-square. When the cameras are collinear, one of the vectors is free to take any value. In that case, the constrained vector is computed by least mean-square and the free vector is chosen to limit the image distortion. To do so, the normal vector defined by the cross-product $v_1 \wedge v_2$ is set to the mean of the unit vectors on the optical axes. The free vector is then deduced by Gram-Schmidt orthogonalization.
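As an illustration of the grid-parameter recovery, here is a hedged NumPy sketch of the over-constrained case of (2.6): one block row is stacked per camera and the system is solved by least squares. The collinear special case and the Gram-Schmidt step described above are not handled, and all names are ours.

```python
import numpy as np

def fit_camera_grid(centers):
    """Least-squares fit of the grid parameters o, v1, v2 from camera centers,
    following (2.6). `centers` maps an index pair (m, n) to a 3-vector c^(m,n)."""
    I3 = np.eye(3)
    A = np.vstack([np.hstack([I3, m * I3, n * I3]) for (m, n) in centers])
    b = np.concatenate([np.asarray(c, dtype=float) for c in centers.values()])
    p, *_ = np.linalg.lstsq(A, b, rcond=None)
    return p[0:3], p[3:6], p[6:9]             # o, v1, v2

# Hypothetical slightly noisy 2x2 array of camera centers
centers = {(0, 0): [0.0, 0.0, 0.0], (1, 0): [1.0, 0.0, 0.01],
           (0, 1): [0.0, 1.0, 0.0], (1, 1): [1.02, 1.0, 0.0]}
o, v1, v2 = fit_camera_grid(centers)
```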

We define an intrinsic-parameter matrix $\hat{K}$ shared by all the rectified cameras as
$$\hat{K} \triangleq \begin{pmatrix} \hat{f} & 0 & 0 \\ 0 & \hat{f} & 0 \\ 0 & 0 & 1 \end{pmatrix} \qquad (2.7)$$
where $\hat{f}$ is the rectified focal length. We also define a matrix $V$ as $V \triangleq (v_1, v_2, v_1 \wedge v_2)$ and two 4D homography matrices $H_1$ and $H_2$ as
$$H_1 \triangleq \begin{pmatrix} \hat{K}V^{-1} & -\hat{K}V^{-1}o \\ 0 & \hat{f} \end{pmatrix}, \qquad (2.8)$$
$$H_2 \triangleq \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}. \qquad (2.9)$$
The rectified focal length $\hat{f}$ is chosen as the mean focal length of the actual cameras.

Multiplying (2.1) by $\hat{K}V^{-1}R^{(m,n)-1}K^{(m,n)-1}$, introducing $I = H_1^{-1}H_2^{-1}H_2H_1$ between $P$ and $X$, and using the relation $\hat{K}c^{(m,n)} = \hat{f}c^{(m,n)}$, we obtain
$$\hat{K}V^{-1}R^{(m,n)-1}K^{(m,n)-1}x^{(m,n)} \sim P^{(m,n)}H_2H_1X. \qquad (2.10)$$

By identification, we obtain the relations between Euclidean and rectified quantities
$$\hat{x}^{(m,n)} \sim \hat{K}V^{-1}R^{(m,n)-1}K^{(m,n)-1}x^{(m,n)}, \qquad (2.11)$$
$$\hat{X} \sim H_2H_1X, \qquad (2.12)$$
which are two homographies.

The reconstruction of surfaces in the Euclidean space via depth estimation in the rectified space is then a three-step process. First, images are rectified by applying the homography (2.11). Then 3D points are estimated in the rectified space by matching the rectified images. Finally, these 3D points are transferred back to the Euclidean space by inverting the homography (2.12). Figure 2.2 shows an example of rectified images.
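The three-step pipeline above can be summarized by the following NumPy sketch, which assembles the homographies of (2.8)-(2.9), the 2D rectification homography of (2.11), and the inverse mapping of (2.12). It assumes the grid parameters and the camera matrices are already available; the function names are ours.

```python
import numpy as np

def build_H1_H2(K_hat, V, o, f_hat):
    """4D homographies H1 and H2 of (2.8) and (2.9)."""
    KV = K_hat @ np.linalg.inv(V)
    H1 = np.block([[KV, -(KV @ o).reshape(3, 1)],
                   [np.zeros((1, 3)), np.array([[f_hat]])]])
    H2 = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [0, 0, 0, 1],
                   [0, 0, 1, 0]], dtype=float)
    return H1, H2

def image_rectification_homography(K_hat, V, R_mn, K_mn):
    """2D homography of (2.11): maps pixels of camera (m, n) to its rectified image."""
    return K_hat @ np.linalg.inv(V) @ np.linalg.inv(R_mn) @ np.linalg.inv(K_mn)

def rectified_to_euclidean(X_rect, H1, H2):
    """Invert (2.12): map a homogeneous rectified 3D point back to Euclidean space."""
    X = np.linalg.inv(H2 @ H1) @ X_rect
    return X / X[3]
```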

2.4 Stereo Reconstruction

2.4.1 Overview

We now turn to the stereo reconstruction. In this section, we assume that the images have been rectified and we drop the hat over mathematical symbols in the rectified space.

In order to reduce the computational complexity, the dependencies between cameras in the array are modeled using an MRF where each camera (m, n) is associated with an image $I^{(m,n)}$ and a disparity map $D^{(m,n)}$, as shown in Figure 2.3. Specifically, each value $D^{(m,n)}_{x,y}$ represents the disparity of a 3D point along the ray of light passing by pixel (x, y) in camera (m, n). At each camera, the dependencies between pixels are also modeled using an MRF. Stereo reconstruction then aims at inferring the hidden disparity maps from the observed images, relations between occupancy and visibility, unicity of the reconstructed scene, and the Markov priors.

An approximate solution is obtained by an iterative process, at the heart of which lie classical MAP-MRF inferences [30, 31, 41] applied independently on each camera. Each inference aims at solving an optimization of the form
$$\min_{D} \sum_{(x,y)\in\mathcal{P}} \left( P_{x,y,D_{x,y}} + \lambda_g\, G_{x,y,D_{x,y}} + S_{x,y}(D) \right) \qquad (2.13)$$
where $\mathcal{P}$ denotes the set of 2D pixels, $\lambda_g$ is a scalar weight, $S$ is a clique potential favoring piecewise-smoothness [30], and $P_{x,y,d}$ and $G_{x,y,d}$ are respectively photometric and geometric cost volumes.

Figure 2.2: Rectification of four images from the toy sequence [44]: (a) original images, (b) rectified images. After rectification, both the rows and the columns of the images are aligned.

Figure 2.3: Camera MRF associated with a 2 × 4 camera array. Each node represents a camera with an observed image I and a hidden disparity map D. Edges represent fusion functions.

The proposed algorithm alternates between inferences and cost-volume computations. Its novelty lies in the set of fusion functions computing the cost volumes. Due to the Markov assumption, the fusion functions are defined over 4-neighborhoods $N_4$, i.e., cross-shaped groups of five cameras, which usually contain rich depth information but only limited partial occlusions. The overall complexity of the proposed algorithm is linear in the size of the data.
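The sketch below, under our own assumptions, simply evaluates the objective of (2.13) for a candidate disparity map; the actual MAP-MRF inference of [30, 31, 41] that minimizes it is not reproduced here. The smoothness term is a generic truncated absolute difference over horizontal and vertical pixel cliques, standing in for the clique potential of [30].

```python
import numpy as np

def mrf_energy(D, P, G, lambda_g=1.0, lambda_s=1.0, tau=2):
    """Evaluate the per-camera objective of (2.13) for an integer disparity map D.
    P and G are (H, W, n_disp) photometric and geometric cost volumes.
    The smoothness term is a truncated absolute difference over horizontal and
    vertical pixel cliques (an assumed stand-in for the potential of [30])."""
    H, W = D.shape
    ys, xs = np.indices((H, W))
    data = P[ys, xs, D].sum() + lambda_g * G[ys, xs, D].sum()
    smooth = np.minimum(np.abs(np.diff(D, axis=0)), tau).sum() \
           + np.minimum(np.abs(np.diff(D, axis=1)), tau).sum()
    return data + lambda_s * smooth
```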

Although limited, partial occlusions tend to create large photometric costs at voxels on the surfaces, which leads to erroneous disparities. These outlier costs can be removed by explicit visibility modeling [38]. However, visibility depends on the surface geometry, which introduces a circular dependency. We solve this issue by introducing an implicit model of partial occlusions, which does not depend on the surface geometry.

Robust statistics over the four pairwise cliques of each camera 4-neighborhood can reduce the impact of outlier costs. However, classical robust statistics do not take into account the relative locations of the cameras and may fail to extract the depth information along both horizontal and vertical baselines, leading to photometric cost volumes with poor discriminative power.

Therefore, we propose a robust measure which strives to include the photometric costs from at least one vertical and one horizontal camera clique at each voxel. We do this by introducing an assumption we call “visibility by opposite neighbors”: a voxel visible by a camera (m, n) is also visible by at least one of its horizontal camera neighbors (m − 1, n) and (m + 1, n), and by at least one of its vertical camera neighbors (m, n − 1) and (m, n + 1). This assumption usually holds, except for instance for surfaces like picket fences or for cameras having fewer than four neighbors. In the following, we denote the quantities related to horizontal and vertical pairwise cliques by the superscripts h and v, respectively.

2.4.2 Geometric cost volume G(m,n)

The geometric cost volumes $G^{(m,n)}$ favor consistent disparity maps. In order to compute them, the disparity maps $D^{(m,n)}_{x,y}$ are first transformed into binary occupancy volumes $\delta^{(m,n)}_{x,y,d}$, whose voxels take value one when they contain surfaces. An occupancy volume $\delta^{(m,n)}_{x,y,d}$ is obtained by initializing it to zero, except at the set of voxels $\{(x, y, D^{(m,n)}_{x,y})\}$ where it is initialized to one.

Since all the occupancy volumes represent the same surfaces, they should be identical up to visibility and a change of coordinate system. Thanks to the rectification leading to (2.5), changing the coordinate system of a volume $\delta$ from camera (0, 0) to camera (m, n) is simply an integer 3D shear $\varphi^{(m,n)}$ given by
$$\varphi^{(m,n)}_{x,y,d}(\delta) = \delta_{x+md,\,y+nd,\,d}. \qquad (2.14)$$
A change of coordinate system between two arbitrary cameras is obtained by concatenating two 3D shears.
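A direct NumPy rendition of the shear (2.14) is sketched below, under our own conventions (volumes indexed as volume[y, x, d], zero padding for voxels that fall outside the array bounds).

```python
import numpy as np

def shear(volume, m, n):
    """Integer 3D shear of (2.14): move a volume from the coordinate system of
    camera (0, 0) to that of camera (m, n). volume has shape (H, W, n_disp),
    indexed as volume[y, x, d]; out-of-bounds voxels are filled with zeros."""
    H, W, n_disp = volume.shape
    out = np.zeros_like(volume)
    for d in range(n_disp):
        src = volume[:, :, d]
        # target (x, y) reads from source (x + m*d, y + n*d)
        shifted = np.zeros_like(src)
        y0, y1 = max(0, -n * d), min(H, H - n * d)
        x0, x1 = max(0, -m * d), min(W, W - m * d)
        shifted[y0:y1, x0:x1] = src[y0 + n * d:y1 + n * d, x0 + m * d:x1 + m * d]
        out[:, :, d] = shifted
    return out
```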

Let us consider camera (m, n) and shear the occupancy volumes of its 4-neighbors to its coordinate system. Using the assumption of visibility by opposite neighbors, erroneous occupancy voxels are removed using
$$\delta^{(m,n)}_{x,y,d} \leftarrow \delta^{(m,n)}_{x,y,d} \wedge \left( \delta^{(m+1,n)}_{x,y,d} \vee \delta^{(m-1,n)}_{x,y,d} \right) \wedge \left( \delta^{(m,n+1)}_{x,y,d} \vee \delta^{(m,n-1)}_{x,y,d} \right) \qquad (2.15)$$
where $\vee$ and $\wedge$ denote respectively the “or” and “and” operators.

The geometric cost volume is then computed as
$$G^{(m,n)}_{x,y,d} \leftarrow \begin{cases} 0, & \text{if } \delta^{(m,n)}_{x,y,d'} = 0,\ \forall d' \\ \min_{\delta^{(m,n)}_{x,y,d'} \neq 0} \min\left( |d - d'|,\ \tau_1 \right), & \text{otherwise} \end{cases} \qquad (2.16)$$
where $\tau_1$ is a threshold.
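The following sketch combines (2.15) and (2.16) under our own conventions: the neighbor occupancy volumes are assumed to have been sheared already (for instance with the shear helper sketched above), and the per-pixel distance computation is a plain loop kept for clarity rather than speed.

```python
import numpy as np

def geometric_cost(delta, delta_left, delta_right, delta_up, delta_down, tau1=4):
    """Geometric cost volume of (2.15)-(2.16). All occupancy volumes are boolean
    arrays of shape (H, W, n_disp) already sheared into the coordinate system of
    the current camera."""
    # (2.15): keep a voxel only if it is also occupied in at least one horizontal
    # and one vertical opposite neighbor ("visibility by opposite neighbors").
    delta = delta & (delta_left | delta_right) & (delta_up | delta_down)

    # (2.16): distance, clipped at tau1, to the nearest remaining occupied voxel
    # along the disparity axis; zero cost where a pixel has no occupied voxel.
    H, W, n_disp = delta.shape
    d = np.arange(n_disp)
    G = np.zeros(delta.shape, dtype=float)
    for y in range(H):
        for x in range(W):
            occupied = d[delta[y, x]]
            if occupied.size:
                dist = np.abs(d[:, None] - occupied[None, :]).min(axis=1)
                G[y, x] = np.minimum(dist, tau1)
    return G
```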

2.4.3 Photometric cost volume P (m,n)

The photometric cost volumes favor voxels with similar intensities across images. They are based on a truncated quadratic error measure [25], in which we introduce an outlier-removal process to discard errors from partially visible voxels. The outlier removal is based on a hybrid model with an implicit part, which does not need any occupancy information, and an explicit part, which takes advantage of the occupancy information when it becomes available. Figure 2.4 illustrates this occlusion model on a synthetic example and Figure 2.5 shows its impact on the disparity-map estimation.

The explicit model relies on the dependency between occupancy and visibility. Due to the nature of the rectified 3D space, a binary visibility volume $\nu^{(m,n)}$ can be computed from its associated occupancy volume $\delta^{(m,n)}$ using a simple recursion along the disparity axis
$$\nu^{(m,n)}_{x,y,d} \leftarrow \nu^{(m,n)}_{x,y,d+1} \wedge \neg\delta^{(m,n)}_{x,y,d+1} \qquad (2.17)$$
where $\neg$ denotes the “not” operator. The recursion is initialized by setting $\nu$ to one.
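The recursion (2.17) translates directly into the following sketch (boolean volumes indexed as [y, x, d], with larger d meaning closer to the camera; names are ours).

```python
import numpy as np

def visibility_from_occupancy(delta):
    """Binary visibility volume of (2.17): a voxel is visible from the camera if no
    occupied voxel lies at a larger disparity (i.e. closer) on the same ray.
    delta is a boolean occupancy volume of shape (H, W, n_disp)."""
    H, W, n_disp = delta.shape
    nu = np.ones_like(delta, dtype=bool)
    for d in range(n_disp - 2, -1, -1):       # recursion along the disparity axis
        nu[:, :, d] = nu[:, :, d + 1] & ~delta[:, :, d + 1]
    return nu
```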

In the following, we only detail the computation of quantities related to horizontal cliques. The vertical ones are obtained by a similar reasoning. The computations are conducted independently at each voxel, so we drop the subscript (x, y, d). We define $I^{(m,n)}$ as the intensity volume obtained by replicating the image $I^{(m,n)}$ along the disparity axis.

Figure 2.4: A simple example demonstrating the behavior of the occlusion model: (a) three images of two fronto-parallel planes, a dark square in front of a bright background; (b) photometric cost at iteration 1, where the implicit model removes partial occlusions in camera 1 and limits their impact in cameras 0 and 2; (c) photometric cost at iteration 2, where the explicit model removes partial occlusions in all the cameras; (d) disparity maps at iteration 1, where errors remain on cameras 0 and 2; (e) disparity maps at iteration 2, where no error remains. Perfect disparity maps are obtained in two iterations.

Figure 2.5: Cropped disparity maps computed on Tsukuba with five cameras forming a cross: (a) truncated quadratic cost, (b) proposed cost. The proposed photometric cost reduces the disparity errors due to partial occlusions.

Let us consider the camera (m, n) and its 4-neighborhood. Using (2.14), the intensity and visibility volumes are sheared to the coordinate system of camera (m, n). From the truncated quadratic error model and the assumption of visibility by opposite neighbors, a horizontal error volume $E^{h(m,n)}$ is computed as
$$E^{h(m,n)} = \min\left( \left( I^{(m,n)} - I^{(m-1,n)} \right)^2,\ \left( I^{(m,n)} - I^{(m+1,n)} \right)^2,\ \tau_2 \right) \qquad (2.18)$$
where $\tau_2$ is a threshold.

The photometric cost $E^{h(m,n)}$ may still contain large values when the assumption of visibility by opposite neighbors is violated. Therefore, we further discard outliers by explicitly computing visibility. Using De Morgan's laws, the validity of the costs is computed as
$$V^{h(m,n)} = \neg\nu^{(m,n)} \vee \nu^{(m-1,n)} \vee \nu^{(m+1,n)}. \qquad (2.19)$$

We now have two pairs of error and validity volumes, $(E^{h(m,n)}, V^{h(m,n)})$ horizontally and $(E^{v(m,n)}, V^{v(m,n)})$ vertically. In order to create a photometric cost volume which includes the depth information from both vertical and horizontal texture gradients, we define this cost volume as the weighted average
$$P^{(m,n)} = \frac{V^{h(m,n)} E^{h(m,n)} + V^{v(m,n)} E^{v(m,n)}}{V^{h(m,n)} + V^{v(m,n)}}, \qquad (2.20)$$
which is only defined when at least one of the validity volumes takes a nonzero value. Values at voxels where this is not the case are obtained by interpolation.
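A compact NumPy sketch of (2.18)-(2.20), under our assumptions about array layout, is given below; as in the geometric case, the neighbor volumes are taken to be already sheared to the current camera, and undefined voxels are simply marked with NaN for later interpolation.

```python
import numpy as np

def photometric_cost(I, I_left, I_right, I_up, I_down,
                     nu, nu_left, nu_right, nu_up, nu_down, tau2=50.0):
    """Photometric cost volume of (2.18)-(2.20). Intensity volumes (images
    replicated along the disparity axis) and boolean visibility volumes are
    assumed already sheared into the coordinate system of the current camera."""
    # (2.18): truncated quadratic error, keeping the best of the two opposite neighbors.
    Eh = np.minimum(np.minimum((I - I_left) ** 2, (I - I_right) ** 2), tau2)
    Ev = np.minimum(np.minimum((I - I_up) ** 2, (I - I_down) ** 2), tau2)

    # (2.19): a cost is valid if the voxel is occluded in the current camera or
    # visible in at least one of the two opposite neighbors.
    Vh = (~nu) | nu_left | nu_right
    Vv = (~nu) | nu_up | nu_down

    # (2.20): validity-weighted average of the horizontal and vertical errors.
    w = Vh.astype(float) + Vv.astype(float)
    P = np.where(w > 0, (Vh * Eh + Vv * Ev) / np.maximum(w, 1), np.nan)
    return P                                  # NaNs mark voxels left to interpolation
```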


Figure 2.6: The 3-layer LDI obtained on Tsukuba with 25 cameras: (a) texture, (b) disparity. By treating all the cameras symmetrically, the proposed algorithm recovers large areas, which may be occluded in the central camera.

2.5 Global Surface Representation

2.5.1 Layered depth image

Using the special nature of the rectified 3D space, we present a simple and efficient procedure to merge the multiple disparity maps into a unique LDI [26]. The LDI offers a compact and global surface representation. Figure 2.6 shows an example of an LDI.

To begin with, the disparity maps $D^{(m,n)}$ are transformed into occupancy volumes $\delta^{(m,n)}$ as detailed in Section 2.4.2. These volumes are then sheared to a reference coordinate system, for instance the one of camera (0, 0).

The disparity layers are extracted in a front-to-back order by voting. Visibility volumes $\nu^{(m,n)}$ are computed from their associated occupancy volumes using (2.17) and an aggregation volume $A$ is obtained using
$$A_{x,y,d} = \sum_{(m,n)\in C} \nu^{(m,n)}_{x,y,d}\, \delta^{(m,n)}_{x,y,d}. \qquad (2.21)$$
A disparity layer $D$ is extracted by selecting the voxels with the largest aggregation values along the disparity axis. These voxels are then removed from the occupancy volumes and the process is repeated until no occupied voxel remains.
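The voting scheme of (2.21) and the front-to-back layer extraction can be sketched as follows (our code; the occupancy volumes are assumed to be already sheared to the reference camera, and the inline visibility computation is just a vectorized form of the recursion (2.17)).

```python
import numpy as np

def extract_ldi_layers(occupancies):
    """Greedy LDI extraction of Section 2.5.1. `occupancies` is a list of boolean
    volumes of shape (H, W, n_disp), one per camera, already sheared to the
    reference coordinate system. Returns a list of disparity-layer maps
    (-1 where a pixel has no value left)."""
    occupancies = [o.copy() for o in occupancies]
    layers = []
    while any(o.any() for o in occupancies):
        # (2.21): aggregate votes from voxels that are both occupied and visible.
        A = np.zeros(occupancies[0].shape, dtype=int)
        for delta in occupancies:
            closer = np.cumsum(delta[:, :, ::-1], axis=2)[:, :, ::-1] - delta
            nu = closer == 0                  # visibility, equivalent to (2.17)
            A += (nu & delta).astype(int)
        # pick, at each pixel, the disparity with the largest aggregated vote
        D = np.where(A.max(axis=2) > 0, A.argmax(axis=2), -1)
        layers.append(D)
        # remove the selected voxels from every occupancy volume and repeat
        ys, xs = np.nonzero(D >= 0)
        for delta in occupancies:
            delta[ys, xs, D[ys, xs]] = False
    return layers
```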

2.5.2 Sprites with depth

Due to the smoothness term S in (2.13), the layers of the LDI are piecewise smooth. They can be converted to smooth sprites with depth by selecting regions of the LDI which do not contain discontinuities and which introduce as few new boundaries in continuous regions as possible. The extent of these regions may spread over multiple layers of the LDI. Figure 2.7 shows some examples of sprites.

Before the sprite extraction begins, the disparities are transformed into depths using (2.12), so that discontinuities are located in the Euclidean space used for rendering. A sprite is defined as a depth map D and a binary alpha map α, which takes value one inside the sprite. We focus here on the automatic extraction of sprite masks. Refinement techniques leading to high-quality textures have been addressed elsewhere [43] and are beyond the scope of this chapter.

The sprites are extracted one at a time. First, an edge detection is performed on the depth map, followed by a distance transform and a watershed segmentation [45]. The sprite alpha map is then initialized to the largest watershed region and the sprite depth map is set to the LDI depth map inside this region.

The sprite is updated by looping through the layers of the LDI and solving a MAP-MRF inference each time, until convergence. The pixels inside the sprite are then removed from the LDI, the newly visible pixels moved to the first layer, and the process repeated.

Figure 2.7: Examples of sprites extracted from the LDI of Tsukuba with 25 cameras. Note the absence of occlusion on the cans.

The MAP-MRF inference proceeds as follows. Let $D^{(LDI)}$ and $\alpha^{(LDI)}$ be respectively the depth map and the binary alpha map of the current LDI layer. The sprite and the LDI layer are first fused together to form $\bar{D}$ and $\bar{\alpha}$ such that
$$\bar{\alpha}_{x,y} = \alpha_{x,y} \vee \alpha^{(LDI)}_{x,y}, \qquad \bar{D}_{x,y} = \alpha_{x,y} D_{x,y} + (1 - \alpha_{x,y}) D^{(LDI)}_{x,y}. \qquad (2.22)$$

At each pixel (x, y), we define a likelihood $p_{x,y}$ of belonging to the sprite and we model its dependencies by an MRF. The likelihoods inside the sprite mask are fixed to one and three transition functions are defined:
$$p_{x',y'} = \begin{cases} (1 - 2\rho_0)\,p_{x,y} + \rho_0 & \text{where smooth,} \\ (1 - 2\rho_1)\,p_{x,y} + \rho_1 & \text{at small depth differences,} \\ \min\left(1 - p_{x,y},\ 1/2\right) & \text{at discontinuities,} \end{cases}$$
where $\rho_0$ and $\rho_1$ are two transition likelihoods with $0 \leq \rho_0 < \rho_1 \leq 1/2$. The third transition function states that at a discontinuity:

• if one side belongs to the sprite, the other one does not;
• if one side does not belong to the sprite, there is no constraint on the other side.

Once the inference has been solved, the sprite alpha map is set to one where $p$ is greater than 1/2 and the sprite depth map is updated accordingly.

Figure 2.8: Disparity map obtained from the four rectified images of the toy sequence shown in Figure 2.2.
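For illustration, the fusion step (2.22) and the three transition functions can be written as the short sketch below; the surrounding MAP-MRF inference and the watershed initialization are not reproduced, and the names and the numeric values of ρ0 and ρ1 are our own placeholders.

```python
import numpy as np

def fuse_sprite_with_layer(D_sprite, alpha_sprite, D_ldi, alpha_ldi):
    """Fusion step of (2.22): merge the current sprite with an LDI layer."""
    alpha_bar = alpha_sprite | alpha_ldi
    D_bar = np.where(alpha_sprite, D_sprite, D_ldi)
    return D_bar, alpha_bar

def transition(p, kind, rho0=0.05, rho1=0.25):
    """Likelihood transition from a pixel to its neighbor, following the three
    cases of Section 2.5.2 (smooth area, small depth difference, discontinuity)."""
    if kind == "smooth":
        return (1 - 2 * rho0) * p + rho0
    if kind == "small_depth_difference":
        return (1 - 2 * rho1) * p + rho1
    return min(1 - p, 0.5)                    # discontinuity
```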

2.6 Experimental Results

First, the rectification and stereo reconstruction algorithms are validated on four images from the toy sequence [44]. The four cameras form a 2 × 2 array with nonparallel optical axes and nonsquare cells. Figure 2.2 shows the output of the rectification algorithm. Rectification aligns the rows and columns of the images and introduces a limited amount of distortion. Figure 2.8 shows the disparity map obtained by the proposed stereo reconstruction algorithm after five iterations. The geometry of the scene appears clearly.

The stereo reconstruction is then tested on the binocular sequences of the Mid-

dlebury dataset [25]. In this case, the configuration of the cameras is such that

rectification does not introduce any image distortion. Figure 2.9 shows the dispar-

ity maps obtained by the proposed method using fixed parameters. The proposed


Figure 2.9: Disparity maps obtained on the Middlebury dataset with two cameras. The occlusion model leads to sharp and accurate depth discontinuities.

method performs consistently well over the set of sequences. In particular, it does

not suffer from foreground fattening [25]: occlusion modeling, geometric consis-

tency, and piecewise smoothness lead to disparity maps with discontinuities which

are both sharp and accurately located. The disparity maps contain few errors,

mostly located on the left and right image borders, where less depth information

is available.

Since the ground truth is known for this dataset, we also present numerical

performance results in Table 2.1. The error rates of the proposed method are close

to those of the best binocular methods.

Unlike binocular methods, however, the proposed method scales with the num-

ber of cameras. Table 2.2 presents the error rates of the proposed algorithm and


Table 2.1: Performances on the Middlebury dataset with two cameras (from top to bottom: percentage of erroneous disparities over all areas for the proposed method, percentage for the best method on each image, and ranks of the proposed method).

                  Tsukuba   Venus   Teddy   Cones
Proposed method   1.53      1.04    10.9    8.65
Best method       1.29      0.21    6.54    7.86
Rank              3         13      6       6

Table 2.2: Percentage of erroneous disparities over all areas on Tsukuba for several multicamera methods. The proposed method achieves competitive error rates and scales with the number of cameras.

2 cameras    Proposed                           1.5
             New Kolmogorov, Zabih, 2005 [36]   2.2
             Wei, Quan, 2005 [36]               2.7
5 cameras    Proposed                           1.3
             New Kolmogorov, Zabih, 2005 [36]   1.3
             Wei, Quan, 2005 [36]               1.3
             Drouin et al., 2005 [38]           2.2
             Kolmogorov, Zabih, 2002 [46]       2.3
25 cameras   Proposed                           1.3

several multiview algorithms on Tsukuba [25] under three camera configurations:

2 cameras forming a 1 × 2 binocular configuration, 5 cameras forming a cross within a 3 × 3 array, and 25 cameras forming a 5 × 5 square.

The proposed method achieves state-of-the-art results in both the 2 and 5

camera cases. Moreover, it scales to 25 cameras and handles well the increased

amount of partial occlusions. From these results, it seems that it is advantageous

to switch from 2 to 5 cameras, but that little gain is achieved by further increasing

the number of cameras to 25.

The real gain from the 25 camera array comes from the increased volume in

which stereo reconstruction takes place. Figure 2.6 shows the LDI obtained from

such an array. This LDI has three layers, which means that the rays of light

originating from the optical center intersect the surfaces up to three times.

Figure 2.10 and Table 2.3 show the evolution of the LDI density as a function


Figure 2.10: Number of disparity values per pixel on Tsukuba (black: no value, white: 3 values), for (a) 2 cameras, (b) 5 cameras, and (c) 25 cameras. The area of the reconstructed surfaces increases with the number of cameras.

of the number of cameras. The number of disparity values increases by nearly 20%

when switching from a single disparity map to a 25-camera LDI. This behavior

is confirmed by Figure 2.11, which shows the texture of the objects on the table

recovered using 2, 5, and 25 cameras. The texture area steadily increases with the

number of cameras, which would reduce the size of holes in renderings from novel

viewpoints. Since large parts of the textures are not visible in the central camera,

they would not have been recovered by stereo algorithms relying on a reference

image.


Table 2.3: Number of disparity values in a standard disparity map and in an LDI, for various numbers of cameras on Tsukuba. Using an LDI and 25 cameras increases the area of reconstructed surfaces by almost 20%.

                 Number of disparity values   Relative increase
Disparity map    106 × 10^3                    0.0%
LDI, 2 cam.      108 × 10^3                   +2.1%
LDI, 5 cam.      116 × 10^3                   +9.7%
LDI, 25 cam.     127 × 10^3                   +19.4%

Figure 2.11: Cropped textures extracted from the LDIs of Tsukuba with (a) 2 cameras, (b) 5 cameras, and (c) 25 cameras. Occlusions shrink when the number of cameras increases.


2.7 Conclusion

In this chapter, we have first presented a novel rectification algorithm that han-

dles planar camera arrays of any size and greatly simplifies the reconstruction of

3D surfaces. Second, we have introduced a stereo reconstruction method that

treats all cameras symmetrically and scales with the number of cameras. Finally,

we have presented novel algorithms to merge the estimated disparity maps into

layered depth images and sprites with depth. We have validated the proposed

methods by experimental results on arrays with various camera configurations and

reconstructed dense surfaces 20% larger than classical stereo methods on Tsukuba.

Future work will consider multiple planar arrays to obtain closed surfaces.


CHAPTER 3

WAVELET-BASED JOINT ESTIMATION AND ENCODING OF DIBR

3.1 Introduction

Free-viewpoint three-dimensional television (3D-TV) aims at providing an en-

hanced viewing experience not only by letting viewers perceive the third spatial

dimension via stereoscopy but also by allowing them to move inside the 3D video

and freely choose the viewing location they prefer [47]. The free-viewpoint ap-

proach is also useful for multiuser autostereoscopic 3D displays [48], which have to

generate a large number of viewpoints.

The fundamental problem posed by 3D-TV lies in the massive amount of data

required to represent the set of all possible views or, equivalently, the set of all light

rays in the scene. This set of light rays, called the plenoptic function [11], lies in

general in a seven-dimensional space. Each light ray travels along a line, which is

described by a point (three dimensions), an angular orientation (two dimensions),

and a time instant (one dimension). The last dimension describes the spectrum,

or color, of the light rays. By comparison, 2D videos only lie in a four-dimensional

space made of two angles, time, and color. Therefore, 3D-TV requires the design

of a novel video chain [47].

A large number of methods have been proposed to record and encode the

plenoptic function [49]. They widely differ in the amount of 3D geometry used

to encode the data, which ranges from no geometry at all (e.g., light field) to an

extremely accurate geometry (e.g., texture mapping). On the one hand, relying on


the geometry has the advantage of requiring fewer cameras to record the plenoptic

function and allowing the reduction of redundancies between the recorded views [9,

50]. On the other hand, using the geometry has the drawback of limiting the

realism of the synthesized views and requiring a difficult estimation of the 3D

geometry. Indeed, passive 3D geometry estimation from multiple views suffers

from ambiguities, while estimation based on active lighting has only a narrow

scope of application [47].

An efficient trade-off on the 3D geometry, called the depth-image-based rep-

resentation (DIBR), consists in approximating the plenoptic function using pairs

of images and depth maps [8]. Now part of the MPEG-4 standard [10, 51], this

representation allows arbitrary views to be rendered in the vicinity of these pairs.

Since depth maps tend to have lower entropies than images, the DIBR leads to

compact bitstreams. Moreover, realistic images can be synthesized from the DIBR

using image-based rendering (IBR) and depth maps do not need to be estimated

extremely accurately, as long as the viewpoint does not change too much.

Encoding the DIBR presents two difficulties. First, the depth maps are un-

known. Therefore, not only do they have to be encoded, but they also have to

be estimated. Second, the relation between the depth maps and the distortion of

the plenoptic function is highly nonlinear, which makes the rate-distortion (RD)

optimization difficult. In particular, finding an optimal bitrate allocation between

images and depth maps is nontrivial.

A number of methods have avoided these issues by excluding depth maps from

the RD problem. For instance, in [50, 52] depth maps are obtained using block-

based depth estimation, essentially a motion estimation, and encoded in a lossless

fashion. As an alternative to blocks, depth can also be estimated using meshes [53,

54] or pixel-wise regularization [8]. However, in such methods the image encoder


and the depth encoder operate at different RD slopes, which penalizes the overall

codec efficiency and makes it difficult to optimally allocate the bitrate [55].

A more principled approach consists in linearizing the RD problem [56,57] using

Taylor series expansions and statistical analysis. It has the advantage of leading to

closed-form expressions and allowing a theoretical analysis of the problem. How-

ever, linearization is only valid when the depth-approximation errors are small.

Another way of handling the nonlinearity is to assume that depth maps take

a finite number of discrete values. Under some constraints on the dependencies

between depth values, globally optimal solutions can be found using dynamic pro-

gramming [58]. For instance, optimal solutions exist when depth maps are encoded

using differential pulse code modulation (DPCM) [59] or quadtrees [60]. This ap-

proach does not require any ground truth; the estimation and encoding of the depth

maps are carried out jointly. It also takes advantage of the bitrate constraint to

favor smooth depth maps, much as ad-hoc smoothness terms do in computer vi-

sion [25], which reduces the ambiguity of the estimation.

In this chapter, we propose a new wavelet-based DIBR codec which performs

an RD-optimized encoding of multiple views. It differs from classical wavelet-based

codecs in that part of the data to be transformed (i.e., the depth map) is unknown.

Here, as shown in Figure 3.1, both the depth estimation and the depth and image

encoding are performed jointly. Although the problem is nonlinear, we present a

codec able to efficiently find optimal solutions without resort to linearization. We

show that when the depth maps are represented using special integer wavelets,

their joint estimation and coding via RD-optimization can be efficiently solved us-

ing dynamic programming (DP) along the tree of wavelet coefficients. The DP we

introduce in this chapter differs from that of quadtrees [61], as discussed in Sec-

tion 3.3.4. The RD-optimization of the integer wavelets favors piecewise-smooth

depth maps, which reduces the estimation ambiguity and leads to compact repre-


Figure 3.1: Overview of the proposed codec: the encoder takes multiple views and jointly estimates and encodes a depth map together with a reference image (the DIBR). The output DIBR can be used to render free viewpoints.

sentations of the data. The joint encoding of the images and depth maps provides

an RD-optimized bitrate allocation. Furthermore, using the fact that depth dis-

continuities usually happen at image edges, it reduces the redundancies between

depth maps and images by coding the two wavelet significance maps only once.

In addition, the proposed codec offers scalability both in resolution, using

wavelets, and in quality, using quality layers. The former allows servers to ef-

ficiently stream data to display devices with inhomogeneous display resolutions

and inside online virtual 3D worlds, where the DIBR may actually only cover a

small portion of the display due to its distance to the viewpoint. The latter lets

servers efficiently stream data over networks with inhomogeneous capabilities. In

both cases, the RD point is chosen on the fly at the server by truncating the

bitstream [19].

There is a close relation between depth maps and 2D motion fields: depth

maps define 3D surfaces, whose projection onto image planes gives rise to motion

fields. Therefore, the techniques designed to solve the RD problem of classical 2D

video coding [62] can usually also be applied to DIBR. Among these techniques,

those described in [63, 64] are related to the proposed wavelet-based coding. In

these codecs, images are split into blocks of variable sizes using quadtrees, and the

motion vectors are DPCM coded. Like our codec, they achieve global optimality


using dynamic programming. However, besides not being scalable, their complexity

is exponential in the block sizes, which limits the range of block sizes they can

handle. The complexity of our proposed codec is only linear in the number of

wavelet decomposition levels, due to the special tree structure we introduce.

The remainder of the chapter is organized as follows. Section 3.2 presents

the RD problem at hand, while Section 3.3 details the optimization of the DIBR.

Finally, Section 3.4 presents our experimental results.

3.2 Problem Formulation

First, we define the RD problem that will be solved by the proposed codec. As

illustrated in Figure 3.1, the encoder takes a set of synchronized views as input and

represents them using the DIBR. The decoder receives the DIBR and synthesizes

novel views at 3D locations chosen by the viewers.

The DIBR consists of a subset of the views, called reference views, along

with unknown depth maps. In the following, we limit our study to the case of

static grayscale views. In this case, the DIBR provides an approximation to five-

dimensional plenoptic functions with three spatial dimensions and two angular

dimensions. Since the DIBR only offers a local approximation of the plenoptic

function, the viewers are free to choose arbitrary viewpoints, but only inside a

region of interest (ROI). A natural choice for the shape of the ROI is to take the

union of a set of hypervolumes made of 3D spheres in space and 2D discs in angle,

with one hypervolume associated to each pair of image and depth map. Since the

approximation does not usually degrade abruptly when the distance increases, the

decoder could actually enforce a “soft” ROI boundary by discouraging the viewer

from choosing a viewpoint outside of the ROI without forbidding it.

The distortion introduced by the codec is measured using the mean-square


error (MSE) between the recorded views and the views rendered from the DIBR.

Denoting the vth recorded view and its rendered counterpart respectively by the column vectors I_v and Î_v obtained by stacking all the pixels together, the distortion can be written as

(1 / (N_m N_n N_v)) Σ_{v=0}^{N_v−1} ‖I_v − Î_v‖_2^2    (3.1)

where ‖·‖_2 denotes the 2-norm, N_m and N_n are respectively the number of rows and columns in the views, and N_v is the number of views. We denote by N ≜ N_m N_n N_v

the total number of pixels.

The decoder renders novel views using the nearest pair of image and depth

map. This coding scheme is similar to the encoding Intra (I) and Predictive (P)

in the MPEG standard [62]: reference views are I frames while all the other views

are P frames. The total distortion is then the sum of the distortions associated

with each pair of image and depth map. Likewise, the pairs of images and depth

maps are encoded independently of one another, so that the total bitrate is also

the sum of the bitrates associated with each pair. A differential encoding could

increase the bitrate savings but would at the same time reduce the ability of the

decoder to access views randomly [49].

As a consequence, the RD problem can be solved for each pair of image and

depth map independently. Without loss of generality, the remainder of this chapter

only considers the case where a unique pair is encoded and the reference view is

indexed by v = 0.

The quantized depth map takes a finite number of discrete values, which define a

set of iso-depth planes, as shown in Figure 3.2. Each plane induces a special motion

field between the reference view and an arbitrary view which is a homography [14],

as shown in Figure 3.3. This class of motion fields has the property of transforming

quadrilaterals into quadrilaterals and includes affine transforms as a special case.


Figure 3.2: The spatial extent of a ROI (sphere) with one pair of image and depth map, along with seven views (cones). The central dark cone designates the reference view. The planes represent iso-depth surfaces (3D model reproduced with permission from Google 3D Warehouse).

In the particular case of rectified views [14], the motion vectors are parallel to the

baseline of the pair of views.

In this framework, the depth estimation is formulated in terms of disparities,

which are inversely proportional to depths. Disparities are better suited to the

geometry of the problem at hand. They take into account the decreasing accuracy

of the depth estimation as depth increases and they are equal to motion vectors in

the case of rectified views.

Both the reference view and the disparity map are encoded in a lossy manner.

Let us denote the encoded reference view by the vector Î_0 and the jointly estimated and encoded disparity map by the vector δ. The view I_v is approximated by forward motion compensation of the reference view Î_0 using the estimated disparity map δ, an operation denoted by M^f_v(Î_0; δ) where the f stands for ‘forward.’

The forward motion compensation is performed using an accumulation buffer

and multiple texture-mapping operations, which benefit from hardware accelera-

tion [65]. The accumulation buffer consists of a memory buffer which is initially

empty and progressively filled by the intensity values of texture-mapped views.


(a) A reference view s = 0 along with an arbitrary view s = 1 and an iso-depth plane.

(b) The two views and the associated motion fields.

Figure 3.3: The projection of an iso-depth plane onto two views gives rise to a motion field between the two which is a 2D homography.


For each disparity value d, the following three steps are taken:

- A binary mask m(δ, d) is defined, which takes value one at pixels with disparity value d.

- The homography associated with the disparity value d is applied to both the reference view Î_0 and the mask m(δ, d) using texture mapping.

- The values of the pixels in the accumulation buffer for which the motion-compensated mask is one are replaced by those in the motion-compensated view.

In order to enforce occlusion relations between the iso-depth planes, the dispar-

ities are processed in decreasing depth order. The issues of resampling and hole

filling are solved using bilinear interpolation and texture propagation by Poisson

equation [66].
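A minimal sketch of this accumulation-buffer rendering is given below for the rectified case, where each homography degenerates to a horizontal shift by the disparity; the texture mapping, resampling, and Poisson hole filling of the actual renderer are omitted, and the shift direction is an assumption of the sketch.

```python
import numpy as np

def forward_motion_compensation(ref_view, disparity, num_disparities):
    """Warp a reference view into a target view, one iso-disparity plane at a time.

    Planes are processed in decreasing depth order, i.e. increasing disparity,
    so that nearer surfaces overwrite farther ones in the accumulation buffer.
    """
    h, w = ref_view.shape
    accum = np.zeros_like(ref_view)            # accumulation buffer
    filled = np.zeros((h, w), dtype=bool)      # pixels already written

    for d in range(num_disparities):           # far to near
        mask = disparity == d
        if not mask.any():
            continue
        shifted_view = np.zeros_like(ref_view)
        shifted_mask = np.zeros((h, w), dtype=bool)
        if d == 0:
            shifted_view, shifted_mask = ref_view.copy(), mask
        else:
            shifted_view[:, d:] = ref_view[:, :-d]
            shifted_mask[:, d:] = mask[:, :-d]
        # Replace buffer values wherever the motion-compensated mask is one.
        accum[shifted_mask] = shifted_view[shifted_mask]
        filled |= shifted_mask

    return accum, filled   # pixels with filled == False are the holes to inpaint
```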

We shall encode both the image I0 and the disparity map δ in the wavelet

domain. Let c and d be the column vectors of their wavelet coefficients, respectively.

The wavelet synthesis operators relate these vectors by

Î_0 ≜ Tc   and   δ ≜ T(d),    (3.2)

where the matrix T represents the linear wavelet transform for the image and

the function T represents the integer-to-integer wavelet transform for the discrete-

valued disparity map.

We define two significance maps, σ(c) and σ(d), which are binary vectors with

value one in the presence of nonnull wavelet coefficients and zero otherwise. These

maps are not directional, i.e., they are the same for all directional subbands at

each scale of the 2D wavelet transform. In this way, we will be able to compare


σ(c) and σ(d) even when the wavelet operators T and T differ in their directional

division of the frequency plane.

In natural images, discontinuities in the disparity map are usually associated

with discontinuities in the image. When they are not, it is very difficult to estimate

the disparity discontinuities from multiple views. Therefore, we can reduce the data

redundancy of the DIBR by coding the image and the disparity significance maps

jointly. This is done by coding only the significance map of the image σ(c) and

assuming the significant coefficients in σ(d) to be a subset of those in σ(c), that is,

σ(c) = 0 ⇒ σ(d) = 0.    (3.3)

This joint encoding also reduces the complexity of the search for the optimal vector

d∗ by fixing a large number of its coefficients to zero.

The total rate R(c, d) is then given by the sum of the rates R(c) and R(d|σ(c))

and the RD problem is

min_{c,d} (1/N) Σ_{v=0}^{N_v−1} ‖I_v − M^f_v(Tc; T(d))‖_2^2   such that   R(c) + R(d|σ(c)) ≤ R_max    (3.4)

where Rmax is the maximum rate allowed. The constraint σ(c) = 0 ⇒ σ(d) = 0

appears implicitly in the rate constraint: R(d|σ(c)) takes the value +∞ when this

constraint is violated. Introducing the Lagrange multiplier λ [67], (3.4) can be

written as

min_{c,d} (1/N) Σ_{v=0}^{N_v−1} ‖I_v − M^f_v(Tc; T(d))‖_2^2 + λ (R(c) + R(d|σ(c))).    (3.5)

This equation has three goals. First, it encodes the reference view. Second, it

estimates the disparity map, and therefore allows the rendering of arbitrary views.


Finally, it encodes this disparity map. Solving this optimization will be the topic

of the next section.

3.3 Rate-Distortion Optimization

3.3.1 Overview

Since (3.5) is nonlinear, solving it is not a trivial operation. Therefore, we formulate

several approximations to obtain a computationally efficient method.

First, we ignore the issues of occlusions and resampling. In this way, the motion-

compensation operation becomes invertible and the optimization problem can be

defined either on the rendered views or on the reference view. The latter option

turns out to be much more practical because it decouples the encoded reference

view from the motion compensation. Mathematically, this assumption is equivalent

to

‖I_v − M^f_v(Î_0; δ)‖_2^2 ≈ ‖M^b_v(I_v; δ) − Î_0‖_2^2    (3.6)

where Mbv(Iv; δ) denotes the backward-motion-compensated view Iv. Equation

(3.5) then becomes

min_{c,d} (1/N) Σ_{v=0}^{N_v−1} ‖M^b_v(I_v; T(d)) − Tc‖_2^2 + λ (R(c) + R(d|c)).    (3.7)

The MSE term in (3.7) depends on the wavelet vectors c and d in very different

ways: it is quadratic in c but not in d. Therefore, the minimization is solved

using coordinate descent [68], first minimizing over c and then over d. The minimization of

c ignores the dependency with d due to the shared significance map. This shall

allow us to use classical wavelet coding techniques for c and dynamic programming

for d.


The optimization is initialized at high bitrate where the MSE is virtually null,

that is,

Tc ≈ I_0   and   M^b_v(I_v; T(d)) ≈ I_0.    (3.8)

In general, we would need to iterate the successive optimization process until con-

vergence. Here, however, only one iteration is run to reduce the computational

complexity and prevent erroneous disparities from introducing blur in the encoded

reference view I0.

In the remainder of this section, we first describe the optimization of the refer-

ence view in Section 3.3.2. We then detail the optimization of the disparity map,

beginning with the simpler case of one-dimensional views in Section 3.3.3, which

we extend to two-dimensional views in Section 3.3.5. Finally, we present how a

quality scalable bitstream can be obtained in Section 3.3.7.

3.3.2 Reference view

We start with the optimization of the wavelet coefficients c of the reference view.

Fixing d and using the high-bitrate assumption (3.8), the optimization problem

(3.7) becomes

min_c (1 / (N_m N_n)) ‖I_0 − Tc‖_2^2 + λ R(c).    (3.9)

When the wavelet transform T is nearly orthonormal, like the 9/7 wavelet [18] for

instance, this equation can be further simplified to

min_c (1 / (N_m N_n)) ‖T^{−1} I_0 − c‖_2^2 + λ R(c),    (3.10)

which is a standard problem in image compression and is readily solved by wavelet-

based coders [19].


3.3.3 One-dimensional disparity map

The next step is to find an optimal solution for the wavelet coefficients d of the

disparity map. To begin with, we consider the special case of one-dimensional

views (N_m = 1).

Fixing c, the optimization problem (3.7) becomes

min_d (1/N) Σ_{v=0}^{N_v−1} Σ_{n=0}^{N_n−1} ( M^b_{v,n}(I_v; δ_n) − Î_{0,n} )^2 + λ R(d|σ(c)),    (3.11)

where Î_0 ≜ Tc and M^b_{v,n}(I_v; d) denotes the intensity value of the pixel in I_v which

would correspond to the pixel n in the reference view if the pixel n had the disparity

value d. Unlike in the previous section, the MSE term is not a quadratic function of

the wavelet coefficients, due to the nonlinearity of motion compensation. Instead,

we take advantage of the fact that the disparity map only takes a finite number of

values.

The MSE term can be written in terms of an error matrix E in which the entry

E_{d,n} gives the scaled square error with which the pixel n of Î_0 would be associated if it had disparity d (see Figure 3.4). That is,

E_{d,n} ≜ (1/N) Σ_{v=0}^{N_v−1} ( M^b_{v,n}(I_v; d) − Î_{0,n} )^2.    (3.12)

This error matrix is also called “disparity space image” [25] and is independent of

the disparity map δ. Computing this matrix has a complexity of O(N N_d), where N_d is the number of disparity values.
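For illustration, a minimal sketch of this computation is given below for 1D rectified views, where the backward compensation M^b_{v,n}(I_v; d) reduces to a baseline-scaled horizontal shift; the baseline parameterization and the border clamping are simplifying assumptions of the sketch.

```python
import numpy as np

def error_matrix(views, baselines, ref_decoded, num_disparities):
    """Error matrix E of (3.12) for 1D rectified views (sketch).

    views       : (Nv, Nn) array of recorded views I_v, view 0 being the reference.
    baselines   : length-Nv array with baselines[0] = 0; pixel n at disparity d
                  is matched to column n + baselines[v] * d in view v.
    ref_decoded : the decoded reference view, i.e. Tc.
    """
    Nv, Nn = views.shape
    N = Nv * Nn                                # total number of pixels (N_m = 1)
    E = np.zeros((num_disparities, Nn))
    cols = np.arange(Nn)
    for v in range(Nv):
        for d in range(num_disparities):
            src = np.clip(cols + int(round(baselines[v] * d)), 0, Nn - 1)
            E[d] += (views[v, src] - ref_decoded) ** 2 / N
    return E
```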

We study the encoding of the disparity map using two transforms, namely the

Sequential (S) transform [19] and a transform we call the Laplace (L) transform

due to its resemblance to the Laplacian pyramid [69]. Both provide a compact

representation of discrete and piecewise-constant disparity maps. Both also induce


Figure 3.4: An error matrix E from the Tsukuba image set (disparity versus abscissa) with two optimal paths overlaid, λ = 0 (dashed) and λ = ∞ (solid). Lighter shades of gray indicate larger squared intensity differences.

Table 3.1: Analysis and synthesis operators of the Laplace (L) and Sequential (S) transforms (see text for details).

L-transform analysis (3.13):
    l^{(j)}_n = ⌊( l^{(j−1)}_{2n} + l^{(j−1)}_{2n+1} ) / 2⌋
    h^{(j−1)}_{2n} = l^{(j−1)}_{2n} − l^{(j)}_n
    h^{(j−1)}_{2n+1} = l^{(j−1)}_{2n+1} − l^{(j)}_n

L-transform synthesis (3.14):
    l^{(j−1)}_{2n} = l^{(j)}_n + h^{(j−1)}_{2n}
    l^{(j−1)}_{2n+1} = l^{(j)}_n + h^{(j−1)}_{2n+1}

S-transform analysis (3.15):
    l^{(j)}_n = ⌊( l^{(j−1)}_{2n} + l^{(j−1)}_{2n+1} ) / 2⌋
    h^{(j)}_n = l^{(j−1)}_{2n} − l^{(j−1)}_{2n+1}

S-transform synthesis (3.16):
    S_0:  l^{(j−1)}_{2n} = l^{(j)}_n − ⌊h^{(j)}_n / 2⌋ + h^{(j)}_n
    S_1:  l^{(j−1)}_{2n+1} = l^{(j)}_n − ⌊h^{(j)}_n / 2⌋

graphs of dependencies between their wavelet coefficients as trees, so that the

problem can be efficiently solved using dynamic programming [70]. They differ in

their redundancy, the former being nonredundant. They also differ in the com-

plexity of their optimization with regard to the number of disparity values Nd, the

latter being of linear complexity and the former of quadratic complexity.

The analysis and synthesis operators of these two transforms are given in Ta-

ble 3.1, where ⌊x⌋ denotes the largest integer less than or equal to x. They relate

the low-pass coefficients l and the high-pass coefficients h between the level j − 1

with finer resolution and the level j with coarser resolution.
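To make the integer arithmetic concrete, here is a small sketch of one S-transform level together with a perfect-reconstruction check; the floors follow the ⌊·⌋ convention of Table 3.1, and the example signal is arbitrary.

```python
import numpy as np

def s_analysis(x):
    """One S-transform analysis level (3.15): floored average and difference."""
    x0, x1 = x[0::2].astype(int), x[1::2].astype(int)
    return (x0 + x1) // 2, x0 - x1            # low-pass l, high-pass h

def s_synthesis(l, h):
    """One S-transform synthesis level (3.16), operators S_0 and S_1."""
    x1 = l - h // 2                            # S_1 (Python // is the floor)
    x0 = x1 + h                                # S_0
    return np.stack([x0, x1], axis=1).ravel()  # re-interleave even/odd samples

x = np.array([3, 1, 4, 1, 5, 9])
l, h = s_analysis(x)
assert np.array_equal(s_synthesis(l, h), x)    # integer-to-integer, lossless
```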


The wavelet vector d is made of all the high-pass coefficients h, along with the low-pass coefficients l of the coarsest level j = L − 1. The low-pass coefficients l at the finest level j = 0 are equal to the disparities δ, that is, l^{(0)}_n = δ_n.

The probability of the wavelet coefficients is approximated as follows. The co-

efficients l at the coarsest level and the coefficients h are assumed to be jointly

independent. The coefficients l follow a uniform distribution. The coefficients h

are null with probability one if the corresponding image coefficients are insignifi-

cant and otherwise follow a discrete and truncated Laplace distribution with zero

mean [19], that is,

p(h^{(j)}_n | σ^{(j)}_n(c)) =
    1_{h^{(j)}_n = 0}                                    if σ^{(j)}_n(c) = 0,
    (1/Z) e^{−|h^{(j)}_n| / b} 1_{|h^{(j)}_n| ≤ N_d}     otherwise,    (3.17)

where b is a parameter to be estimated and Z is a normalizing constant. This

probability distribution defines the entropy of the data [17], which we use as an

approximation of the actual bitrate. The bitrate, in bits per pixel, is therefore

R(d|σ(c)) = − Σ_{j=0}^{L−1} Σ_{n=0}^{N_h(j)−1} log_2 ( p(h^{(j)}_n | σ^{(j)}_n(c)) ) + cst    (3.18)

where log2 denotes the logarithm to base 2, cst is a constant term independent of

d, and N_h(j) is the number of high-pass coefficients at level j. We introduce the

cost function

C(h^{(j)}_n) ≜
    +∞               if σ^{(j)}_n(c) = 0 and h^{(j)}_n ≠ 0,
    µ |h^{(j)}_n|    otherwise,    (3.19)

where µ ≜ λ/(b log 2) acts as a smoothness factor. Using this cost function, the


equation of the bitrate (3.18) becomes

R(d|σ(c)) = (1/λ) Σ_{j=0}^{L−1} Σ_{n=0}^{N_h(j)−1} C(h^{(j)}_n) + cst    (3.20)

and the optimization problem (3.11) can be written as

min_d Σ_{n=0}^{N_n−1} E_{l^{(0)}_n, n} + Σ_{j=0}^{L−1} Σ_{n=0}^{N_h(j)−1} C(h^{(j)}_n).    (3.21)

3.3.4 Dynamic programming

The optimization problem (3.21) is still a minimization over a space with large

dimension. However, it can be solved recursively by a series of minimizations over

small search spaces. The approach consists in using the commutativity of the sum

and min operators to group the terms of the summation together based on the

variables they depend on. This is possible due to the choice of wavelets, which do

not introduce loops in the dependency graph of the group of terms, as shown in

Figures 3.5 and 3.6. In these figures, the notation E:,n denotes the column n of the

error matrix E. This column vector contains the errors of the different disparity

values at the pixel location n.

Example 1 Let us consider a simple example to illustrate the algorithm. The

optimization problem associated with a two-level L transform, emphasized by a

dashed box in Figure 3.5, is given by

min_{l^{(1)}_0, h^{(0)}_0, h^{(0)}_1} ( E_{l^{(1)}_0 + h^{(0)}_0, 0} + E_{l^{(1)}_0 + h^{(0)}_1, 1} + C(h^{(0)}_0) + C(h^{(0)}_1) ).    (3.22)

By grouping the terms of the summation together and commuting the min and sum


Figure 3.5: Dependency graph of a three-level L transform. The coefficients in bold are those included in the wavelet vector d. Gray nodes represent the MSE and rate terms of the RD optimization. The dashed box highlights the two-level L transform associated with (3.22).

Figure 3.6: Dependency graph of a three-level S transform. The coefficients in bold are those included in the wavelet vector d. Gray nodes represent the MSE and rate terms of the RD optimization.


operators, it can be rewritten as

min_{l^{(1)}_0} ( min_{h^{(0)}_0} ( E_{l^{(1)}_0 + h^{(0)}_0, 0} + C(h^{(0)}_0) ) + min_{h^{(0)}_1} ( E_{l^{(1)}_0 + h^{(0)}_1, 1} + C(h^{(0)}_1) ) )    (3.23)

which reduces the complexity from cubic to quadratic.

Next, we illustrate how to solve the inner minimizations. Let us consider the minimization over h^{(0)}_0 with a smoothness factor µ = 0.5. We assume that the disparity values range from 0 to 5 and solve for the case l^{(1)}_0 = 2. Let us assume that the first column vector of the error matrix E is

l^{(0)}_0     0    1    2    3     4    5
E^T_{:,0}     2    5    3    0.25  4    2        (3.24)

Stacking the values of the cost function C(h^{(0)}_0) for each h^{(0)}_0 (note that h^{(0)}_0 = l^{(0)}_0 − l^{(1)}_0 = l^{(0)}_0 − 2) into a cost vector C using (3.19) gives

l^{(0)}_0     0    1    2    3    4    5
h^{(0)}_0    −2   −1    0    1    2    3
C^T           1    0.5  0    0.5  1    1.5       (3.25)

The sum of these two vectors is

l^{(0)}_0          0    1     2    3     4    5
E^T_{:,0} + C^T    3    5.5   3    0.75  5    3.5    (3.26)

The minimum is therefore 0.75, which is reached at l^{(0)}_0 = 3. By the definition of the synthesis operator (3.14), it follows that the optimal high-pass coefficient associated with l^{(1)}_0 = 2 is h^{(0)}_0 = 1.
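This inner minimization can be checked in a few lines of code; the numbers below are exactly those of (3.24) to (3.26).

```python
import numpy as np

E_col = np.array([2.0, 5.0, 3.0, 0.25, 4.0, 2.0])   # E_{:,0} for disparities 0..5
mu, l1 = 0.5, 2                                      # smoothness factor and l^(1)_0

l0 = np.arange(E_col.size)          # candidate finest-level values l^(0)_0
h = l0 - l1                         # implied high-pass coefficients h^(0)_0
total = E_col + mu * np.abs(h)      # E + C(h), with C from (3.19)

print(total)                                               # [3.  5.5 3.  0.75 5.  3.5]
print(total.min(), l0[total.argmin()], h[total.argmin()])  # 0.75 3 1
```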


In the general case, the recursive minimization is defined using a pyramid of

error matrices {E(j), j ∈ [0, . . . , L− 1]}. The error matrix at the finest level j = 0

is defined by

E^{(0)} ≜ E.    (3.27)

The error matrices of the L transform at coarser levels are given by

E^{(j)}_{d,n} = min_{h^{(j−1)}_{2n}} ( E^{(j−1)}_{d + h^{(j−1)}_{2n}, 2n} + C(h^{(j−1)}_{2n}) ) + min_{h^{(j−1)}_{2n+1}} ( E^{(j−1)}_{d + h^{(j−1)}_{2n+1}, 2n+1} + C(h^{(j−1)}_{2n+1}) ).    (3.28)

The error matrices of the S transform at coarser levels are given by

E^{(j)}_{d,n} = min_{h^{(j)}_n} ( E^{(j−1)}_{S_0(d, h^{(j)}_n), 2n} + E^{(j−1)}_{S_1(d, h^{(j)}_n), 2n+1} + C(h^{(j)}_n) )    (3.29)

where S0(.) and S1(.) denote the synthesis operators defined by (3.16).

Computing an error matrix E(j) of the S transform has a complexity quadratic

in the number of disparity values Nd: error values need to be computed for each

disparity value d and each value of the high-pass coefficient h(j)n . On the other

hand, an error matrix E(j) of the L transform can be computed with only linear

complexity, as was shown in [71] in the case of Markov random fields with linear

smoothness function.

The optimization problem (3.21) becomes simply

min_{l^{(L−1)}_n} E^{(L−1)}_{l^{(L−1)}_n, n}    (3.30)

for each low-pass coefficient l^{(L−1)}_n at the coarsest level L − 1.

The pyramid of error matrices is associated with a pyramid of matrices of

high-pass coefficients {H(j), j ∈ [0, . . . , L− 1]}. At each level, they store the high-


pass coefficients which achieve the minima in (3.28) or (3.29). Once the optimal

low-pass coefficients l(L−1)∗n are known, the low-pass and high-pass coefficients at

other levels are obtained by backtracking using the matrices H(j) and the synthesis

operators (3.16) or (3.14).

Therefore, the overall algorithm to solve (3.21) is the following:

1. The initialization creates the error matrix E^{(0)}.

2. The bottom-up pass computes the matrices E^{(j)} and H^{(j)}.

3. The coarsest-level minimization finds the optimal low-pass coefficients l^{(L−1)*}_n.

4. The top-down pass backtracks to compute the optimal low-pass and high-pass coefficients l^{(j)*}_n and h^{(j)*}_n at all levels.

At the end, both the globally optimal disparity map δ∗ and the globally optimal

wavelet vector d∗ are known. The initialization has a complexity of O(N N_d), the bottom-up pass of O(N_n N_d^2) in the case of the S transform and O(N_n N_d) in the case of the L transform, the coarsest-level minimization of O(N_n N_d), and the top-down pass of O(N_n).
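The four passes can be condensed into the following sketch for the one-dimensional L transform; the quadratic-time inner minimization is kept for clarity (the linear-time speed-up of [71] is omitted), the shared-significance-map constraint is ignored, and N_n is assumed to be a power of two.

```python
import numpy as np

def l_transform_dp(E, mu):
    """Globally minimize (3.21) under the L transform (sketch).

    E  : (Nd, Nn) error matrix of (3.12), Nn a power of two.
    mu : smoothness factor from (3.19).
    Returns the optimal disparity map, i.e. the finest-level low-pass coefficients.
    """
    Nd, Nn = E.shape
    d_vals = np.arange(Nd)
    # pair_cost[parent, child] = mu * |h| with h = child - parent, as in (3.19)
    pair_cost = mu * np.abs(d_vals[None, :] - d_vals[:, None])

    levels, choices = [E.astype(float)], []
    while levels[-1].shape[1] > 1:                      # bottom-up pass, (3.28)
        prev = levels[-1]
        n_coarse = prev.shape[1] // 2
        cur = np.zeros((Nd, n_coarse))
        best = np.zeros((Nd, n_coarse, 2), dtype=int)   # argmin child values (H^(j))
        for n in range(n_coarse):
            for child in (0, 1):
                totals = pair_cost + prev[:, 2 * n + child][None, :]
                best[:, n, child] = totals.argmin(axis=1)
                cur[:, n] += totals.min(axis=1)
        levels.append(cur)
        choices.append(best)

    # Coarsest-level minimization (3.30), then top-down backtracking.
    coarse = np.array([levels[-1][:, 0].argmin()])
    for best in reversed(choices):
        fine = np.zeros(2 * coarse.size, dtype=int)
        for n, d in enumerate(coarse):
            fine[2 * n] = best[d, n, 0]
            fine[2 * n + 1] = best[d, n, 1]
        coarse = fine
    return coarse                                       # optimal disparities l^(0)
```

Given the error matrix of (3.12) and a smoothness factor µ, `l_transform_dp(E, mu)` returns the jointly estimated and encoded disparity map; the wavelet vector d then follows by applying the analysis operators (3.13).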

This algorithm shares similarities with quadtree-based motion estimation [61].

In quadtree estimation, small minimizations are solved at each node of the tree to

find optimal motion vectors and decide whether to split or merge. In the proposed

algorithm, small minimizations are also solved at each node, but they find optimal

wavelet coefficients instead. A major difference between quadtrees and wavelets lies

in the way the data is stored in the tree. Quadtrees store the data at their leaves

using independent coefficients, while wavelets spread the data over the entire tree

using differential coefficients. Therefore, wavelets offer resolution scalability while

quadtrees do not. Another difference lies in the induced smoothness. Quadtrees

enforce constant values inside blocks but no smoothness between blocks, while


wavelets induce a smoothness between all pixels. Our experimental results shall

show that the latter reduces spurious noise in the estimated disparity maps.

3.3.5 Two-dimensional disparity map

We now extend the optimization procedure to two-dimensional views. The error

matrix E^{(0)}_{d,n} becomes an error tensor E^{(0)}_{d,m,n} with three dimensions: rows m, columns n, and disparities d. It is defined as

E^{(0)}_{d,m,n} ≜ (1/N) Σ_{v=0}^{N_v−1} ( M^b_{v,m,n}(I_v; d) − Î_{0,m,n} )^2.    (3.31)

Its computation has a complexity of O(NNd), which remains linear in all the

variables.

The two-dimensional extension of the L transform is also straightforward. Its

synthesis operator (3.14) simply becomes

l^{(j−1)}_{2m,2n} = l^{(j)}_{m,n} + h^{(j−1)}_{2m,2n}
l^{(j−1)}_{2m+1,2n} = l^{(j)}_{m,n} + h^{(j−1)}_{2m+1,2n}
l^{(j−1)}_{2m+1,2n+1} = l^{(j)}_{m,n} + h^{(j−1)}_{2m+1,2n+1}
l^{(j−1)}_{2m,2n+1} = l^{(j)}_{m,n} + h^{(j−1)}_{2m,2n+1}    (3.32)

The computational complexity at each node of the dependency tree remains O(N_d), with a total complexity of O(N_m N_n N_d) for the bottom-up pass.
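In code, one level of this 2D synthesis amounts to a nearest-neighbor upsampling plus the high-pass band, for example:

```python
import numpy as np

def l_synthesis_2d(l_coarse, h_fine):
    """One 2D L-transform synthesis level (3.32): each coarse low-pass
    coefficient is replicated over its 2x2 block of children and the
    corresponding high-pass coefficients are added."""
    return np.kron(l_coarse, np.ones((2, 2), dtype=l_coarse.dtype)) + h_fine
```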

The two-dimensional extension of the S transform is slightly more complex.

We follow the classical approach of applying the one-dimensional wavelet trans-

form twice at each scale [18], once horizontally and once vertically. However,

we depart from the usual four-band division of the frequency plane (high-high,

high-low, low-high, low-low) shown in Figure 3.7(a). If we followed this division,


Figure 3.7: Two divisions of the frequency plane, (a) the four-band division and (b) the three-band division, and the associated graphs of dependencies between the coefficients of the S transform.

the minimizations at each node of the dependency tree (3.29) would depend on

four variables: ll^{(j)}_{m,n}, hl^{(j)}_{m,n}, lh^{(j)}_{m,n}, and hh^{(j)}_{m,n}. Therefore, the complexity of each minimization would grow from O(N_d^2) to O(N_d^4), which is only feasible when few disparity values are allowed.

Instead we propose to divide the frequency plane into only three bands at each

scale, as shown in Figure 3.7(b). The first transform is applied vertically, leading to

two bands (low, high). The second transform is applied horizontally, but only onto

the previous low band. This way the complexity at each node of the dependency


tree remains O(N_d^2), with a total complexity of O(N_m N_n N_d^2) for the bottom-up pass.

3.3.6 Bitrate optimization

The parameter b of the Laplace distribution is estimated using bracketing and a

search akin to bisection [55, 66]. A large bracket is initially chosen, whose size is

iteratively reduced. At the ith iteration, the optimal coefficients {l, h} are found

and the actual parameter b(i) is estimated by minimizing the Kullback-Leibler

divergence [17] between the histogram of the coefficients {h} and the Laplace

distribution (3.17). The current Lagrange multiplier λ(i) is obtained using the

equation

λ^{(i)} = µ^{(i)} b^{(i)} log 2    (3.33)

and the parameter µ is updated by

µ^{(i+1)} = (λ / λ^{(i)}) µ^{(i)}    (3.34)

where λ is the target RD slope used to encode the reference view. This update

equation has the advantage of being independent of the bracket size, derivative-

free, and exact when λ is a linear function of µ. The iterations end when the

relative error |λ − λ^{(i)}|/λ becomes small enough.
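A sketch of this search is given below; `solve_dp` and `fit_laplace_b` stand for the dynamic program of Section 3.3.4 and the KL-divergence fit, and, together with the stopping tolerance, are placeholders. The bracketing safeguard of the actual implementation is omitted.

```python
import math

def fit_smoothness_factor(target_lambda, solve_dp, fit_laplace_b,
                          mu=1.0, tol=0.05, max_iter=20):
    """Iterate (3.33)-(3.34) until the RD slope matches the target slope."""
    for _ in range(max_iter):
        h = solve_dp(mu)                    # optimal high-pass coefficients for this mu
        b = fit_laplace_b(h)                # Laplace parameter b^(i)
        lam = mu * b * math.log(2)          # lambda^(i), equation (3.33)
        if abs(target_lambda - lam) / target_lambda < tol:
            break
        mu *= target_lambda / lam           # update (3.34)
    return mu
```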

The final bitstream of the disparity map is generated by fixed-length coding

of the low-pass coefficients in d, fixed-length coding of the sign of the high-pass

coefficients, and arithmetic coding [17] of their absolute values. Only the high-pass

coefficients for which σ(c) is one are encoded.


3.3.7 Quality scalability

The wavelet-based encoding of the reference views allows both resolution and qual-

ity scalabilities [19]. As is, the proposed wavelet-based encoding of the disparity

map only allows resolution scalability. Quality scalability is achieved by introduc-

ing quality layers [19].

The qth quality layer is associated with a vector of wavelet coefficients d(q),

which is encoded using differential pulse code modulation (DPCM) between quality

layers. The optimization problem then becomes

min_{{d^{(q)}}} Σ_{q=1}^{N_q} ( (1/N) Σ_{v=0}^{N_v−1} ‖M^b_v(I_v; T(d^{(q)})) − Î^{(q)}_0‖_2^2 + λ^{(q)} R(d^{(q)} − d^{(q−1)} | c^{(q)}, d^{(q−1)}) )    (3.35)

where N_q is the number of quality layers, Î^{(q)}_0 is the quantized reference view from the qth quality layer, and λ^{(q)} is the associated Lagrange multiplier. The vector d^{(0)} is chosen to be the null vector. The differential vectors d^{(q)} − d^{(q−1)} are assumed to be jointly independent and to follow a discrete and truncated Laplace distribution parameterized by b^{(q)}.

The optimization problem is solved sequentially for each d^{(q)}. The minimization for the qth quality layer is given by

min_{d^{(q)}} (1/N) Σ_{v=0}^{N_v−1} ‖M^b_v(I_v; T(d^{(q)})) − Î^{(q)}_0‖_2^2 + λ^{(q)} R(d^{(q)} − d^{(q−1)} | c^{(q)}, d^{(q−1)})    (3.36)

which is similar to the minimizations described in the previous sections and can

be solved in the same way.
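The sequential structure can be summarized as follows; `solve_quality_layer` is a placeholder for a DP solver that penalizes the DPCM residual d^{(q)} − d^{(q−1)}, and the argument layout is an assumption of the sketch.

```python
def encode_quality_layers(error_matrices, lambdas, solve_quality_layer):
    """Sequentially solve (3.36) for each quality layer.

    error_matrices[q] is computed against the reference view decoded at layer q,
    lambdas[q] is the associated Lagrange multiplier, and the returned list holds
    the wavelet vectors d^(1), ..., d^(Nq).
    """
    d_prev = None                       # d^(0) is the null vector
    layers = []
    for E, lam in zip(error_matrices, lambdas):
        d = solve_quality_layer(E, lam, d_prev)
        layers.append(d)
        d_prev = d
    return layers
```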


3.4 Experimental Results

We present experimental results on two image sets, Tsukuba and Teddy [25], dis-

played in Figure 3.8. The Tsukuba set has a fairly limited range of disparities,

with only 15 disparity values. On the other hand, the Teddy set has a much larger

range, with 60 disparity values. As a consequence, the Teddy set contains much

larger areas of occlusions and disocclusions.

Experiments have been run in the grayscale domain with intensity values in the

range [0, 1]. A border of two pixels has been removed around the images of Tsukuba

to compensate for camera artifacts. The experiments have been conducted using

nine views with the central view as reference for the Tsukuba set, and two views

with the left view as reference for the Teddy set. In the following, the bitrate

is defined in bits per reference-view pixel, which does not depend on the total

number of views in the optimization.

In order to benchmark the performances of the proposed RD-optimized wavelet

codecs, they are compared against two other classical codecs, one based on block

matching [50, 52] and the other on quadtrees [61, 72]. These two codecs usually

handle 2D motion vectors instead of 1D disparities. To obtain a fair comparison,

they are adapted to perform one-dimensional optimizations. The encoding is per-

formed in closed loop to obtain the least possible distortion at the decoder. The

block-based encoder simply minimizes the MSE of 8× 8 blocks and generates the

bitstream using fixed-length codes. The quadtree-based encoder performs a full

RD-optimization, as detailed in [61].

All codecs rely on the QccPack implementation of SPIHT [73] to encode the

reference view. Therefore, the wavelets, quadtrees, and blocks are all optimized

using the same error tensors E. The codecs based on quadtrees and wavelets


(a) Tsukuba image set

(b) Teddy image set

Figure 3.8: The two sets of images used in the experiments (from [25]).


(a) Level 0 (b) Level 2

(c) Level 4 (d) Level 6

Figure 3.9: Disparity map of the Teddy set at four resolution levels, showing the resolution scalability of the wavelet-based representation.

automatically allocate the bitrate of the disparity maps. Therefore, the codecs are

compared at RD points with equal RD slopes, but possibly different total bitrates.

Figure 3.9 illustrates the resolution scalability of the proposed wavelet-based

representations. Unlike quadtrees which store the disparity information only at

their leaves, wavelets store this information over the entire tree, which allows par-

tial decoding of the tree at multiple resolutions. In the experiments, the wavelet

decomposition is performed completely, that is, until the low-pass band is reduced

to a single pixel. Experiments have shown that stopping the decomposition earlier,

as is usually done in image coding, does not allow enough information aggregation

in large textureless regions and leads to erroneous disparity estimations.


Figures 3.10 and 3.11 show the DIBR encoded at three RD slopes – approx-

imately 1 × 10^−2, 2 × 10^−3, and 4 × 10^−4 – which correspond to reference views

encoded at bitrates of 0.1 bpp, 0.5 bpp, and 1.0 bpp.

The block-based encoder appears extremely sensitive to the lack of image tex-

ture. At low bitrates, the disparity map becomes extremely noisy and is a poor

estimation of the ground truth. The noise is much reduced at higher bitrates, but

remains significant in some areas, like the upper-right corner of Tsukuba or the

roof of the house in Teddy. This seriously hinders the synthesis of novel viewpoints.

The quadtree-based encoder proves to be much more reliable. Using the RD-

optimization, it is able to gracefully decrease the quality of the disparity map when

the bitrate is reduced. Not only do the disparity maps become coarser, but they

also tend to have less spurious noise because such noise has a high bitrate cost.

The wavelet-based encoders, using both the S transform and the L transform,

demonstrate a similar behavior. Compared to the quadtree-based encoder, they

tend to generate disparity maps with less spurious noise. In quadtrees, the rate

constraint favors larger blocks. However, the disparity values between blocks are

independent. In wavelets, on the other hand, the rate constraint favors small

wavelet coefficients, which creates dependencies between blocks and enforces inter-

block smoothness. The superiority of wavelets over quadtrees is especially notice-

able in the case of larger disparity ranges, which makes them more effective at

estimating and encoding the complex geometry of realistic 3D scenes.

All of these encoders have issues in areas of occlusions and disocclusions, as

can be seen for instance around the chimney of the Teddy set. This creates large

disparity errors which are detrimental for novel-view synthesis. This issue is con-

firmed by Figure 3.12. It shows two views synthesized from the DIBR encoded

at the RD slope 2× 10−3, along with the differences between the synthesized and

actual views. The dominant noise is due to occlusions and disocclusions. It has


Figure 3.10: The DIBR of the Teddy set at three RD slopes corresponding to reference-view bitrates of 0.1 bpp, 0.5 bpp, and 1.0 bpp (from left to right). The S and L transforms generate disparity maps that degrade gracefully with the bitrate and contain less spurious noise than quadtrees or blocks.


Figure 3.11: The DIBR of the Tsukuba set at three RD slopes corresponding to reference-view bitrates of 0.1 bpp, 0.5 bpp, and 1.0 bpp (from left to right). The S and L transforms generate disparity maps that degrade gracefully with the bitrate and contain less spurious noise than quadtrees or blocks.


(a) Tsukuba

(b) Teddy

Figure 3.12: Views synthesized from the DIBR with a reference view encoded at 0.5 bpp and differences with the original views, for (a) Tsukuba and (b) Teddy. At low quantization noise, the errors are mostly due to occlusions and disocclusions.

two sources. First, in these areas there are no correspondences between images,

which leads to erroneous disparity estimations. Second, the hole-filling process is

efficient when disocclusions are small, but has difficulty handling large occlusions

like the one on the right of the Teddy set.

We confirm this qualitative analysis by a quantitative one. Figure 3.13 shows

the RD performances of all the codecs. The block-based codec is the least effi-

cient. Without some kind of regularization this method is not suitable for novel

view synthesis, which underlines the interest of jointly estimating and encoding

the disparity map. In Tsukuba, where the disparity range is small, quadtrees out-

perform the L transform by 0.09 dB and the S transform by 0.12 dB. On the other


Figure 3.13: Rate-distortion performances of the encoders based on wavelets (S and L transforms), quadtrees, and blocks on (a) Tsukuba and (b) Teddy. Wavelets are superior to quadtrees and blocks in the case of larger disparity ranges.

hand, in Teddy, where the disparity range is much larger, the wavelets outperform

quadtrees by 0.84 dB for the L transform and 0.70 dB for the S transform. Both

the L and the S transform offer similar RD performances. The advantage of the L

transform is primarily the lower computational complexity of its optimization.

Figure 3.14 compares the quality-scalable versions of the wavelets to their non-

scalable counterpart. In Tsukuba, quality scalability has a PSNR cost of at most

0.29 dB, both for the S and L transform. In Teddy, the PSNR cost is lower for

the L transform, with at most 0.34 dB, than for the S transform, with at most

0.47 dB.

Finally, Figure 3.15 reports the optimized bitrate allocation between the ref-

erence view and the disparity map. Except at very low bitrates, the allocation

remains stable across the whole range of bitrates, with between 13% and 23%

of the total bitrate dedicated to the disparity map. This is consistent with the

heuristic ratio of 10% proposed in [10].


Figure 3.14: RD loss due to quality-scalable coding on (a) Tsukuba and (b) Teddy. The loss remains limited over the whole range of bitrates.

Figure 3.15: Fraction of the bitrate allocated to the disparity maps on (a) Tsukuba and (b) Teddy. Except at very low bitrates, the rate ratios are stable with values between 13% and 23%.


3.5 Conclusion

This chapter has proposed a novel wavelet-domain DIBR codec able to approxi-

mate static plenoptic functions inside a ROI. The wavelet coefficients for both the

images and the disparity maps have been estimated and encoded jointly to provide

an optimized bitrate allocation and reduce the ambiguity of the disparity estima-

tion. In spite of the nonlinearity of the optimization problem, a globally optimal

encoding of the disparity maps has been found using dynamic programming along

the tree of integer wavelet coefficients. In addition to the resolution scalability

intrinsic to wavelets, quality scalability has been introduced using quality layers.

Finally, experimental results on real data have confirmed the performances of the

proposed codec. Future work will aim to extend the optimization of the disparity

map to more general integer wavelets, mitigate the issues due to occlusions, and

compress dynamic plenoptic functions.


CHAPTER 4

JOINT ENCODING OF THE DIBR USING SHAPE-ADAPTIVE WAVELETS

4.1 Introduction

Free-viewpoint three-dimensional television (3D-TV) provides an enhanced view-

ing experience in which users are able to perceive the third spatial dimension and

are free to move inside the 3D video [47]. With the advent of multiview autostereo-

scopic displays [47], 3D-TV is expected to be the next evolution of television after

high definition.

Three-dimensional television poses new technological challenges, which include

recording, encoding, and displaying 3D videos. At the core of these challenges

lies the massive amount of data required to represent the set of all possible views,

called the plenoptic function [24], or at least a realistic approximation of them.

The depth image based representation (DIBR) has recently

emerged as an effective approach [10], which allows both compact data repre-

sentation and realistic view synthesis. As shown in Figure 4.1, the DIBR is made

of pairs of images and depth maps, each of which provides a local approximation

of the plenoptic function. At the decoder, arbitrary views are synthesized from the

DIBR using image-based rendering [24].

Each pair of image and depth map can be seen as a four-channel image with

one channel for luma, two for chroma and one for depth. Therefore, classical im-

age and video codecs like MPEG-2, H.264/AVC, and JPEG2000 only need minor

modifications to be able to handle DIBRs [10, 47]. This approach however fails to


(a) Depth map and edges (b) Image and edges

Figure 4.1: Input data of the proposed DIBR codec: shared edges superimposed over a depth map (a) and an image (b).

take into consideration the fact that images and depth maps exhibit widely differ-

ent statistics, which make classical transforms like the discrete wavelet transform

(DWT) or the discrete cosine transform (DCT) ill-suited to encode DIBRs. In

particular, depth maps tend to contain sharper edges than images, which create

streaks of large coefficients in the transform domain.

In [74], it was shown that a representation based on platelets, which assumes

piecewise planar areas separated by piecewise linear edges, could lead to major

rate-distortion (RD) gains. However, the practical use of platelets is limited by

the computational cost of encoding. Unlike standard image codecs which rely on

fast transforms (such as DCT or DWT), quantizers, and entropy coders to encode

the data [19], platelets require the encoder to solve a complex RD optimization

problem.

Moreover, both platelet-based codecs and standard image codecs ignore another

source of data redundancy: the correlation between depth edges and image edges.

Indeed, the 3D scenes are usually made of objects with well-defined surfaces, which,

by projection onto the camera image plane, create edges at the same locations in

the depth map and the image.


In this chapter, we propose a codec which takes into account both sources of re-

dundancies. It encodes the locations of the major depth edges explicitly and treats

the regions they separate as independent during the wavelet transform by using an

extension of the shape-adaptive discrete wavelet transform (SA-DWT) [75]. The

proposed SA-DWT generates small wavelet coefficients both in smooth regions and

along the encoded edges. It is efficiently computed using lifting [19], a procedure

which is fast, in place, simple to implement, and trivially invertible. Moreover, the

explicit edge coding allows the codec to share the edge information between the

depth map and the image, which reduces their joint redundancy.

Thus the proposed codec amounts to a simple modification of a scheme that

independently codes depth and image using wavelet-based codecs. As a result, we

can benefit from the large body of existing work on wavelet-based codecs. However,

as we shall see, this modification leads to significant gains, up to 5.46 dB.

The remainder of the chapter is organized as follows. An overview of the

proposed codec is presented in Section 4.2. A detailed description of its components

is given in Sections 4.3, 4.4, and 4.5, which consider respectively the SA-DWT, the

handling of edges during lifting, and edge coding. Finally, Section 4.6 presents

experimental results.

4.2 Proposed Codec

As shown in Figure 4.2, the proposed DIBR encoder takes three signals as input,

which represent respectively the depth map, the image, and the edges. The DIBR

encoder is made of two wavelet encoders [19], one processing the depth map and

the other the image. Like standard wavelet encoders, each is made of a wavelet

transform followed by quantization and entropy coding. The DIBR decoder simply

inverses each of these steps in reverse order.


Figure 4.2: Overview of the proposed encoder. It relies on a SA-DWT and an edge coder (gray boxes) to reduce data correlations, both within and between the image and the depth map.

The major novelty of the proposed encoder lies in the introduction of a trans-

form by SA-DWT for both the image and the depth map. SA-DWT requires an

explicit representation of edges, which leads to the introduction of a lossless edge

encoder made of an edge transform and entropy coding. The explicit edge repre-

sentation has the advantage that it can be shared by both the image and depth

SA-DWTs, which leads to bitrate savings.

The DIBR encoder generates three bitstreams (image, depth, edge), which are

eventually concatenated by a multiplexer (MUX).

4.3 Shape-Adaptive Wavelet Transform

SA-DWT [75] relies on the notion of image object. An image object is made of

a binary mask, which indicates which pixels are inside the object, and an im-

age. The image values at outside pixels are assumed to be missing. The object is

transformed by an adequate downsampling of the mask and a DWT of the image,

extrapolating the missing values around the object boundary by symmetric exten-

sion. The 2D SA-DWT is implemented as separable 1D SA-DWTs and the process

is iterated on the low-pass band to obtain a multiresolution transform. SA-DWT

has the advantage of avoiding creating large wavelet coefficients around the object

boundary by treating the inside and outside areas as statistically independent.


It is this idea of statistically independent areas that we shall use to efficiently

code the DIBR. In our case, however, there is no single object. Instead, the DIBR

is made of multiple superimposed objects. This issue is overcome by replacing the

binary mask by a binary edge map and by treating the areas on opposite sides of

the edges as statistically independent.

Figure 4.3 presents an example of signal processed by the proposed SA-DWT.

The SA-DWT clearly creates far fewer nonzero wavelet coefficients around edges

than the standard DWT, which leads to bitrate savings. SA-DWT has the dis-

advantage of requiring a bitrate overhead to code the edge location. However,

experiments show that this overhead is more than compensated by the savings.

We rely on lifting [19] to implement the 1D SA-DWT. First, we present the

standard lifting and then describe the modifications making it shape adaptive. The

lifting splits the samples at odd and even locations into two cosets and modifies

them alternately using a series of “steps.”

By convention, lifting begins by modifying the odd coset by a so-called “predict”

step, which transforms a signal x into a signal y such that

y_{2t} = x_{2t},

y_{2t+1} = x_{2t+1} + \lambda_{2k}\,(x_{2t} + x_{2t+2}), \qquad (4.1)

where t denotes the sample locations, 2k the step number, and λ2k a weight. The

even coset is modified by a so-called “update” step, which transforms a signal x

into a signal y such that

y_{2t} = x_{2t} + \lambda_{2k+1}\,(x_{2t-1} + x_{2t+1}),

y_{2t+1} = x_{2t+1}. \qquad (4.2)

Figure 4.4 shows a graphical example of these lifting steps. Weights λ for


(a) Signal made of constant, linear and cubic pieces. The dashed red lines indicate edges.

(b) High-pass bands of a standard 9/7 DWT with symmetric extension.

(c) High-pass bands of a 9/7 SA-DWT with cubic extension.

Figure 4.3: Comparison of standard and shape-adaptive DWTs. In the latter case, all but the coarsest high-pass band are zero.


Figure 4.4: The four lifting steps associated with a 9/7 wavelet, which transform the signal x first into a and then into y. The values x2t+2 and a2t+2 on the other side of the edge (dashed vertical line) are extrapolated. They have dependencies with the values inside the two dashed triangles.

classical wavelets like the 5/3 or the 9/7 can be found in [19]. After the final

update step, the odd coset contains the high-pass coefficients and the even coset

the low-pass ones.
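
As an illustration of these two lifting steps, the minimal Python sketch below applies the predict and update steps of the 5/3 wavelet to a 1D signal. The weights −1/2 (predict) and 1/4 (update) are the standard 5/3 lifting weights, and the symmetric handling of the signal borders is an assumption of this sketch rather than a detail taken from this chapter.

    import numpy as np

    def lift_53(x):
        """One level of the 5/3 DWT computed by lifting: predict, then update.

        Returns (low, high): the even coset becomes the low-pass band and the
        odd coset the high-pass band. Borders use symmetric extension.
        """
        x = np.asarray(x, dtype=float)
        y = x.copy()
        n = len(x)
        # Predict step (weight -1/2): odd samples become high-pass residuals.
        for t in range(1, n, 2):
            left = x[t - 1]
            right = x[t + 1] if t + 1 < n else x[t - 1]  # symmetric extension
            y[t] = x[t] - 0.5 * (left + right)
        # Update step (weight 1/4): even samples become low-pass averages,
        # computed from the already-predicted odd samples.
        for t in range(0, n, 2):
            left = y[t - 1] if t - 1 >= 0 else y[t + 1]  # symmetric extension
            right = y[t + 1] if t + 1 < n else y[t - 1]
            y[t] = x[t] + 0.25 * (left + right)
        return y[0::2], y[1::2]

    # On a linear ramp, every interior high-pass coefficient is zero, as expected
    # for a wavelet that zeroes out linear polynomials.
    low, high = lift_53(np.arange(16, dtype=float))
    print(high)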

Standard lifting is made shape adaptive by modifying (4.1) and (4.2) at lo-

cations where they would perform a weighted addition of two samples xt and xt′

separated by an edge. In that case, the sample xt′ which is not on the same side of

the edge as the sample yt being computed is considered missing and is extrapolated

from samples on the same side as yt. Designing effective extrapolation methods is

the topic of the following section.


4.4 Lifting Edge Handling

Our goal is to design extrapolation methods which lead to null high-pass coefficients

around edges. The symmetric extension [75] ensures the continuity of the signal

across the edge but not of its higher order derivatives. Therefore, even wavelets

which zero out high-order polynomials are not able to zero them out near edges.

The extrapolation design we now present is able to overcome this limitation.

In the following, we only consider extrapolation on the right side of edges, as

shown in Figure 4.4. Left-side extrapolation follows by symmetry. Let us assume

that there is an edge at T + 1/2 and that left of this edge the signal is polynomial

of degree L, that is,

x_t = \sum_{k=0}^{L} \alpha_k\, t^k, \quad \forall t \le T. \qquad (4.3)

In order for lifting to be performed, some of the samples on the right side of

the edge need to be estimated. In Figure 4.4, where T = 2t + 1, these samples

are x2t+2 and a2t+2. We choose to extrapolate them using a weighted sum of the

samples on the left of the edge, using an equation of the form

z_{T+1} = \sum_{k=0}^{L} \mu_k\, z_{T-1-2k} \qquad (4.4)

where z denotes the signal and µ the vector of unknown weights. The extrapolation

only relies on samples from the same coset to maintain invertibility and in-place

computation.

If the wavelet zeroes out polynomials of degree L, all high-pass coefficients left

of the edge should be zero. Among all these equations, only those near the edge

depend on µ. In Figure 4.4, this corresponds to y2t−1 and y2t+1. We write them

as functions of µ, λ, and T using (4.1), (4.2), (4.3), and (4.4). Setting them to

zero gives a system of polynomials in T . We want the solution to be invariant


to even shifts of the edge, which means that the polynomials must be identically

zero. Therefore all the polynomial coefficients are zero, which gives rise to L new

equations per polynomial. The unknown weights µ are obtained by solving this

system of equations.

Writing and solving the equations may be quite tedious but can be easily done

by mathematical software and the solutions are particularly simple. In the case

of the 5/3 wavelet, which zeroes out linear polynomials, an extrapolation with L = 0

gives the standard symmetric extension with µ = 1 and an extrapolation with L =

1 gives a linear extension with µ = [ 2 −1 ]. In the case of the 9/7, which zeroes

out cubic polynomials, both of these extrapolations hold true and an extrapolation

with L = 3 gives a cubic extension with µ = [ 4 −6 4 −1 ]. The effectiveness

of this latter extrapolation is demonstrated in Figure 4.3.
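
As an illustration, the sketch below applies (4.4) with the weight vectors quoted above. The cubic test signal is only an example; for a polynomial of degree at most L sampled on the same coset, the extrapolated value coincides with the value the polynomial takes at T + 1.

    import numpy as np

    # Extrapolation weights mu of (4.4): constant (symmetric), linear, and cubic.
    MU = {0: [1.0], 1: [2.0, -1.0], 3: [4.0, -6.0, 4.0, -1.0]}

    def extrapolate(z, T, L):
        """Extrapolate z[T + 1] across an edge at T + 1/2 from same-coset samples on the left."""
        mu = MU[L]
        return sum(mu[k] * z[T - 1 - 2 * k] for k in range(len(mu)))

    # A cubic polynomial left of the edge: the cubic extension (L = 3) reproduces
    # exactly the value the polynomial would take at T + 1.
    t = np.arange(20, dtype=float)
    z = 0.5 * t**3 - 2.0 * t**2 + 3.0 * t + 1.0
    T = 15
    print(extrapolate(z, T, 3), z[T + 1])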

The proposed method fails in two cases. First, when the edge location is of

the form 2t + 1/2, the unknown high-pass coefficient y2t+1 has a dependency with

the low-pass coefficient y2t but not with any high-pass coefficient on the left of the

edge. There is therefore no equation available to solve its extrapolation. However,

we can assume that the polynomial is perfectly fitted on the right side of the

edge, which means that y2t+1 is zero. The extrapolation also fails when there is an

insufficient number of samples between two edges. In that case, the order L of the

extrapolation is reduced.

4.5 Edge Representation and Coding

The SA-DWT requires an explicit knowledge of the edge locations. In 1D, these

edges are located at half-integer locations between samples. In 2D, they are lo-

cated between samples along rows and columns, which gives rise to an edge lattice

dual to the sample lattice, as shown in Figure 4.5. Edges are either horizontal,


Figure 4.5: Example of the dual lattices of samples and edges. Each edge indicates the statistical independence of the two half rows or half columns of samples it separates.

splitting columns, or vertical, splitting rows. We represent them by two binary

edge maps, denoted respectively e(h)s+1/2,t and e(v)s,t+1/2, where s and t denote integer spatial locations.

Moreover, the SA-DWT is a multiresolution transform where each low-pass

band must be associated with a pair of edge maps. Let us denote by xs,t,j the samples at resolution level j and by e(h)s+1/2,t,j and e(v)s,t+1/2,j the associated edge maps. The pyramid of edge maps is obtained by iterative downsampling using the

equations

e^{(h)}_{s+1/2,\,t,\,j} = \max\!\left( e^{(h)}_{2s+1/2,\,2t,\,j-1},\; e^{(h)}_{2s+1+1/2,\,2t,\,j-1} \right),

e^{(v)}_{s,\,t+1/2,\,j} = \max\!\left( e^{(v)}_{2s,\,2t+1/2,\,j-1},\; e^{(v)}_{2s,\,2t+1+1/2,\,j-1} \right). \qquad (4.5)
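
As an illustration, one level of this edge-map downsampling can be written as follows. Storing e(h)s+1/2,t at array index [s, t] (and likewise for the vertical map) and assuming even map dimensions are conventions of this sketch.

    import numpy as np

    def downsample_edge_maps(eh, ev):
        """One level of the edge-map pyramid of (4.5).

        eh[s, t] marks a horizontal edge between rows s and s+1 at column t;
        ev[s, t] marks a vertical edge between columns t and t+1 at row s.
        A coarse edge is present if either of the two fine-level edges it
        covers is present, hence the max (a logical OR on binary maps).
        """
        eh_coarse = np.maximum(eh[0::2, 0::2], eh[1::2, 0::2])
        ev_coarse = np.maximum(ev[0::2, 0::2], ev[0::2, 1::2])
        return eh_coarse, ev_coarse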

The edge maps at the finest resolution j = 0 are encoded using a differential

Freeman chain code inspired by [76]. Each edge along the chain can take one of

the four directions {right, up, left, down} numbered from 0 to 3. The direction of

the nth edge is denoted dn and the differential direction ∆n between two consecutive

edges is defined by

∆n = dn − dn−1. (4.6)


Directions are recovered iteratively from differential directions using the equation

dn = dn−1 + ∆n mod 4 (4.7)

where mod denotes the modulo operator.

The chain code is made of a chain header which stores the location of the

first edge end-point and the first direction and a chain body which stores the

differential directions. In the example of Figure 4.5, the chain code is (14+½, 23+½); 0; −1, 0, 0, +1, 0. Finally, the chain code is entropy coded to generate the

bitstream. Fixed-length codes are used for the header and simple variable-length

codes {−1 : 10, 0 : 0, +1 : 11} are used for the body, to take advantage of the

large number of zeros.
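
As an illustration, the sketch below encodes and decodes a chain of directions with the differential code of (4.6) and (4.7) and the variable-length codes given above. Wrapping the differential directions into {−1, 0, +1} and omitting the fixed-length header are assumptions of this sketch.

    # Direction numbering from the text: 0 = right, 1 = up, 2 = left, 3 = down.
    VLC = {0: "0", -1: "10", +1: "11"}  # variable-length codes for the chain body

    def encode_chain(directions):
        """Differential Freeman chain code of a list of edge directions, as in (4.6)."""
        diffs = []
        for prev, cur in zip(directions, directions[1:]):
            diffs.append((cur - prev + 1) % 4 - 1)  # wrap the difference into {-1, 0, +1, 2}
        bits = "".join(VLC[d] for d in diffs)       # assumes no 180-degree turns (no code for 2)
        return directions[0], diffs, bits

    def decode_chain(first_direction, diffs):
        """Recover the directions from the differential directions, as in (4.7)."""
        dirs = [first_direction]
        for d in diffs:
            dirs.append((dirs[-1] + d) % 4)
        return dirs

    dirs = [0, 3, 0, 0, 1, 1]
    first, diffs, bits = encode_chain(dirs)
    assert decode_chain(first, diffs) == dirs
    print(diffs, bits)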

4.6 Experimental Results

We present experimental results on the Teddy set, shown in Figure 4.1. The DIBR

is 375×450 in size with intensities in the range [0, 255] and depths in [0, 53]. For

simplicity, only the luma channel of the image is considered. Missing values in the

depth maps have been interpolated using in-painting. Edges have been obtained

in a semiautomatic way by applying a Canny edge detector to the depth map and

letting the user choose which edge chains to keep.

We compare the performances of two codecs: one based on the DWT and the

other on the proposed SA-DWT with explicit edge coding. Both codecs perform

a five-level decomposition and rely on the same quantizer and entropy coder, pro-

vided by the SPIHT implementation of the QccPack library. They also rely on the

same 9/7 wavelet for the transform, which is the main wavelet in JPEG2000 [19].

Following [77], both codecs allocate 20% of the bitrate to the depth.


(a) Std9/7sym (b) SA9/7lin

Figure 4.6: Absolute values of the high-pass coefficients of the depth map using standard and shape-adaptive wavelets. The latter provides a much sparser decomposition.

The two codecs differ in their handling of edges: the DWT-based codec uses the

standard 9/7 wavelet with symmetric extension, denoted “std9/7sym,” while the

SA-DWT-based codec uses the shape-adaptive 9/7 wavelet with linear extension,

denoted “SA9/7lin,” for the depth map and the shape-adaptive 9/7 wavelet with

symmetric extension, denoted “SA9/7sym,” for the image. The SA-DWT-based

codec also includes an edge codec, as shown in Figure 4.2. The experiments are

based on the edge map shown in Figure 4.1, which has a bitrate overhead of

0.015 bpp.

Figure 4.6 shows the wavelet coefficients of the depth map obtained using

std9/7sym and SA9/7lin. The standard transform exhibits large values along

edges, which are absent from the shape-adaptive transform. The entropy of the

latter is therefore much reduced.

Figure 4.7 shows the reconstructed depth map at 0.04 bpp using the standard

and shape-adaptive codecs. Even at low bitrates, the latter reconstructs sharp

edges and avoids Gibbs artifacts along edges. Moreover, the sparser nature of the

SA-DWT coefficients means that the shape-adaptive codec is able to spend more


(a) Std9/7sym - 36.21 dB (b) SA9/7lin - 39.86 dB

Figure 4.7: Reconstruction of the depth map at 0.04 bpp using standard and shape-adaptive wavelets. The latter gives sharp edges free of Gibbs artifacts.

bits outside edge areas, which leads for instance to a slightly better reconstruction

of the bottom right of the depth map.

Figure 4.8(a) compares the RD performances of the two codecs for the depth

map. The bitrate consists of the wavelet coefficients, along with the edges in

the shape-adaptive case. The figure shows that the edge overhead is more than

compensated by the reduced entropy of the wavelet coefficients. This leads to

PSNR gains over the whole bitrate range, achieving up to 5.46 dB.

Figure 4.8(b) compares the RD performances of the two codecs for the image.

The bitrate is made of the wavelet coefficients in both cases, the edge overhead

having been accounted for in the depth bitrate. The figure shows PSNR gains over

the whole bitrate range, achieving up to 0.19 dB.

4.7 Conclusion

We have presented a novel codec of DIBRs for free-viewpoint 3D-TV. By replacing

the DWT of classical image codecs by a SA-DWT and adding an edge encoder, we

have been able to obtain significant PSNR gains, achieving up to 5.46 dB. Future

work shall consider the automatic extraction of edges based on RD considerations.


(a) Depth map (b) Image

Figure 4.8: Rate-distortion performances of standard and shape-adaptive wavelets. The latter gives PSNR gains of up to 5.46 dB on the depth map and 0.19 dB on the image.


CHAPTER 5

3D MODEL-BASED FRAME INTERPOLATION FOR DVC

5.1 Introduction

Distributed source coding (DSC) has gained interest for a range of applications such

as sensor networks, video compression, or loss-resilient video transmission. DSC

finds its foundation in the seminal Slepian-Wolf [78] and Wyner-Ziv [79] theorems.

Most Slepian-Wolf and Wyner-Ziv coding systems are based on channel coding

principles [80–86]. The statistical dependence between two correlated sources X

and Y is modeled as a virtual correlation channel analogous to binary symmet-

ric channels or additive white Gaussian noise (AWGN) channels. The source Y

(called the side information) is thus regarded as a noisy version of X (called the

main signal). Using error correcting codes, the compression of X is achieved by

transmitting only parity bits. The decoder concatenates the parity bits with the

side information Y and performs error correction decoding, i.e., MAP or MMSE

estimation of X given the received parity bits and the side information Y .

Compression of video streams can be cast into an instance of side information

coding, as shown by Aaron et al. [87–89] and Puri et al. [90–92]. These schemes

are also referred to as distributed video coding (DVC) systems. A comprehensive

survey of distributed video compression can be found in [93]. One key aspect in the

performance of the system is the mutual information between the side information

and the information being Wyner-Ziv encoded. In current approaches, the side

information is generated via motion-compensated frame interpolation, often using


block-based motion compensation (BBMC) [93]. Motion fields are first computed

between key frames, which may be distant from one another. An interpolated

version of these motion fields is then used to generate the side information for

each WZ frame. The frame interpolation based on these interpolated motion fields

is not likely to lead to the highest possible PSNR, hence to the highest mutual

information between the side information and the Wyner-Ziv encoded frame. To

cope with these limitations, BBMC is embedded in a multiple motion hypothesis

framework in [93,94]. The actual motion vectors are chosen by testing the decoded

frames against hash codes or CRCs.

Here, we address the problem of side information generation in distributed cod-

ing of videos captured by a camera moving in a 3D static environment with Lam-

bertian surfaces. This problem is of particular interest to specialized applications

such as augmented reality, remote controlled robots operating in hazardous envi-

ronments, and remote exploration by drones or planetary probes. We explore the

benefits of more complex motion models belonging to the Structure-from-Motion

(SfM) paradigm [16]. These motion models exhibit strong geometrical properties,

which allow their parameters to be robustly estimated. Note that, unlike predic-

tive coding, DVC has the advantage of not requiring the transmission of motion

model parameters. Therefore, increasing the complexity of motion models, and

thus their ability to accurately represent complex motions, offers potential gains

in mutual information without additional bitrate overheads.

When used in computer vision applications, SfM approaches aim at generating

visually pleasing virtual views [8]. On the other hand, when used in DVC, the

objective is to generate intermediate frames (the side information) with the highest

PSNR. This requires a reliable estimation of the camera parameters as well as

sub-pel precision of the reprojected 3D model, especially in edge regions where

even small misalignments can have a strong impact on the PSNR. In addition,


constraints on latency in applications such as video streaming, as well as memory

constraints, prevent the reconstruction of the 3D scene from all the key frames at

once, as is usually done in SfM. Instead, a sequence of independent 3D models is

reconstructed from pairs of consecutive key frames.

In this chapter, we first describe two 3D model-based frame interpolation meth-

ods relying on SfM techniques, one block-based and one mesh-based, both being

constrained by the epipolar geometry. These first approaches suffer from the follow-

ing limitation. The motion fields associated with the intermediate frames are inter-

polated with the classical assumption of linear motion between key frames. This

creates misalignments between the side information and the actual WZ frames,

which have a strong impact on the rate-distortion (RD) performances of the 3D

model-based DVC solution.

This observation led us to introduce two methods to estimate the intermediate

motion fields with the help of point tracks, instead of interpolating them. The

motion fields are thus obtained by computing the camera parameters at interme-

diate time instants using point tracks. A first technique relies on feature point

tracking at the decoder, each frame being processed independently at the encoder.

In addition to the key frames, the encoder extracts and transmits a limited set

of feature points which are then linked temporally at the decoder. Feature point

tracking at the decoder greatly reduces misalignments, thereby increasing the side

information PSNR with, in turn, a significant impact on the RD performances of

the 3D model-based DVC system. These performances are then further improved

by introducing the tracking at the encoder. The encoder thus shares some limited

information between frames under the form of intensity patches to construct the

tracks sent to the decoder. The latter method has the additional advantage of giv-

ing the encoder a rough estimation of the video motion content, which is sufficient

to decide when to send key frames. Note that the problem of key frame selection


has already been studied in the context of SfM [95] and predictive coding [96].

However, it relied on epipolar geometry estimation at the encoder, which DVC

cannot afford. An alternative to tracking has been proposed in [97], where the

authors advocate the use of statistics on intensities and frame differences.

The remainder of the chapter is organized as follows. Section 5.2 presents the

estimation of the 3D model, while Sections 5.3 and 5.4 describe the model-based

frame interpolation, using the assumption of linear motions in the former and point

tracks in the latter. Finally, Section 5.5 presents our experimental results.

5.2 3D Model Construction

5.2.1 Overview

We begin by presenting a codec based on the assumption of linear motions, that is,

without point tracking. This codec, called 3D-DVC, derives from the DVC codec

described in [93, 98], as outlined in Figure 5.1. At the encoder, the input video

is split into groups of pictures (GOP) of fixed size. Each GOP begins with a key

frame, which is encoded using a standard intracoder (H.264-intra in our case) and

then transmitted. The remaining frames (WZ frames) are transformed, quantized,

and turbo-encoded. The resulting parity bits are punctured and transmitted.

At the decoder, the key frames are decompressed and the side information is

generated by interpolating the intermediate frames from pairs of consecutive key

frames. The turbo-decoder then corrects this side information using the parity

bits. The proposed decoder differs from classical DVC by its novel 3D model

construction and model-based frame interpolation.

In this section, we first describe the 3D model construction, whose overall

architecture is presented in Figure 5.2. Unlike the SfM techniques it extends,


Figure 5.1: Outline of the codec without point tracking (3D-DVC). The proposed codec benefits from an improved motion estimation and frame interpolation (gray boxes).

Figure 5.2: Outline of the 3D model construction.

the proposed model construction focuses on the PSNR of the interpolated frames

to maximize the quality of the side information. Toward that goal, we present

a novel robust correspondence estimation with subpixel accuracy. In particular,

correspondences are scattered over the whole frames and are dense in areas of high

gradients. Furthermore, the 3D model construction is robust to quantization noise.

After introducing some notation, we shall first describe the camera parameter

estimation and then the correspondence estimation.

5.2.2 Notation

We shall use the typesettings a, a, A to denote respectively scalars, column vectors, and matrices. In the following, a(t)j(i) denotes the jth scalar entry of the ith vector of a set at time t. Likewise for matrices, Aij denotes the scalar entry at the ith row and jth column, while Ai: represents its ith row vector.

Moreover, A⊺ denotes the transpose of matrix A, As the column vector obtained by stacking the A⊺i: together, and [.]× the cross-product operator. The identity matrix will be denoted by I and the norms 1 and 2 respectively by ‖.‖1 and ‖.‖2.

We shall use homogeneous vectors, where x ≜ (x, y, 1)⊺ and X ≜ (x, y, z, 1)⊺ represent respectively a 2D and a 3D point. These entities are defined up to scale, i.e., (x, y, 1)⊺ is equivalent to (λx, λy, λ)⊺ for any nonnull scalar λ. Without loss of generality, the two key frames are assumed to have been taken at times t = 0 and t = 1 and are respectively denoted by I(0) and I(1).

5.2.3 Camera parameter estimation

We assume here that an initial set of point correspondences {x(0), x(1)} between the

two key frames is available. It is used to estimate both the camera parameters,

e.g., translation and rotation between the key frames, and the depth associated

with each point correspondence.

Robust weak-calibration

The assumption of a static scene introduces a constraint on correspondences given

by the equation

x(1)⊺ F x(0) = 0 (5.1)

where F is the so-called fundamental matrix from which the camera parameters

shall be extracted.

The robust weak-calibration procedure aims at estimating this fundamental

matrix. As an additional feature, it identifies erroneous correspondences. It con-

sists of three steps:

1. an initial estimation of F and the set of inliers using MAPSAC [99],


2. a first refinement of F and the set of inliers using LO-RANSAC [100],

3. a second refinement of F over the final set of inliers by a nonlinear minimiza-

tion of the Sampson distance [16].
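
For reference, the Sampson distance minimized in the third step can be evaluated in a few lines of numpy; the (N, 3) layout of the homogeneous points is an assumption of this sketch.

    import numpy as np

    def sampson_distances(F, x0, x1):
        """Sampson distance of each correspondence with respect to a fundamental matrix F.

        x0, x1: (N, 3) arrays of homogeneous points in the first and second key frames.
        """
        Fx0 = x0 @ F.T                          # epipolar lines F x0 in the second frame
        Ftx1 = x1 @ F                           # epipolar lines F^T x1 in the first frame
        num = np.sum(x1 * Fx0, axis=1) ** 2     # (x1^T F x0)^2
        den = Fx0[:, 0]**2 + Fx0[:, 1]**2 + Ftx1[:, 0]**2 + Ftx1[:, 1]**2
        return num / den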

Quasi-Euclidean self-calibration and triangulation

The next step is to recover the projection matrices P and the depths λ. We can

choose the world coordinate system (WCS) of the 3D scene to be the camera

coordinate system at time t = 0, leading to P(0) = [I 0]. This leaves four degrees

of freedom in the WCS. They appear in the relation between F and the projection

matrix of the second key frame P(1) ≜ [R(1) t(1)], given by

t(1) ∈ ker (F⊺) and R(1) = [t(1)]×F− t(1) a⊺ (5.2)

where a is an arbitrary 3-vector and t(1) has arbitrary norm. For the time being,

these degrees of freedom are fixed by choosing t(1) with unit norm and setting

a = t(0), where the epipoles t(0) and t(1) are recovered from the singular value

decomposition (SVD) of the matrices F and F⊺, respectively [16]. Since projec-

tion matrices are defined up-to-scale, in the remainder of the chapter they are

normalized so that their Frobenius norm will be √3.

The depths can then be recovered. Let a 3D point X project itself onto a 2D

point x on the camera image plane. These two points are related by λx = PX

where λ is the projective depth. Therefore, correspondences allow the recovery of

a cloud of 3D points by triangulation, solving the system of equations

λ(1)x(1) = λ(0)R(1)x(0) + t(1) (5.3)

for each correspondence.
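
As an illustration, the triangulation of a single correspondence from (5.3) reduces to a small linear least-squares problem in the two projective depths; the sketch below assumes P(0) = [I 0], so that the recovered 3D point is simply λ(0)x(0).

    import numpy as np

    def triangulate(R1, t1, x0, x1):
        """Recover the projective depths (lambda0, lambda1) of one correspondence via (5.3).

        x0, x1: homogeneous 2D points (3-vectors); R1, t1: the two blocks of the
        second projection matrix P(1) = [R1 t1].
        """
        A = np.column_stack((R1 @ x0, -x1))               # 3 x 2 system in (lambda0, lambda1)
        depths, *_ = np.linalg.lstsq(A, -t1, rcond=None)
        lam0, lam1 = depths
        X = lam0 * x0                                     # 3D point, valid because P(0) = [I 0]
        return lam0, lam1, X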


The initial choice of WCS is refined by quasi-Euclidean self-calibration [101],

so that the WCS is as Euclidean as possible. We also constrain the depths in

the new WCS to be bounded by 1 and M to reduce numerical issues during the

bundle adjustment detailed later. Assuming that the camera parameters cannot

undergo large variations between key frames, we look for a matrix R(1) as close

as possible to the identity matrix and compatible with the fundamental matrix F.

The optimal vector a is then found by minimizing ‖R(1) − t(1)a⊺ − I‖1 under the linear constraints

max(λ(0), λ(1))/M ≤ λ(0)x(0)⊺a− 1 ≤ min(λ(0), λ(1)). (5.4)

A lower bound on the value of M is given by max(λ(0), λ(1))/ min(λ(0), λ(1)). The

self-calibration starts with this value and increases it until the linear programming

problem admits a solution.

Bundle adjustment

The camera parameters and the depths obtained so far suffer from the bias inherent

to linear estimation [102]. They can be refined by minimizing their Euclidean

reprojection error on the key frames. First, the basis of the projective space has

to be fixed to prevent it from drifting. This is done by fixing two 3D points

and performing the optimization over a reduced parameter space. As shown in

Appendix A, the 12-dimensional projection matrix P(1) can be expressed as a

linear combination of an 8-dimensional unit vector r, i.e., P(1)s =√

3Wr where W

is an orthonormal matrix. The two fixed 3D points are chosen randomly under the

constraints that they have small reprojection errors and that they are far from the

epipoles and from each other.


The minimization of the Euclidean reprojection error is defined as

\min_{\{\lambda^{(0)}\},\,r} \; J\!\left( \{x^{(0)}\}, \{\lambda^{(0)}\}, \{x^{(1)}\}, \sqrt{3}\,W r \right)
\quad \text{such that} \quad \|r\|_2 = 1 \qquad (5.5)

where J is the Euclidean reprojection error given by

J(\{x^{(0)}\}, \{\lambda^{(0)}\}, \{x^{(t)}\}, P^{(t)s}) \;\triangleq\;
\sum_i \left( x^{(t)}(i) - \frac{\left[ \lambda^{(0)}(i)\, x^{(0)\intercal}(i) \;\; 1 \right] P^{(t)s}_{1:4}}{\left[ \lambda^{(0)}(i)\, x^{(0)\intercal}(i) \;\; 1 \right] P^{(t)s}_{9:12}} \right)^{\!2}
+ \sum_i \left( y^{(t)}(i) - \frac{\left[ \lambda^{(0)}(i)\, x^{(0)\intercal}(i) \;\; 1 \right] P^{(t)s}_{5:8}}{\left[ \lambda^{(0)}(i)\, x^{(0)\intercal}(i) \;\; 1 \right] P^{(t)s}_{9:12}} \right)^{\!2}. \qquad (5.6)

The minimization is solved using an alternated reweighted linear least square ap-

proach [103], as detailed in Appendix B. A refined fundamental matrix is then

deduced using F = [t(1)]×R(1) and refined point correspondences are obtained us-

ing reprojection.

5.2.4 Correspondence estimation

We now turn to the estimation of the set of point correspondences {x(0), x(1)}

between the two key frames.

Feature point detection

First, feature points {x} are detected on each key frame independently. We use

the Harris-Stephen corner detector [104] to find feature points. Its sensitivity is

adapted locally to spread feature points over the whole frame [16], which improves

the weak-calibration detailed previously.


Feature point matching

Feature points are then matched across key frames to form correspondences. All

pairs of feature points {x(0), x(1)} are considered as candidate correspondences. A

first series of tests eliminates blatantly erroneous correspondences:

1. Correspondences with very large motions are discarded.

2. Correspondences with dissimilar intensity distributions in the neighborhoods

around feature points are discarded. Distributions are approximated using

Parzen windows [105, §4.3] and sampled uniformly to obtain histograms. The

similarity of histograms is tested using the χ2-test [66, §14.3].

Subpixel refinement

The locations of the remaining correspondences are then refined locally by search-

ing for the least mean square error (MSE) between neighborhoods around feature

points. The minimization is solved using the Levenberg-Marquardt algorithm [66, §15.5]. This refinement compensates for errors from the Harris-Stephen detector,

leads to values of MSE independent of image sampling [106], and computes feature

point locations with subpixel accuracy.

Outlier removal

A second series of tests is applied to eliminate erroneous correspondences. These

tests are performed when camera parameter estimation provides the necessary

information (see Figure 5.2).

1. MSE: Correspondences with large MSE are discarded.

2. Unicity: Each feature point is only allowed to belong to at most one cor-

respondence. This unicity constraint is enforced using the Hungarian algo-


rithm [68, §I.5], which keeps only the correspondences with the least MSE

when the unicity constraint is violated.

3. Epipolar: Correspondences identified as outliers during the robust weak-ca-

libration are discarded.

4. Motion: Correspondences with aberrant motion are removed by testing the

difference between their motion and the weighted median of the motions of

their neighbors positioned on Delaunay triangles [107, Chap. 9]. Neighbors

are assigned weights inversely proportional to their distances.

5. Reprojection: The projection of the 3D points must be close to the actual

2D points.

6. Chirality: All products of projective depths λ(0)λ(1) must have the same

sign [108, Th. 17].

7. Epipole: Correspondences must be far from epipoles, since triangulation is

ill-conditioned around them.

8. Depth: Points with aberrant depths compared to the depths of their neigh-

bors are removed.

Correspondence propagation

The set of correspondences obtained so far is reliable and accurate but still fairly

sparse. It is first densified over unmatched feature points by increasing the toler-

ance of the tests described previously and enforcing motion smoothness using the

weighted median of neighboring motions.

Correspondences are then propagated along edges, under the epipolar con-

straint. The goal of this procedure is to get accurate motion information in edge


regions, where even a slight misalignment can lead to large MSE, degrading the

side information PSNR.

The intersections of edges with epipolar lines define points, except where epipo-

lar lines and edge tangents are parallel. Therefore, correspondences can be obtained

not only at feature points, but also along edges. Edges are found in the first key

frame using the Canny edge detector [12]. Correspondences are propagated along

edges, starting from correspondences between the feature points. At each itera-

tion, edge-points around previously known correspondences are selected and their

motions are initialized to those of their nearest neighbors. Their motions are then

improved by full search over small windows along associated epipolar lines, mini-

mizing the MSE between intensity neighborhoods. Their motions are finally refined

by a Golden search [66, §10.1] to obtain subpixel accuracy. The robustness of this

procedure is increased by removing edge-points too close to the epipole, as well as

those whose edge tangents are close to epipolar lines or which have large MSE.

5.2.5 Results

Figures 5.3(a) and 5.3(b) show the correspondences obtained between the first

two key frames of the sequences street and stairway, after respectively robust

weak-calibration and correspondence propagation. In both cases, the epipolar

geometry was correctly recovered and correspondences are virtually outlier-free.

Moreover, propagation greatly increases the correspondence density, from 394 cor-

respondences to 9982 for the street sequence and from 327 correspondences to

6931 for the stairway sequence. The two figures also underline some intrinsic lim-

itations of the SfM approach. First, the street sequence has epipoles inside the

images, as can be seen by the converging epipolar lines. Since triangulation is

singular around the epipoles, there are no correspondences in their neighborhoods.


(a) Epipolar geometry after robust weak calibration

(b) Correspondences after propagation along edges

Figure 5.3: Correspondences and epipolar geometry between the two first lossless key frames of the sequences street and stairway. Feature points are represented by red dots, motion vectors by magenta lines ending at feature points, and epipolar lines by green lines centered at the feature points.

Second, the stairway sequence contains strong horizontal edges whose tangents are

nearly parallel to the epipolar lines. This explains why so few correspondences

were found in this region, while the wall is covered by correspondences.

5.3 3D Model-Based Interpolation

The frame interpolation methods developed in this chapter rely on the projection

of the 3D scene onto camera image planes. This projection requires the knowledge

of the projection matrices associated with the interpolated frames as well as the

knowledge of a dense motion field between the frame being interpolated and each

of the two key frames. We consider two motion models to obtain dense motion


fields from the correspondences and the projection matrices: one block-based and

one mesh-based, both being constrained by the epipolar geometry.

5.3.1 Projection-matrix interpolation

The projection matrices at intermediate time instants can be recovered by gener-

alizing the bundle-adjustment equation (Equation 5.5) to three or more frames

\min_{\{\lambda^{(0)}\},\,\{P^{(t)s}\}} \; \sum_t J\!\left( \{x^{(0)}\}, \{\lambda^{(0)}\}, \{x^{(t)}\}, P^{(t)s} \right)
\quad \text{such that} \quad
\begin{cases}
P^{(1)s} = \sqrt{3}\,W r \\
\|r\|_2^2 = 1 \\
\|P^{(t)s}\|_2^2 = 3, \quad 0 < t < 1
\end{cases} \qquad (5.7)

In this equation, the projection matrices P(t) are independent of one another.

They are therefore solutions of simple reweighted linear least square problems.

Since the locations {x(t)(i)} of the feature points on the intermediate frames

are unknown to the decoder, they need to be interpolated by assuming for instance

that they undergo linear motions

x(t)(i) = (1− t)x(0)(i) + tx(1)(i). (5.8)

The codec based on this assumption is called 3D-DVC. Section 5.4 will present

two other codecs that make use of additional information from the encoder to avoid

this assumption.

5.3.2 Frame interpolation based on epipolar blocks

In the first motion model, each intermediate frame is divided into blocks whose

unknown texture is to be estimated. The search space of the motion vectors


Figure 5.4: Trifocal transfer for epipolar block interpolation.

is limited by the epipolar constraints and trifocal transfer [109]. As shown in

Figure 5.4, given a block located at x(t) in the intermediate frame, its corresponding

blocks in the key frames lie along the epipolar lines l(0) and l(1). For a given

candidate location in a reference key frame, say x(0) in I(0), the location of the

corresponding block x(1) in the other key frame I(1) is uniquely defined via trifocal

transfer: triangulation of points x(0) and x(t) gives the 3D point X, which is then

projected onto I(1) to give x(1). The key frame whose optical center is the farthest

away from the optical center of the interpolated frame is chosen as the reference

key frame, so that the equations of trifocal transfer are best conditioned.

As outlined in Figure 5.5, the algorithm initializes the motions of the blocks

using the motions of the nearest correspondences. It then refines them by mini-

mizing the MSE of the block textures in the key frames, using a local full search

along the epipolar lines followed by a Golden search [66, §10.1] to obtain subpixel

accuracy. Since trifocal transfer is singular around epipoles, the motions of the

blocks too close to the epipoles are not refined. Finally, the block textures from

the key frames are linearly blended based on time to obtain the texture of the

block in the interpolated frame.
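
As an illustration, the trifocal transfer described above can be sketched as a linear (DLT) triangulation followed by a reprojection; the projection matrices are assumed known and the 2D points are assumed to be given in homogeneous form (x, y, 1).

    import numpy as np

    def triangulate_dlt(P_a, P_b, x_a, x_b):
        """Linear (DLT) triangulation of a homogeneous 3D point from two views."""
        A = np.vstack([
            x_a[0] * P_a[2] - P_a[0],
            x_a[1] * P_a[2] - P_a[1],
            x_b[0] * P_b[2] - P_b[0],
            x_b[1] * P_b[2] - P_b[1],
        ])
        _, _, Vt = np.linalg.svd(A)
        return Vt[-1]                       # homogeneous 4-vector X

    def trifocal_transfer(P_ref, P_mid, P_other, x_ref, x_mid):
        """Transfer a point seen in the reference key frame and the intermediate frame to the other key frame."""
        X = triangulate_dlt(P_ref, P_mid, x_ref, x_mid)
        x = P_other @ X
        return x / x[2]                     # dehomogenize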


Figure 5.5: Outline of the frame interpolation based on epipolar blocks.

5.3.3 Frame interpolation based on 3D meshes

In the second motion model, the first key frame is divided into blocks which are

themselves subdivided into pairs of triangles, thus forming a triangular mesh. Each

vertex i is associated with two locations {x(0)(i), x(1)(i)}, one in each key frame.

Due to the epipolar geometry, the second location is constrained to lie on an

epipolar line such that

x(1)(i) = q(i) + λ(i)t(i) (5.9)

where t(i) is a line tangent vector, q(i) a point on the line, and λ(i) a scalar. All

these quantities are stacked together to form a matrix T and two vectors q and λ.

Likewise, the point correspondences obtained in Section 5.2 are stacked into two

location vectors x(0) and x(1).

As outlined in Figure 5.6, the mesh is first fitted to the set of correspondences.

Mesh fitting is cast into a minimization problem using a Tikhonov regularization

approach [110]. The motion inside each triangle is assumed to be affine, which

approximates the projection of a piecewise-planar 3D mesh. Therefore, the motion

of any point in the first key frame can be written as a linear combination of

the motions of the mesh vertices. Let us represent these linear combinations by

the matrix M. The mesh also has an internal smoothness, which favors small

differences between the motion of a vertex and the average motion of its four

neighbors. Since the average is a linear operation, it can be represented by a matrix


Figure 5.6: Outline of the frame interpolation based on 3D meshes.

N. Let µ be a scalar controlling the smoothness of the mesh. The minimization

problem is then

\min_{\lambda} \; \|x^{(1)} - M(q + T\lambda)\|_2^2 + \mu^2\, \|(I - N)(q + T\lambda)\|_2^2. \qquad (5.10)

This is a linear least-square (LLS) problem, which can readily be solved.
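
As an illustration, (5.10) can be solved by stacking the data term and the smoothness term into a single overdetermined linear system; the shapes of M, N, T, q, and x(1) follow the stacking convention described above and are otherwise assumptions of this sketch.

    import numpy as np

    def fit_mesh(x1, M, N, T, q, mu):
        """Solve the Tikhonov-regularized mesh fitting problem (5.10) for lambda.

        Both quadratic terms are rewritten as residuals of one linear system
        A lambda = b, which is then solved by linear least squares.
        """
        smooth = mu * (np.eye(N.shape[0]) - N)
        A = np.vstack((M @ T, smooth @ T))
        b = np.concatenate((x1 - M @ q, -smooth @ q))
        lam, *_ = np.linalg.lstsq(A, b, rcond=None)
        return lam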

Since LLS is not robust to outliers, an additional step removes them. Outliers

are detected by testing whether the mesh triangles abide by these two criteria:

1. They must have the same orientation in both key frames.

2. The motion compensation errors must be small.

Correspondences inside triangles failing these tests are considered outliers. Once

they have been removed, the mesh is fitted again. This process is iterated until all

triangles pass the tests.

Finally, the mesh is reprojected onto intermediate frames using trifocal transfer.

The key frames are warped using 2D texture mapping [65] and linearly blended

based on time.

5.3.4 Comparison of the motion models

Epipolar block motion fields approximate well depth discontinuities but only pro-

vide a fronto-parallel approximation of the 3D surfaces. On the other hand, mesh-

based motion fields are able to approximate 3D surfaces with any orientations and


(a) Epipolar block matching (b) 3D mesh

(c) Classical block matching

Figure 5.7: Norm of the motion vectors between the first two lossless key frames of the stairway sequence for epipolar block matching (a), 3D mesh fitting (b), and classical 2D block matching (c).

are more robust to outliers due to their internal smoothness. At the same time

they tend to over-smooth depth discontinuities and they do not model occlusions.

These properties are clearly visible in Figure 5.7, which shows the norm of the mo-

tion vectors on the stairway sequence obtained by the two motion estimations. The

same figure also displays the motion field obtained using classical 2D block-based

motion estimation. In comparison, both proposed motion estimations exhibit a

reduced number of outliers.


5.4 3D Model-Based Interpolation with Point

Tracking

5.4.1 Rationale

The fact that the above 3D model-based interpolation techniques barely increase

the PSNR of the side information comes from the underlying assumption that the

tracks have linear motion (5.8) during the estimation of the intermediate projection

matrices (5.7), which gives inaccurate projection matrices. Since the motion fields

are obtained by projecting 3D points or a 3D mesh onto image planes, inaccurate

projection matrices lead to misalignments between the interpolated frames and the

actual WZ frames. These misalignments then create large errors in regions with

textures or edges, which penalize the PSNR.

Instead of linearly interpolating correspondences to obtain tracks, it is proposed

here to detect actual tracks from the original frames. The linear motion assumption

represented by (5.8) is thus not used any more. We propose two methods to achieve

this goal: one tracking points at the decoder and one tracking them at the encoder.

These methods lead to two DVC codecs referred to as 3D-DVC-TD and 3D-DVC-

TE, respectively. In both methods, a set of feature points is extracted at the

encoder with a Harris-Stephen feature-point detector [104]. When the tracking is

performed at the decoder, the set of feature points is encoded and transmitted to

the decoder. When the tracking is done at the encoder, a list of tracked positions

per feature point is encoded and transmitted. Unlike previous works [93, 94], no

information is sent about the actual intensities of the WZ frames.

Computing and transmitting the feature points or tracks introduces overheads

on the encoder complexity and on the bandwidth. However, these overheads are

minor because only a small number of feature points is required to estimate the

eleven parameters of each intermediate projection matrix. Moreover, in the case of


Figure 5.8: Outline of the codec with tracking at the decoder (3D-DVC-TD).

tracking at the encoder, statistics on tracks allow the encoder to select key frames

based on the video motion content, thus increasing bandwidth savings.

5.4.2 Tracking at the decoder

This codec, called 3D-DVC-TD, builds upon the 3D-DVC solution based on the

frame interpolation techniques presented in Section 5.3, and includes in addition

a Harris-Stephen feature-point detector [104] and a feature point encoder (see

Figure 5.8). The decoder receives these points and matches them to form tracks. It

starts by matching points between key frames using three constraints: the epipolar

geometry, motion smoothness with previously matched correspondences, and the

MSE. It then creates snakes [111] between key frames assuming a linear motion

and optimizes them to fit the set of received points. The optimization procedure

solves an LLS problem with equations similar to those of mesh fitting (5.10). To

make it more robust to outliers, points too far away from the snakes are ignored

and those close enough are given weights which decrease as the square of their

distance to the snakes.

The set of feature points is then transformed into a list Lt = {xt(i), yt(i)}, i =

1 . . . N, where N is the number of feature points considered (on average N = 176 in


Figure 5.9: Outline of the codec with tracking at the encoder (3D-DVC-TE).

the experiments reported in the chapter), by scanning the image column by column.

Due to the chosen scanning order, the horizontal component of the feature-point

coordinates varies slowly, so it is encoded using differential pulse code modulation

(DPCM) followed by arithmetic coding. On the other hand, the vertical component

varies rapidly and is encoded using fixed-length codes.
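
As an illustration, the DPCM part of this scheme amounts to transmitting the first value followed by successive differences; the arithmetic coding of the residuals is omitted and the coordinates below are only an example.

    def dpcm_encode(values):
        """DPCM: the first value is kept as-is, then only successive differences are sent."""
        return [values[0]] + [b - a for a, b in zip(values, values[1:])]

    def dpcm_decode(symbols):
        out = [symbols[0]]
        for d in symbols[1:]:
            out.append(out[-1] + d)
        return out

    xs = [3, 3, 4, 4, 4, 7, 9]              # slowly varying horizontal coordinates
    assert dpcm_decode(dpcm_encode(xs)) == xs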

5.4.3 Tracking at the encoder

Similarly, this codec, called 3D-DVC-TE, extends the 3D-DVC solution presented

in Section 5.3 by adding the Harris-Stephen feature-point detector [104], a point

tracker and a point-track encoder, at the encoder (see Figure 5.9). Therefore,

unlike the two previous codecs, limited information is shared between the frames.

The encoder detects feature points on the current key frame and tracks them

in the following frames until one of the following two stopping criteria is met:

either the length of the longest track becomes large enough or the number of lost

tracks becomes too large. The former criterion enforces that key frames sufficiently

differ from one another, while the latter criterion ensures that the estimation of

intermediate projection matrices is always a well-posed problem. Once a stopping

criterion is met, a new key frame is transmitted and the process is reiterated.


Tracking relies on the minimization of sum of absolute differences (SAD) between

small blocks around point tracks. The minimization only considers integer pixel

locations and is biased toward small motions to avoid the uncertainty due to large

search regions. It begins by a spiral search around the location with null motion.

Once a small SAD is detected, it continues by following the path of least SAD,

until a local minimum is found. Tracks for which no small SAD can be found are

discarded.
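
As an illustration, the sketch below tracks a single feature point with a spiral search followed by a greedy descent along the path of least SAD. The block size, search radius, and SAD threshold are illustrative values, and the sketch assumes the point and its search neighborhood stay well inside both frames.

    import numpy as np

    def sad(frame_a, frame_b, p, q, half=4):
        """Sum of absolute differences between blocks centered at p in frame_a and q in frame_b."""
        a = frame_a[p[0]-half:p[0]+half+1, p[1]-half:p[1]+half+1].astype(np.int32)
        b = frame_b[q[0]-half:q[0]+half+1, q[1]-half:q[1]+half+1].astype(np.int32)
        return int(np.abs(a - b).sum())

    def spiral_offsets(radius):
        """Candidate motions ordered by increasing magnitude (bias toward small motions)."""
        offsets = [(dr, dc) for dr in range(-radius, radius + 1) for dc in range(-radius, radius + 1)]
        return sorted(offsets, key=lambda o: o[0] ** 2 + o[1] ** 2)

    def track_point(frame_a, frame_b, p, radius=7, good_sad=500):
        """Integer-pel tracking of the feature point p from frame_a into frame_b."""
        # Spiral search around the null-motion hypothesis, stopping at the first small SAD.
        best_s, best_q = None, None
        for dr, dc in spiral_offsets(radius):
            q = (p[0] + dr, p[1] + dc)
            s = sad(frame_a, frame_b, p, q)
            if best_s is None or s < best_s:
                best_s, best_q = s, q
            if s < good_sad:
                break
        # Follow the path of least SAD among the 4-neighbors until a local minimum is reached.
        while True:
            cands = [(sad(frame_a, frame_b, p, (best_q[0] + dr, best_q[1] + dc)),
                      (best_q[0] + dr, best_q[1] + dc))
                     for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))]
            s, q = min(cands)
            if s >= best_s:
                return best_q, best_s       # tracks whose final SAD remains large would be discarded
            best_s, best_q = s, q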

For each tracked feature point of coordinate (i, j), a list Lx(i, j) = {xt, yt} of

its tracked position at the different instants t is formed. This list of coordinates is

then encoded using a DPCM followed by arithmetic coding.

5.5 Experimental Results

We have assessed the performance of the 3D-DVC incorporating the two interpo-

lation methods based on the SfM paradigm as well as the variants of this codec

augmented with the feature point tracking either at the encoder (3D-DVC-TE) or

at the decoder (3D-DVC-TD). These codecs were implemented by replacing the

frame interpolation of the 2D DVC codec [98] and adding point tracking. The key

frame frequency was estimated automatically in 3D-DVC-TE and set to one key

frame every 10 frames in 3D-DVC and 3D-DVC-TD.

Experimental results are presented on three sequences: street, stairway, and

statue. The first two, shown in Figure 5.3, are CIF at 30 Hz with 50 frames. The

third, shown in Figure 5.10, is CIF at 25 Hz with 50 frames. These sequences contain

drastically different camera motions, as can be seen from the motion vectors and

the epipolar geometries. In the first one, the camera has a smooth motion, mostly

forward. In the second one, the camera has a lateral translational motion with

hand jitter, creating motions of up to 7 pixels between consecutive frames. In the


(a) Lossless key frames

(b) Quantized key frames (QP42)

Figure 5.10: Correspondences and epipolar geometry between the two first key frames of the sequence statue. Feature points are represented by red dots, motion vectors by magenta lines ending at feature points, and epipolar lines by green lines centered at the feature points.

last one, the camera has a lateral rotational motion with hand jitter, which creates

a large occlusion area around the statue.

5.5.1 Frame interpolation without tracking (3D-DVC)

In DVC, the key frames are first quantized and encoded. It is thus essential

to assess the performance of the different techniques designed in this context.

Figure 5.10 shows that the 3D model estimation behaves well even with coarsely

quantized key frames. The motion vectors and the epipolar geometries at QP42


remain similar to those of lossless coding, the major difference lying in the density

of correspondences.

Figure 5.11 shows the PSNR of the interpolated frames obtained with the differ-

ent interpolation methods. Introducing the epipolar or 3D geometry constraints alone in the interpolation process does not have a significant impact on the

PSNR of the interpolated frames. This can be explained by the fact that the result-

ing interpolated motion fields create misalignments between the side information

and WZ frames (see Figure 5.12). We will see in Section 5.5.4 that this translates

into poor RD performances of the 3D-DVC solution.

5.5.2 Frame interpolation with tracking at the encoder (3D-DVC-TE)

Figure 5.11 shows that 3D frame interpolation aided by point tracks consistently

outperforms both 3D frame interpolation without point tracks (3D-DVC) and clas-

sical 2D block matching (2D-DVC), bringing at times improvements of more than

10 dB. This results from the fact that misalignments between the side informa-

tion and WZ frames are greatly reduced by estimating the intermediate projection

matrices from actual tracks, instead of assuming linear track motions (see Fig-

ure 5.12). Table 5.1 summarizes the average PSNR of the different interpolation

methods. It shows that, when used jointly with the feature point tracking to

correct misalignments, the mesh-fitting interpolation method is superior to the

epipolar block-based method in both sequences, bringing average PSNR gains up

to 0.7 dB.

Tracking has two drawbacks: it introduces a bit-rate overhead and increases

the coder complexity. The bit-rate overhead represents less than 0.01 b/pixel.

Compared to classical 2D BBMC coders, the complexity overhead is very limited

due to the small number of tracks. Assuming 8 × 8 blocks for 2D BBMC, a CIF


Figure 5.11: PSNR of interpolated frames using lossless key frames (from top to bottom: sequences street, stairway, and statue). Missing points correspond to key frames (infinite PSNR).


(a) 2D-DVC (b) 3D-DVC

(c) 3D-DVC-TE (d) 3D-DVC-TD

Figure 5.12: Correlation noise for GOP 1, frame 5 (center of the GOP) of the stairway sequence, using lossless key frames: 2D-DVC with classical block matching (a), 3D-DVC with mesh model and linear tracks (b), 3D-DVC-TE with mesh model and tracking at the encoder (c), and 3D-DVC-TD with mesh model and tracking at the decoder (d). The correlation noise is the difference between the interpolated frame and the actual WZ frame.

Table 5.1: Average PSNR (in dB) of interpolated frames using lossless key frames.

                                      street   stairway   statue
2D-DVC (classical block match.)        22.3      19.0      21.3
3D-DVC (epipo. block match.)           23.3      20.0      22.2
3D-DVC (mesh fitting)                  23.3      20.1      22.3
3D-DVC-TD (epipo. block match.)        28.4      24.7      25.5
3D-DVC-TD (mesh fitting)               28.9      25.0      25.6
3D-DVC-TE (epipo. block match.)        28.6      28.0      27.4
3D-DVC-TE (mesh fitting)               29.3      28.5      27.6


Figure 5.13: PSNR of key frames and interpolated frames of the street sequence using 3D-DVC-TE with mesh fitting on lossy key frames. Peaks correspond to key frames.

frame has (352/8)×(288/8) = 1584 blocks. On the other hand, the average number

of tracks is 128. Therefore, the complexity of the proposed tracking is only 8% of

a 2D BBMC.

Figure 5.13 shows the robustness of the proposed frame interpolation to quan-

tization noise, the quality of the interpolated frames degrading gracefully as the

bitrate is decreased. In the context of quantization, a larger quantization step size

decreases both the bitrate and the PSNR. However, it is not the case with frame in-

terpolation: a larger bitrate does not necessarily imply a larger PSNR. This figure

also shows that the PSNR of interpolated frames actually decreases more slowly

than that of key frames. Since quantization reduces the high-frequency content

of the key frames, it reduces the impact of interpolation misalignments. It also

reduces the impact of the low-pass effects of warping, both spatial and temporal.

The PSNR of interpolated frames has a ceiling value at about 30 dB. It is

quickly attained as the QP of key frames is decreased: this PSNR is about the

same whether key frames are losslessly encoded or quantized at QP = 26. In terms

of bitrate allocation, it means that in order to reach higher PSNR, more bits need

to be spent on parity bits.

Finally, although the objective quality strongly peaks at key frames, the sub-

jective quality is nearly constant, as shown in Figure 5.14. Both sources of errors,


(a) Frame 1 (PSNR: 36.5 dB)

(b) Frame 5 (PSNR: 28.4 dB)

Figure 5.14: Variation of the subjective quality at QP = 26 between a key frame (frame 1) and an interpolated frame (frame 5). In spite of a PSNR drop of 8.1 dB, both frames have a similar subjective quality.

misalignments and low-pass effects, are barely noticeable. This does not mean,

however, that they are not a limiting factor of the codec's overall performance,

because parity bits correct objective errors, not subjective ones.

5.5.3 Frame interpolation with tracking at the decoder (3D-DVC-TD)

Figures 5.11 and 5.12 show that point tracking at the decoder is also able to

greatly reduce misalignments and to consistently outperform 2D-DVC and 3D-

DVC. PSNR values obtained by 3D-DVC-TD are nearly constant inside each GOP

on the street sequence. The superiority of tracking at the encoder (3D-DVC-TE) is

in part due to the possibility of inserting new key frames and restarting tracking in


case of difficult motions. As in the 3D-DVC-TE case, overheads are also limited.

An average of 176 feature points are detected at the encoder, which leads to a

bitrate overhead of 0.02 b/pixel. The complexity overhead is similar to that of

intracoding.

5.5.4 Rate-distortion performances

Figure 5.15 compares the rate-distortion performances of the proposed mesh-based

codecs with version JM 9.5 of the H.264/AVC reference software and the 2D-DVC

software presented in [98]. The three proposed 3D codecs only differ from this

2D-DVC software by their side information generation. The 2D DVC software has

GOPs I-WZ-I, while H.264/AVC is tested in three modes: pure intra, inter-IPPP

with motion search, and IPPP with null motion vectors.

The 3D codecs with alignment (3D-DVC-TD and 3D-DVC-TE) strongly out-

perform the 3D codec without alignment (3D-DVC), confirming the need for precise

motion alignment. Compared to 2D-DVC, 3D-DVC-TE is superior over the whole

range of bitrates on all sequences while 3D-DVC-TD is superior on the street se-

quence over the whole range of bitrates, on the stairway sequence up to 990 kb/s

and on the statue sequence up to 740 kb/s. Note, however, that this RD gain

was achieved at the expense of the generality of the codec, 2D-DVC being able to

handle sequences with generic 2D motions.

Finally, compared to H.264/AVC, both 3D codecs with alignment outperform

intracoding and underperform intercoding with motion search. Since 3D-DVC-TE

benefits from limited interframe information at the encoder, it is also compared

to intercoding without motion search. The 3D codec is superior for bitrates up to

890 kb/s on the street sequence, over the whole range of bitrates on the stairway

sequence, and up to 1.4 Mb/s on the statue sequence.


Figure 5.15: Rate-distortion curves (PSNR in dB versus total rate in Mb/s) for H.264/AVC intra, H.264/AVC inter-IPPP with null motion vectors, H.264/AVC inter-IPPP, 2D-DVC I-WZ-I, and the three proposed 3D codecs (top left: street, top right: stairway, bottom: statue).


5.6 Conclusion

In this chapter, we have explored the benefits of the SfM paradigm for distributed

coding of videos of static scenes captured by a moving camera. The SfM approach

allows introducing scene geometrical constraints in the side information generation

process. We have first developed two frame interpolation methods based either on

block matching along epipolar lines or 3D-mesh fitting. These techniques make use

of a robust feature-point matching algorithm leading to semidense correspondences

between pairs of consecutive key frames. The resulting interpolated motion fields

show a reduced number of outliers compared with motion fields obtained from 2D

block-based matching and interpolation techniques. It has been observed that this

property does not translate into significant side information PSNR gains, because

of misalignment problems between the side information and the Wyner-Ziv en-

coded frames. This limitation has been overcome by estimating the intermediate

projection matrices from point tracks obtained either at the encoder or at the

decoder, instead of interpolating them. It has led to major side information PSNR

improvements with only limited overheads, both in terms of bitrate and encoder

complexity. As an additional feature, point tracking at the encoder enables a rough

estimation of the video motion content, which is sufficient to select the key frames

adaptively. The RD performance of the resulting three DVC schemes has been

assessed against several techniques, showing the benefits of the 3D model based

interpolation methods augmented with feature point tracking for the type of ap-

plication and content considered. Further studies would be needed to extend the

proposed frame interpolation technique to videos with more generic motion fields

and to assess such methods against solutions in which limited motion search would

be considered at the encoder.


5.7 Acknowledgments

This work has been done in collaboration with Christine Guillemot and Luce

Morin, INRIA, Rennes, France. It has been partly funded by the European Com-

mission in the context of the network of excellence SIMILAR and of the IST-

Discover project. The author is thankful to the IST development team for its

original work on the IST-WZ codec [98] and the Discover software team for the

improvements they brought.


CHAPTER 6

CONCLUSION

In this dissertation, we have studied issues related to free-view 3D-video, and in

particular issues of 3D scene reconstruction, compression, and rendering. We have

presented four main contributions. First, we have presented a novel algorithm

which performs surface reconstruction from planar arrays of cameras and gener-

ates dense depth maps with multiple values per pixel. Second, we have introduced

a novel codec for the static depth-image-based representation, which jointly es-

timates and encodes the unknown depth map from multiple views using a novel

rate-distortion optimization scheme. Third, we have proposed a second novel codec

for the static depth-image-based representation, which relies on a shape-adaptive

wavelet transform and an explicit representation of the locations of major depth

edges to achieve major rate-distortion gains. Finally, we have proposed a novel al-

gorithm to extract the side information in the context of distributed video coding

of 3D scenes.

Several issues remain to be solved to obtain a telepresence system offering at

the same time realism, stereopsis, mobility, and interactivity. The efficient com-

pression of the depth maps with multiple values per pixel obtained in Chapter 2 is

still an open problem, which might be addressed by an extension of image-coding

techniques based on geometric wavelets [112]. Moreover, little is still known about

the generalization of 3D reconstruction and compression from static to dynamic

scenes. Finally, efficient approximations of the algorithms proposed here also need

to be designed to meet the constraints of real-time applications.


APPENDIX A

FIXING THE PROJECTIVE BASIS

During the nonlinear optimization of projection matrix $\mathbf{P}^{(1)}$, the projective basis is fixed by setting $\mathbf{P}^{(0)} = [\,\mathbf{I} \;\; \mathbf{0}\,]$ and choosing two points $\{X_{(1)}, X_{(2)}\}$ and their projections. We would like to obtain a minimum parameterization of $\mathbf{P}^{(1)}$. The two points induce six constraints on $\mathbf{P}^{(1)}$, four of which are independent. Each point is associated with an equation of the form $\lambda^{(1)} x^{(1)} = \mathbf{P}^{(1)} X$. Using the third component to solve for $\lambda^{(1)}$, we obtain $x^{(1)}\,\mathbf{P}^{(1)}_{3:} X = \mathbf{P}^{(1)}_{1:} X$ and $y^{(1)}\,\mathbf{P}^{(1)}_{3:} X = \mathbf{P}^{(1)}_{2:} X$.

These equations can be rewritten as $\mathbf{A}\,\mathbf{P}^{(1)s} = \mathbf{0}$, where $\mathbf{A}$ is defined as

$$\mathbf{A} \triangleq \begin{bmatrix} X_{(1)}^\intercal & \mathbf{0}^\intercal & -x^{(1)}_{(1)} X_{(1)}^\intercal \\ \mathbf{0}^\intercal & X_{(1)}^\intercal & -y^{(1)}_{(1)} X_{(1)}^\intercal \\ X_{(2)}^\intercal & \mathbf{0}^\intercal & -x^{(1)}_{(2)} X_{(2)}^\intercal \\ \mathbf{0}^\intercal & X_{(2)}^\intercal & -y^{(1)}_{(2)} X_{(2)}^\intercal \end{bmatrix}. \tag{A.1}$$

Taking the SVD of $\mathbf{A}$ gives

$$\mathbf{A} = \mathbf{U} \begin{bmatrix} \mathbf{S} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix} \begin{bmatrix} \mathbf{V}^\intercal \\ \mathbf{W}^\intercal \end{bmatrix}, \tag{A.2}$$

where $\mathbf{S}$, $\mathbf{V}$, and $\mathbf{W}$ are three matrices. Therefore, the projection matrix $\mathbf{P}^{(1)}$ can be parameterized by a vector $\mathbf{r}$ such that $\mathbf{P}^{(1)s} = \sqrt{3}\,\mathbf{W}\mathbf{r}$, where the factor $\sqrt{3}$ was introduced so that a unit-norm vector $\mathbf{r}$ corresponds to $\|\mathbf{P}^{(1)}\|_2^2 = 3$.
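
As a concrete illustration of this construction, the following NumPy sketch builds the matrix A from the two chosen points, extracts a null-space basis W from its SVD, and maps a unit-norm vector r to the stacked projection matrix. The function names and the row-wise stacking convention assumed for $\mathbf{P}^{(1)s}$ are choices made for the example, not the exact code used in this work.

    import numpy as np

    def basis_parameterization(X1, X2, x1, x2):
        """Null-space basis W of the two-point constraints, so that the stacked
        projection matrix is P1s = sqrt(3) * W @ r for any unit-norm r.
        X1, X2: homogeneous 4-vectors; x1, x2: their (x, y) projections in view 1."""
        rows = []
        for X, (u, v) in [(np.asarray(X1, float), x1), (np.asarray(X2, float), x2)]:
            rows.append(np.concatenate([X, np.zeros(4), -u * X]))
            rows.append(np.concatenate([np.zeros(4), X, -v * X]))
        A = np.vstack(rows)                  # 4 x 12 matrix with A @ P1s = 0
        _, _, Vt = np.linalg.svd(A)
        return Vt[4:].T                      # 12 x 8 orthonormal basis of the null space

    def stacked_projection(W, r):
        r = np.asarray(r, float)
        r = r / np.linalg.norm(r)            # enforce a unit-norm parameter vector
        return np.sqrt(3.0) * W @ r          # stacked P1, rows concatenated

Because the columns of W are orthonormal, a unit-norm r indeed yields a stacked projection matrix of squared norm 3, as stated above.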


APPENDIX B

BUNDLE ADJUSTMENT

The bundle adjustment problem given by Equation 5.5 is solved using an alternating reweighted linear least-squares approach [103]. First, the denominators are factored

out and treated as constant weights, only updated at the end of each iteration.

These weights, denoted $\kappa^{(1)}_{(i)}$, are defined as

$$\kappa^{(1)}_{(i)} \triangleq \left( \begin{bmatrix} \lambda^{(0)}_{(i)} x^{(0)\intercal}_{(i)} & 1 \end{bmatrix} \mathbf{P}^{(1)s}_{9:12} \right)^{-1} \tag{B.1}$$

and initialized to 1. The problem then becomes biquadratic in its parameters:

$$\min_{\{\lambda^{(0)}_{(i)}\},\,\mathbf{P}^{(1)s}} \;\sum_i \kappa_{(i)}^2 \left( \left( \begin{bmatrix} \lambda^{(0)}_{(i)} x^{(0)\intercal}_{(i)} & 1 \end{bmatrix} \left( x^{(1)}_{(i)}\,\mathbf{P}^{(1)s}_{9:12} - \mathbf{P}^{(1)s}_{1:4} \right) \right)^{\!2} + \left( \begin{bmatrix} \lambda^{(0)}_{(i)} x^{(0)\intercal}_{(i)} & 1 \end{bmatrix} \left( y^{(1)}_{(i)}\,\mathbf{P}^{(1)s}_{9:12} - \mathbf{P}^{(1)s}_{5:8} \right) \right)^{\!2} \right)$$

$$\text{such that} \qquad \mathbf{P}^{(1)s} = \sqrt{3}\,\mathbf{W}\mathbf{r}, \qquad \|\mathbf{r}\|_2^2 = 1, \tag{B.2}$$

which is solved by alternately fixing either the projective depths {λ(0)} or the

camera parameters P(1)s and minimizing over the free parameters.

When the projective depths $\{\lambda^{(0)}\}$ are fixed, the problem is equivalent to finding the unit-norm vector $\mathbf{r}$ that minimizes the squared norm of $\mathbf{A}\mathbf{r}$, where matrix $\mathbf{A}$ is obtained by stacking together sub-matrices of the form

$$\sqrt{3}\,\kappa \begin{bmatrix} -\lambda^{(0)} x^{(0)\intercal} & -1 & \mathbf{0}^\intercal & 0 & x^{(1)}\lambda^{(0)} x^{(0)\intercal} & x^{(1)} \\ \mathbf{0}^\intercal & 0 & -\lambda^{(0)} x^{(0)\intercal} & -1 & y^{(1)}\lambda^{(0)} x^{(0)\intercal} & y^{(1)} \end{bmatrix} \mathbf{W}. \tag{B.3}$$

The solution is obtained by taking the SVD of matrix $\mathbf{A}$ and choosing the vector associated with the smallest singular value.

When the camera parameters P(1)s are fixed, the problem is unconstrained and

its Hessian is diagonal. Taking the derivative with regard to a particular {λ(0)}

and setting it to 0 leads to the solution

$$\lambda^{(0)} = -\frac{\mathbf{a}^\intercal \mathbf{b}}{\mathbf{a}^\intercal \mathbf{a}}, \quad \text{where} \quad \mathbf{a} \triangleq \begin{bmatrix} x^{(1)}\,\mathbf{P}^{(1)s\,\intercal}_{9:11} - \mathbf{P}^{(1)s\,\intercal}_{1:3} \\ y^{(1)}\,\mathbf{P}^{(1)s\,\intercal}_{9:11} - \mathbf{P}^{(1)s\,\intercal}_{5:7} \end{bmatrix} x^{(0)}, \quad \mathbf{b} \triangleq \begin{bmatrix} x^{(1)}\,\mathbf{P}^{(1)s}_{12} - \mathbf{P}^{(1)s}_{4} \\ y^{(1)}\,\mathbf{P}^{(1)s}_{12} - \mathbf{P}^{(1)s}_{8} \end{bmatrix}. \tag{B.4}$$
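
The two alternation steps can be summarized in a short NumPy sketch under the conventions above ($\mathbf{P}^{(1)s}$ stored as a 12-vector with the three rows of $\mathbf{P}^{(1)}$ concatenated, and $x^{(0)}$ taken in homogeneous pixel coordinates). The function names and array layout are illustrative assumptions rather than the exact code used in this work.

    import numpy as np

    def update_camera(x0, x1, lambdas, kappa, W):
        """Fix the projective depths and solve for r by SVD (cf. Eq. (B.3))."""
        blocks = []
        for (u0, v0), (u1, v1), lam, k in zip(x0, x1, lambdas, kappa):
            p = lam * np.array([u0, v0, 1.0])          # lambda^(0) x^(0)
            row_x = np.concatenate([-p, [-1.0], np.zeros(4), u1 * p, [u1]])
            row_y = np.concatenate([np.zeros(4), -p, [-1.0], v1 * p, [v1]])
            blocks.append(np.sqrt(3.0) * k * np.vstack([row_x, row_y]) @ W)
        A = np.vstack(blocks)                          # (2N) x 8
        _, _, Vt = np.linalg.svd(A)
        r = Vt[-1]                                     # unit vector of smallest singular value
        return np.sqrt(3.0) * W @ r                    # P1s = sqrt(3) W r

    def update_depths(x0, x1, P1s):
        """Fix the camera and update each projective depth in closed form (cf. Eq. (B.4))."""
        lambdas = np.empty(len(x0))
        for i, ((u0, v0), (u1, v1)) in enumerate(zip(x0, x1)):
            h0 = np.array([u0, v0, 1.0])               # homogeneous x^(0)
            a = np.array([(u1 * P1s[8:11] - P1s[0:3]) @ h0,
                          (v1 * P1s[8:11] - P1s[4:7]) @ h0])
            b = np.array([u1 * P1s[11] - P1s[3],
                          v1 * P1s[11] - P1s[7]])
            lambdas[i] = -(a @ b) / (a @ a)
        return lambdas

    def update_weights(x0, lambdas, P1s):
        """Recompute the weights kappa at the end of an iteration (cf. Eq. (B.1))."""
        return np.array([1.0 / (np.append(lam * np.array([u0, v0, 1.0]), 1.0) @ P1s[8:12])
                         for (u0, v0), lam in zip(x0, lambdas)])

Iterating these updates, with the weights refreshed once per round, reproduces the alternated reweighted scheme described above.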


APPENDIX C

PUBLICATIONS

C.1 Journals

1. M. Maitre, Y. Shinagawa, and M. N. Do, “Joint estimation and encoding

of scalable depth-image-based representations for free-viewpoint rendering,”

IEEE Transactions on Image Processing, submitted for publication.

2. M. Maitre, C. Guillemot, and L. Morin, “3D model-based frame interpolation

for distributed video coding,” IEEE Transactions on Image Processing, vol.

16, pp. 1246-1257, 2007.

3. G. Guetat, M. Maitre, L. Joly, S.-L. Lai, T. Lee, and Y. Shinagawa, “Au-

tomatic 3-D grayscale volume matching and shape analysis,” IEEE Trans-

actions on Information Technology in Biomedicine, vol. 10, pp. 362-376,

2006.

C.2 Conferences

1. M. Maitre and M. N. Do, “Joint encoding of the depth image based represen-

tation using shape-adaptive wavelets,” in ICIP, submitted for publication.

2. M. Maitre, Y. Shinagawa, and M. N. Do, “Symmetric multi-view stereo re-

construction from planar camera arrays,” in CVPR, submitted for publica-

tion.


3. Y. Chen, M. Maitre, and F. Tong, “Transparent layer separation for dual

energy imaging,” in ICIP, submitted for publication.

4. M. Maitre, Y. Chen, and F. Tong, “High-dynamic-range compression using

a fast multiscale optimization,” in ICASSP, to appear.

5. M. Maitre, Y. Shinagawa, and M. N. Do, “Rate-distortion optimal depth

maps in the wavelet domain for free-viewpoint rendering,” in ICIP, 2007.

6. M. Maitre, C. Guillemot, and L. Morin, “3D scene modeling for distributed

video coding,” in ICIP, 2006.

C.3 Research Reports

1. Y. Chen, M. Maitre, and F. Tong, “High-dynamic-range compression us-

ing a fast mean-field gradient remapping with artifact correction,” Siemens

Corporate Research Tech. Rep., 2007.

2. M. Maitre, C. Guillemot, and L. Morin, “Reliable optical-flow estimation for

distributed video coding,” Irisa Res. Rep., 2007.

3. M. Maitre, V. Kindratenko, and Y. Shinagawa, “Laser-assisted 3D scene

reconstruction technique,” NCSA Tech. Rep., 2005.

4. M. Maitre, V. Kindratenko, and Y. Shinagawa, “Passive 3D scene recon-

struction technique,” NCSA Tech. Rep., 2005.


REFERENCES

[1] F. Herbert, Dune. Philadelphia, PA: Chilton Books, 1965.

[2] Jamiroquai, Travelling Without Moving. Sony, B000002BSG, 1997.

[3] J. Battelle, “HP’s HALO: Now this is telepresence,” May 2007. [Online]. Available: http://battellemedia.com/archives/003632.php.

[4] “Wired travel guide: Second life,” 2006. [Online]. Available: http://www.wired.com/wired/archive/14.10/sloverview.html.

[5] M. Kanellos, “Philips: 3D-TV to appear in 2008,” 2006. [Online]. Available: http://news.com.com/Philips+3D+TV+to+appear+in+2008/2100-1041 3-6022254.html.

[6] NVIDIA, “3D stereo,” NVIDIA, Santa Clara, CA, Technical Brief TB-00252-001 v02, 2003.

[7] T. Kanade, P. Rander, and P. Narayanan, “Virtualized reality: Constructing virtual world from real scenes,” IEEE Multimedia, vol. 4, no. 1, pp. 34–46, 1997.

[8] C. Zitnick, S. Kang, M. Uyttendaele, S. Winder, and R. Szeliski, “High-quality video view interpolation using a layered representation,” in Proc. SIGGRAPH, 2004, pp. 600–608.

[9] M. Magnor, P. Ramanathan, and B. Girod, “Multi-view coding for image-based rendering using 3-D scene geometry,” IEEE Trans. on Circuits and Sys. for Video Tech., vol. 13, no. 11, pp. 1092–1106, 2003.

[10] A. Smolic and P. Kauff, “Interactive 3-D video representation and coding,” Proc. of the IEEE, vol. 93, no. 1, pp. 98–110, 2005.

[11] E. Adelson and J. Bergen, “The plenoptic function and the elements of early vision,” in Computational Models of Visual Processing, M. S. Landy and J. A. Movshon, Eds. Cambridge, MA: MIT Press, 1991, pp. 3–20.

[12] D. A. Forsyth and J. Ponce, Computer Vision: A Modern Approach. Lebanon, IN: Prentice Hall, 2002.


[13] B. K. P. Horn, Robot Vision. Cambridge, MA: MIT Press, 1986.

[14] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge, UK: Cambridge University Press, 2000.

[15] O. Faugeras and Q.-T. Luong, The Geometry of Multiple Images. Cambridge, MA: MIT Press, 2001.

[16] Y. Ma, S. Soatto, J. Kosecka, and S. Sastry, An invitation to 3D vision. New York, NY: Springer-Verlag, 2004.

[17] T. Cover and J. Thomas, Elements of Information Theory. Hoboken, NJ: John Wiley and Sons Ltd., 1991.

[18] S. Mallat, A Wavelet Tour of Signal Processing. Burlington, MA: Academic Press, 1999.

[19] D. Taubman and M. Marcellin, JPEG2000: Image Compression Fundamentals, Standards and Practice. New York, NY: Springer-Verlag, 2001.

[20] Y. Q. Shi and H. Sun, Image and Video Compression for Multimedia Engineering. Boca Raton, FL: CRC Press, 1999.

[21] V. Bhaskaran and K. Konstantinides, Image and Video Compression Standards: Algorithms and Architectures. Boston, MA: Kluwer Academic Publishers, 1997.

[22] M. Pharr and G. Humphreys, Physically Based Rendering: From Theory to Implementation. San Francisco, CA: Morgan Kaufmann, 2004.

[23] G. Wolberg, Digital Image Warping. Piscataway, NJ: IEEE Press, 1994.

[24] H.-Y. Shum, S.-C. Chan, and S. B. Kang, Image-Based Rendering. New York, NY: Springer-Verlag, 2007.

[25] D. Scharstein and R. Szeliski, “A taxonomy and evaluation of dense two-frame stereo correspondence algorithms,” Int. J. of Comp. Vis., vol. 47, no. 1–3, pp. 7–42, 2002.

[26] J. W. Shade, S. J. Gortler, L.-W. He, and R. Szeliski, “Layered depth images,” in Proc. SIGGRAPH, 1998, pp. 231–242.

[27] X. Gu, S. J. Gortler, and H. Hoppe, “Geometry images,” in Proc. SIGGRAPH, 2002, pp. 355–361.

[28] S. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski, “A comparison and evaluation of multi-view stereo reconstruction algorithms,” in Proc. CVPR, 2006, pp. 519–528.


[29] G. Vogiatzis, C. C. H. Esteban, P. H. S. Torr, and R. Cipolla, “Multiview stereo via volumetric graph-cuts and occlusion robust photo-consistency,” IEEE Trans. on PAMI, vol. 29, pp. 2241–2246, 2007.

[30] J. Sun, Y. Li, S. Kang, and H. Shum, “Symmetric stereo matching for occlusion handling,” in Proc. CVPR, 2005, pp. 399–406.

[31] Y. Deng, Q. Yang, X. Lin, and X. Tang, “Stereo correspondence with occlusion handling in a symmetric patch-based graph-cuts model,” IEEE Trans. on PAMI, vol. 29, pp. 1068–1079, 2007.

[32] R. I. Hartley, “Theory and practice of projective rectification,” Int. J. of Comp. Vis., vol. 35, pp. 115–127, 1999.

[33] A. Fusiello, E. Trucco, and A. Verri, “A compact algorithm for rectification of stereo pairs,” Machine Vis. and Appli., vol. 12, pp. 16–22, 2000.

[34] N. Ayache and F. Lustman, “Trinocular stereovision for robotics,” IEEE Trans. on PAMI, vol. 13, pp. 73–85, 1991.

[35] C. Sun, “Uncalibrated three-view image rectification,” Im. and Vis. Comp., vol. 21, no. 3, pp. 259–269, 2003.

[36] Y. Wei and L. Quan, “Asymmetrical occlusion handling using graph cut for multi-view stereo,” in Proc. CVPR, 2005, pp. 902–909.

[37] S. B. Kang, R. Szeliski, and J. Chai, “Handling occlusions in dense multi-view stereo,” in Proc. CVPR, 2001, pp. 103–110.

[38] M.-A. Drouin, M. Trudeau, and S. Roy, “Geo-consistency for wide multi-camera stereo,” in Proc. CVPR, 2005, pp. 351–358.

[39] M. Goesele, B. Curless, and S. M. Seitz, “Multi-view stereo revisited,” in Proc. CVPR, 2006, pp. 2402–2409.

[40] P. Gargallo and P. Sturm, “Bayesian 3D modeling from images using multiple depth maps,” in Proc. CVPR, 2005, pp. 885–891.

[41] C. L. Zitnick and S. B. Kang, “Stereo for image-based rendering using image over-segmentation,” Int. J. of Comp. Vis., vol. 75, pp. 49–65, 2007.

[42] M. Jaesik, M. Powell, and K. W. Bowyer, “Automated performance evaluation of range image segmentation algorithms,” IEEE Trans. on Sys., Man and Cyber., vol. 34, pp. 263–271, 2004.

[43] H.-Y. Shum, J. Sun, S. Yamazaki, Y. Li, and C.-K. Tang, “Pop-up light field: An interactive image-based modeling and rendering system,” ACM Trans. on Graphics, vol. 23, pp. 143–162, 2004.


[44] C. Zhang and T. Chen, “View-dependent non-uniform sampling for image-based rendering,” in Proc. ICIP, 2004, pp. 2471–2474.

[45] L. Vincent and P. Soille, “Watersheds in digital spaces: An efficient algorithm based on immersion simulations,” IEEE Trans. on PAMI, vol. 13, pp. 583–598, 1991.

[46] V. Kolmogorov and R. Zabih, “Multi-camera scene reconstruction via graph cuts,” in Proc. ECCV, 2002, pp. 8–40.

[47] C. Fehn, R. Barre, and R. S. Pastoor, “Interactive 3-D TV – concepts and key technologies,” Proc. of the IEEE, vol. 94, no. 3, pp. 524–538, 2006.

[48] W. Matusik and H. Pfister, “3D-TV: A scalable system for real-time acquisition, transmission, and autostereoscopic display,” in Proc. SIGGRAPH, 2004, pp. 814–824.

[49] H.-Y. Shum, S. B. Kang, and S.-C. Chan, “Survey of image-based representations and compression techniques,” IEEE Trans. on Circuits and Sys. for Video Tech., vol. 13, pp. 1020–1037, 2003.

[50] S.-C. Chan, K.-T. Ng, Z.-F. Gan, K.-L. Chan, and H.-Y. Shum, “The plenoptic video,” IEEE Trans. on Circuits and Sys. for Video Tech., vol. 15, no. 12, pp. 1650–1659, 2005.

[51] L. Levkovich-Maslyuk, A. Ignatenko, A. Zhirkov, A. Konushin, I. K. Park, M. Han, and Y. Bayakovski, “Depth image-based representation and compression for static and animated 3-D objects,” IEEE Trans. on Circuits and Sys. for Video Tech., vol. 14, pp. 1032–1045, 2004.

[52] J. Oh, Y.-S. Choi, R.-H. Park, J. Kim, T. Kim, and H. Jung, “Trinocular stereo sequence coding based on MPEG-2,” IEEE Trans. on Circuits and Sys. for Video Tech., vol. 15, no. 3, pp. 425–429, 2005.

[53] R.-S. Wang and Y. Wang, “Multiview video sequence analysis, compression, and virtual viewpoint synthesis,” IEEE Trans. on Circuits and Sys. for Video Tech., vol. 10, pp. 397–410, 2000.

[54] R. Balter, P. Gioia, and L. Morin, “Scalable and efficient coding using 3D modeling,” IEEE Trans. on Multimedia, vol. 8, pp. 1147–1155, 2006.

[55] A. Ortega and K. Ramchandran, “Rate distortion methods in image and video compression,” IEEE Signal Proc. Mag., vol. 15, pp. 23–50, 1998.

[56] P. Ramanathan and B. Girod, “Rate-distortion analysis for light field coding and streaming,” EURASIP Sig. Proc.: Im. Com., vol. 21, pp. 462–475, 2006.


[57] J. Park and H. Park, “A mesh-based disparity representation method for view interpolation and stereo image compression,” IEEE Trans. on Im. Proc., vol. 15, pp. 1751–1762, 2006.

[58] D. P. Bertsekas, Dynamic Programming and Optimal Control. Nashua, NH: Athena Scientific, 2005.

[59] D. Tzovaras and M. G. Strintzis, “Motion and disparity field estimation using rate-distortion optimization,” IEEE Trans. on Circuits and Sys. for Video Tech., vol. 8, pp. 171–180, 1998.

[60] J. Ellinas and M. Sangriotis, “Stereo video coding based on quad-tree decomposition of B-P frames by motion and disparity interpolation,” IEE Proc.-Vis. Im. Sig. Proc., vol. 152, no. 5, pp. 639–647, 2005.

[61] G. J. Sullivan and R. L. Baker, “Efficient quadtree coding of images and video,” IEEE Trans. on Im. Proc., vol. 3, pp. 327–331, 1994.

[62] G. Sullivan and T. Wiegand, “Video compression – from concepts to the H.264/AVC standard,” Proc. of the IEEE, vol. 93, pp. 18–31, 2005.

[63] Y. Yang and S. S. Hemami, “Generalized rate-distortion optimization for motion-compensated video coders,” IEEE Trans. on Circuits and Sys. for Video Tech., vol. 10, pp. 942–955, 2000.

[64] G. M. Schuster and A. K. Katsaggelos, “An optimal quadtree-based motion estimation and motion-compensated interpolation scheme for video compression,” IEEE Trans. on Im. Proc., vol. 7, pp. 1505–1523, 1998.

[65] M. Woo, J. Neider, T. Davis, and D. Shreiner, OpenGL Programming Guide. Lebanon, IN: Addison Wesley, 1999.

[66] W. Press, B. Flannery, S. Teukolsky, and W. Vetterling, Numerical Recipes in C: The Art of Scientific Computing. Cambridge, UK: Cambridge University Press, 1993.

[67] G. Sullivan and T. Wiegand, “Rate-distortion optimization for video compression,” IEEE Signal Proc. Mag., pp. 74–90, 1998.

[68] D. Luenberger, Linear and Nonlinear Programming. Boston, MA: Kluwer Academic Publishers, 2003.

[69] P. J. Burt and E. H. Adelson, “The Laplacian pyramid as a compact image code,” IEEE Trans. on Com., vol. COM-31, no. 4, pp. 532–540, 1983.

[70] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, “Factor graphs and the sum-product algorithm,” IEEE Trans. on Info. Theory, vol. 47, pp. 498–519, 2001.


[71] P. Felzenszwalb and D. Huttenlocher, “Efficient belief propagation for early vision,” in Proc. CVPR, 2004, pp. 41–54.

[72] J. D. Oh and R.-H. Park, “Reconstruction of intermediate views from stereoscopic images using disparity vectors estimated by the geometrical constraint,” IEEE Trans. on Circuits and Sys. for Video Tech., vol. 16, pp. 638–641, 2006.

[73] J. E. Fowler, “QccPack: An open-source software library for quantization, compression, and coding,” in Proc. of SPIE Appli. of Digital Im. Proc., 2000, pp. 294–301.

[74] Y. Morvan, D. Farin, and P. H. N. de With, “Depth-image compression based on an R-D optimized quadtree decomposition for the transmission of multiview images,” in Proc. ICIP, 2007, pp. 105–108.

[75] S. Li and W. Li, “Shape-adaptive discrete wavelet transforms for arbitrarily shaped visual object coding,” IEEE Trans. on Circuits and Sys. for Video Tech., vol. 10, pp. 725–743, 2000.

[76] Y. K. Liu and B. Zalik, “An efficient chain code with Huffman coding,” Pattern Recog., vol. 38, pp. 553–557, 2005.

[77] M. Maitre, Y. Shinagawa, and M. N. Do, “Wavelet-based joint estimation and encoding of depth-image-based representations for free-viewpoint rendering,” IEEE Trans. on Im. Proc., submitted for publication.

[78] J. Slepian and J. Wolf, “Noiseless coding of correlated information sources,” IEEE Trans. on Info. Theory, vol. 19, no. 4, pp. 471–480, 1973.

[79] A. Wyner and J. Ziv, “The rate-distortion function for source coding with side information at the decoder,” IEEE Trans. on Info. Theory, vol. 22, no. 1, pp. 1–10, January 1976.

[80] S. Pradhan and K. Ramchandran, “Distributed source coding using syndromes (DISCUS): Design and construction,” IEEE Trans. on Info. Theory, vol. 49, no. 3, pp. 626–643, March 2003.

[81] A. Aaron and B. Girod, “Compression with side information using turbo codes,” in Proc. IEEE Int. Data Compression Conf., 2002, pp. 252–261.

[82] J. Garcia-Frias and Y. Zhao, “Compression of correlated binary sources using turbo codes,” IEEE Comm. Letters, vol. 5, pp. 417–419, 2001.

[83] J. Garcia-Frias and Y. Zhao, “Data compression of unknown single and correlated binary sources using punctured turbo codes,” in Proc. Allerton Conf. on Com., Contr. and Comput., 2001.


[84] J. Bajcsy and P. Mitran, “Coding for the Slepian-Wolf problem with turbo codes,” in Proc. IEEE Int. Global Com. Conf., 2001, pp. 1400–1404.

[85] A. D. Liveris, Z. Xiong, and C. N. Georghiades, “Compression of binary sources with side information at the decoder using LDPC codes,” IEEE Comm. Letters, vol. 6, pp. 440–442, 2002.

[86] T. Tian, J. Garcia-Frias, and W. Zhong, “Compression of correlated sources using LDPC codes,” in Proc. IEEE Int. Data Compression Conf., 2003, p. 450.

[87] A. Aaron, R. Zhang, and B. Girod, “Wyner-Ziv coding of motion video,” in Proc. Asilomar Conf. on Sig., Sys. and Comp., 2002, pp. 240–244.

[88] A. Aaron, S. Rane, R. Zhang, and B. Girod, “Wyner-Ziv coding for video: Applications to compression and error resilience,” in Proc. IEEE Int. Data Compression Conf., 2003, pp. 93–102.

[89] A. Aaron, S. Rane, E. Setton, and B. Girod, “Transform-domain Wyner-Ziv codec for video,” in Proc. SPIE Conf. on Visual Com. and Im. Proc., 2004, pp. 520–528.

[90] R. Puri and K. Ramchandran, “PRISM: A new robust video coding architecture based on distributed compression principles,” in Proc. Allerton Conf. on Com., Contr. and Comput., 2002.

[91] R. Puri and K. Ramchandran, “PRISM: A new reversed multimedia coding paradigm,” in Proc. ICIP, 2003, pp. 617–620.

[92] R. Puri, A. Majumbar, P. Ishwar, and K. Ramchandran, “Distributed video coding in wireless sensor networks,” Signal Proc., vol. 23, pp. 94–106, 2006.

[93] B. Girod, A. Aaron, S. Rane, and D. Rebollo-Monedero, “Distributed video coding,” Proc. of the IEEE, vol. 93, no. 1, pp. 71–83, January 2005.

[94] P. Ishwar, V. Prabhakaran, and K. Ramchandran, “Towards a theory for video coding using distributed compression principles,” in Proc. ICIP, 2003, pp. 687–690.

[95] J. Repko and M. Pollefeys, “3D models from extended uncalibrated video sequences: Addressing key-frame selection and projective drift,” in Proc. 3-D Digit. Imag. and Model., 2005, pp. 150–157.

[96] F. Galpin and L. Morin, “Sliding adjustment for 3D video representation,” EURASIP J. on Applied Signal Proc., vol. 2002, no. 10, pp. 1088–1101, 2002.

[97] J. Ascenso, C. Brites, and F. Pereira, “Content adaptive Wyner-Ziv video coding driven by motion activity,” in Proc. ICIP, 2006, pp. 605–608.


[98] J. Ascenso, C. Brites, and F. Pereira, “Improving frame interpolation with spatial motion smoothing for pixel domain distributed video coding,” in EURASIP Conf. on Speech and Im. Proc., Multim. Com. and Serv., 2005.

[99] P. Torr and A. Zisserman, “Robust computation and parametrization of multiple view relations,” in Proc. ICCV, 1998, pp. 727–732.

[100] O. Chum, J. Matas, and J. Kittler, “Locally optimized RANSAC,” in Proc. of the 25th DAGM Symp., 2003, pp. 236–243.

[101] P. A. Beardsley, A. Zisserman, and D. W. Murray, “Sequential updating of projective and affine structure from motion,” Int. J. of Comp. Vis., vol. 23, pp. 235–259, 1997.

[102] B. Matei and P. Meer, “A general method for errors-in-variables problems in computer vision,” in Proc. CVPR, 2000, pp. 18–25.

[103] A. Bartoli, “A unified framework for quasi-linear bundle adjustment,” in Proc. ICPR, 2002, pp. 560–563.

[104] C. Harris and M. Stephens, “A combined corner and edge detector,” in Proc. Alvey Vision Conf., 1988, pp. 147–151.

[105] R. Duda, P. Hart, and D. Stork, Pattern Classification. Hoboken, NJ: John Wiley and Sons Ltd., 2001.

[106] S. Birchfield and C. Tomasi, “Depth discontinuities by pixel-to-pixel stereo,” Int. J. of Comp. Vis., vol. 35, pp. 269–293, 1999.

[107] M. Berg, M. Kreveld, M. Overmars, and O. Schwarzkopf, Computational Geometry: Algorithms and Applications. New York, NY: Springer-Verlag, 2000.

[108] R. Hartley, “Chirality,” Int. J. of Comp. Vis., vol. 26, no. 1, pp. 41–61, 1998.

[109] R. Hartley, “Lines and points in three views and the trifocal tensor,” Int. J. of Comp. Vis., vol. 22, no. 2, pp. 125–140, 1997.

[110] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, UK: Cambridge University Press, 2004.

[111] C. Xu and J. L. Prince, “Snakes, shapes, and gradient vector flow,” IEEE Trans. on Im. Proc., vol. 7, pp. 359–369, 1998.

[112] G. Peyre and S. Mallat, “Surface compression with geometric bandelets,” in Proc. SIGGRAPH, 2005, pp. 601–608.


AUTHOR’S BIOGRAPHY

Matthieu Maitre is a Ph.D. candidate at the Department of Electrical and Com-

puter Engineering, University of Illinois at Urbana-Champaign (UIUC). He received his diplome d'ingenieur from the Ecole Nationale Superieure de Telecommunications (ENST) and his M.S. from UIUC, both in 2002. His research interests

include computer vision, image analysis, stereovision, and video compression.
