César Morais Palomo

Interactive image-based rendering for virtual view synthesis from depth images

DISSERTAÇÃO DE MESTRADO

Dissertation presented to the Postgraduate Program in Informatics of the Departamento de Informática, PUC-Rio as partial fulfillment of the requirements for the degree of Mestre em Informática

Advisor: Prof. Marcelo Gattass

Rio de Janeiro, July 2009
César Morais Palomo

Interactive image-based rendering for virtual view synthesis from depth images

Dissertation presented to the Postgraduate Program in Informatics of the Departamento de Informática do Centro Técnico Científico da PUC-Rio, as partial fulfillment of the requirements for the degree of Mestre. Approved by the following commission:

Prof. Marcelo Gattass
Advisor
Departamento de Informática — PUC-Rio

Prof. Alberto Barbosa Raposo
Departamento de Informática — PUC-Rio

Prof. Waldemar Celes Filho
Departamento de Informática — PUC-Rio

Prof. Paulo Cezar Pinto Carvalho
Instituto de Matemática Pura e Aplicada — IMPA

Prof. José Eugênio Leal
Coordinator of the Centro Técnico Científico da PUC-Rio

Rio de Janeiro, July 06, 2009
All rights reserved.
César Morais Palomo

César Morais Palomo graduated from the State University of Campinas with a BSc in Computer Science in 2005. From 2002 to 2007 he worked as a contractor for several companies as a software engineer, designing and developing enterprise and critical systems for industry and the Brazilian government. In 2007 he started the graduate program in Computer Science at PUC-Rio as a master's candidate. In the same year, he joined the Computer Graphics Technology Group (Tecgraf), where he has been developing computer systems in the fields of computer graphics visualization, computer vision, and virtual and augmented reality.
Bibliographic data
Palomo, C. M.

Interactive image-based rendering for virtual view synthesis from depth images / César Morais Palomo; advisor: Marcelo Gattass. — 2009.

52 f: il. ; 29,7 cm

Dissertação (Mestrado em Informática) — Pontifícia Universidade Católica do Rio de Janeiro, Rio de Janeiro, 2009.

Inclui bibliografia.

1. Informática — Teses. 2. Renderização Baseada em Imagens. 3. Composição. 4. Programação em placas gráficas. 5. Mapa de profundidade. I. Gattass, M. II. Pontifícia Universidade Católica do Rio de Janeiro. Departamento de Informática. III. Título.

CDD: 004
Acknowledgments
To my advisor Professor Marcelo Gattass for the support, the invaluable
attention and the incentive for the realization of this work.
To Ivana, whose grace and encouragement during difficult times were
essential for the completion of this work.
To my mother and father, my brother and sister.
To my nephews Nicolas and Felipe for bringing me so much joy every
moment I spend with them.
To my colleagues at Tecgraf, who helped me whenever a doubt arose.
To the people of the Computer Science department for their assistance.
To CNPq and PUC–Rio for the financial support.
Abstract
Palomo, C. M.; Gattass, M. Interactive image-based rendering for virtual view synthesis from depth images. Rio de Janeiro, 2009. 52p. MSc. Dissertation — Departamento de Informática, Pontifícia Universidade Católica do Rio de Janeiro.
Image-based modeling and rendering has been a very active research topic
as a powerful alternative to traditional geometry-based techniques for image
synthesis. In this area, computer vision algorithms are used to process and
interpret real-world photos or videos in order to build a model of a scene,
while computer graphics techniques use this model to create photorealistic
images based on the captured photographs or videos.
The purpose of this work is to investigate rendering techniques capable of
delivering visually accurate virtual views of a scene in real time. To
guarantee the algorithm's interactive performance, besides applying
optimizations to existing rendering methods, we make intensive use of the
GPU for geometry and image processing.
Even though this work is mainly focused on the rendering task, without the
reconstruction of the depth map, it implicitly overcomes common errors in
depth estimation, yielding virtual views with an acceptable level of realism.
Tests with publicly available datasets are also presented to validate our
framework and to illustrate some limitations in the IBR general approach.
Keywords
Image-based rendering. Blending. GPU programming. Depth images.
Resumo
Palomo, C. M.; Gattass, M. Renderização interativa baseada em imagens para síntese de vistas virtuais a partir de mapas de cor e profundidade. Rio de Janeiro, 2009. 52p. Dissertação de Mestrado — Departamento de Informática, Pontifícia Universidade Católica do Rio de Janeiro.
Modelagem e renderização baseadas em imagem têm sido uma área de pesquisa muito ativa nas últimas décadas, tendo recebido grande atenção como uma alternativa às técnicas tradicionais de síntese de imagens baseadas primariamente em geometria. Nesta área, algoritmos de visão computacional são usados para processar e interpretar fotos ou vídeos do mundo real a fim de construir um modelo representativo de uma cena, ao passo que técnicas de computação gráfica são usadas para tirar proveito desta representação e criar cenas foto-realistas.

O propósito deste trabalho é investigar técnicas de renderização capazes de gerar vistas virtuais de alta qualidade de uma cena, em tempo real. Para garantir a performance interativa do algoritmo, além de aplicar otimizações a métodos de renderização existentes, fazemos uso intenso da GPU para o processamento de geometria e das imagens para gerar as imagens finais.

Apesar de o foco deste trabalho ser a renderização, sem reconstruir o mapa de profundidade a partir das fotos, ele implicitamente contorna possíveis problemas na estimativa da profundidade para que as cenas virtuais geradas apresentem um nível aceitável de realismo.

Testes com dados públicos são apresentados para validar o método proposto e para ilustrar deficiências dos métodos de renderização baseados em imagem em geral.
Palavras–chave
Renderização Baseada em Imagens. Composição. Programação em placas gráficas. Mapa de profundidade.
Contents
1 Introduction 11
2 Related Work 14
2.1 IBR real applications 14
2.2 Static scenes 16
2.3 Dynamic scenes 18

3 Rendering depth images 22
3.1 Image acquisition process 22
3.2 Depth image representation 24
3.3 3D warping 25
3.4 Artifacts inherent to 3D warping 26

4 Compositing 29
4.1 Virtual camera navigation 29
4.2 Blending algorithm 31
4.3 Limitations 32

5 IBR on the GPU 35
5.1 Conceptual algorithm 35
5.2 Creation of view-dependent geometry 36
5.3 Warping to novel viewpoint and occlusions identification 37
5.4 Compositing on the GPU 38

6 Results 42
6.1 Rendering quality 42
6.2 Limitations 45
6.3 Time-performance analysis 46
6.4 Summary 47
7 Conclusion and Future works 48
Bibliography 49
List of Figures
1.1 Skin and hair rendering are examples of challenges to purely geometry-based approaches. 11
1.2 Plenoptic function. 12

2.1 Bullet-time effect shows the necessity of counterbalancing the number of input cameras and quality of rendered images. 15
2.2 IBR goals: establish mapping between representation and image screen, and blend. 16
2.3 Layered Depth Images [30]. Input images (left) used to generate the layered representation of a scene (top right). It allows for reconstruction of views free from disocclusion problems (bottom). 17
2.4 View-dependent texture mapping [11]. Input images are projected onto reconstructed architectural model, and assembled to form a composite rendering. Top two pictures show images projected onto model, lower left shows results of blending those two renderings, and lower right shows final result of blending a total of 12 original images. 18
2.5 Kanade et al's Virtualized Reality geodesic dome [17]. 19
2.6 Goldlucke results [13]. The regular triangular mesh causes inaccurate appearance in the vicinity of depth discontinuities. 20
2.7 Camera setup in Zitnick et al [38]. Eight cameras are used to capture 1024x768 images, synchronized with commissioned PtGrey concentrator units. 21
2.8 Rendering results for Zitnick et al [38]: (a) main layer M from one view rendered, with depth discontinuities erased; (b) boundary layer B rendered; (c) main layer M for other view rendered; (d) final blended result. 21

3.1 Imaging process in the pinhole camera model: image formation is a sequence of transformations between coordinate systems. We ignore radial distortion here. 22
3.2 Example of a depth image: color image + dense depth map. Darker pixels in depth map mean greater depth. Courtesy of Zitnick et al [38]. 24
3.3 3D warping process. First, 3D mesh generated using depth map from reference camera Ci is unprojected to global coordinate system. Then, mesh is projected into a virtual view using camera Cvirtual projection matrix. 25
3.4 3D warping result. Depth image (pair of images to the left) from a reference camera is projected into a virtual view (right image) using the described 3D warping process. In this case, virtual camera Cvirtual was placed slightly to the right of reference camera Ci. 27
3.5 3D warping artifacts due to discontinuity in depth. The continuity assumption in depth map does not hold at objects' boundaries, which causes undesirable artifacts in the form of stretched triangles in warped view. When depth map (left) is warped to new viewpoint, the region to the right of the ballet dancer reveals occluded areas. 27
3.6 3D warping to a new viewpoint, with occluded regions drawn in black thanks to our labeling schema. 28

4.1 Example of cameras setup: input cameras arranged along a 1D arc. 30
4.2 Navigation of virtual camera Cvirtual restricted to the lines linking centers of adjacent pairs of input cameras. 30
4.3 Virtual camera's parameters can be determined as a linear interpolation of adjacent pair of cameras. 30
4.4 Blending algorithm. Two reference views Ci and Ci+1, adjacent to virtual viewpoint Cvirtual, are used in composition stage. 31
4.5 Cases considered in the compositing process. 33
4.6 Frontier r between regions A and B may become undesirably visible due to the stepped behavior of our compositing algorithm and photometric (e.g. gain) differences in cameras used for capture. 33
4.7 Visible seams between occlusion and blended areas, to the right and to the left of woman's head. 34

5.1 Conceptual algorithm for novel view synthesis. 35
5.2 Vertex-buffer object used for improving performance during mesh generation. 37
5.3 Reference view rendering into a FBO, using vertex and fragment shaders. 38
5.4 Influence of angular distance in reference camera's contribution during compositing. 39
5.5 Penalty for pixels marked occluded based on angular distance. 40

6.1 Sample data from Ballet (left) and Breakdancers (right) sequences: color images on top, and corresponding depth images below [38]. 43
6.2 Synthesized images for one frame of ballet sequence. Left and middle columns respectively correspond to cameras 5 and 4 warping results, and right column is the final result. First row: occlusion areas not identified and rubber sheets appear. Second row: we apply the proposed labeling approach. 43
6.3 Synthesized images for one frame of breakdancers sequence. Left and middle columns respectively correspond to cameras 3 and 2 warping results, and right column is the final result. First row: occlusion areas not identified and rubber sheets appear. Second row: we apply the proposed labeling approach. 43
6.4 Virtual camera positioned coincidently with camera 5's position in ballet sequence, and cameras 5 and 6 used as reference cameras. Left column: original image. Middle column: synthetic image. Right column: differences between real and synthetic images. Negated for ease of printing: equal pixels in white. 44
6.5 Virtual camera positioned coincidently with camera 5's position in ballet sequence, and cameras 5 and 6 used as reference cameras. Left column: original image. Middle column: synthetic image. Right column: differences between real and synthetic images. Negated for ease of printing: equal pixels in white. 44
6.6 Close-up of rendering result for frame 48 in ballet sequence. Left column: original photo from camera 7. Middle column: estimated depth map. Right column: visible seams below dancer caused by wrong depth estimates. 45
6.7 Close-up of rendering result for frame 88 in breakdancers sequence. Left column: original photo from camera 7. Middle column: estimated depth map. Right column: visible seams close to dancer's foot caused by wrong depth estimates. 45
6.8 Close-ups of rendering artifacts (shadowing) for frame 48 in ballet sequence and frame 88 in breakdancers sequence. 46
6.9 Close-ups of rendering artifacts (due to matting absence) for frame 48 in ballet sequence and frame 88 in breakdancers sequence. 46
6.10 Render time × input images' number of pixels. 47
1
Introduction
One of the primary goals in Computer Graphics (CG) is photorealistic
rendering. CG tries to solve a well-defined problem: given the geometry,
material, lighting and shading information for a virtual scene, create an image
that looks as close as possible to one that a camera would capture of a real
version of the described scene. However, despite all the advancements in more
classical areas of CG, it is still hard to compete with images of real scenes.
This limitation on quality is inherent to geometry-based rendering processes.
Even when high-quality rendering techniques, such as ray tracing, are used
to mimic the real-world physics involved in objects' interaction with
illumination, the synthesized image's quality is still limited by the model
description and by the approximations of the physical models describing light
reflection and transmission. For instance, realistic rendering of hair and
skin remains a challenge to the CG community to this day.
Image-based rendering (IBR) is a powerful alternative to traditional
geometry-based techniques for image synthesis. The main idea is to use
images rather than geometry as the main primitives for rendering novel views.
Computer vision (CV) algorithms are used to extract a model from existing
images and videos, a process called image-based modeling. Model and images
1.1(a): Skin rendering example (image courtesy of Craig Donner and Henrik Wann Jensen [12]).
1.1(b): Hair rendering example [25].
Figure 1.1: Skin and hair rendering are examples of challenges to purely geometry-based approaches.
Figure 1.2: Plenoptic function.
then work as input for rendering methods that can take advantage of real-world
samples of the scene's radiance and lighting properties, potentially giving
synthesized images higher visual accuracy.
Ideally, the obtained model describing a scene would be equivalent to the
plenoptic function [3]:

P7(θ, φ, λ, t, Vx, Vy, Vz)    (1-1)
To measure this 7D function, one can imagine placing a pinhole camera’s
center at every 3D location (Vx, Vy, Vz) at every possible angle (θ, φ), for
every wavelength λ, at every time instant t. Indeed, image-based modeling
and rendering can be defined as a means of sampling the plenoptic function,
representing it in a compact and useful manner, compressing all this data and
rendering novel views from it. In other words, IBR can be viewed as a set of
techniques to reconstruct a continuous representation of the plenoptic function
using observed samples as input.
The generation of novel views from acquired images is motivated by
several applications in computer games, sports broadcasting, TV advertising,
cinema and the entertainment industry. In the case of an ambiguous play in a
soccer game, for instance, many input views may be used to synthesize a new
view at a different angle to help referees check for events such as fouls or
offsides.
The goal of this work is to develop a method of rendering synthetic
novel views of a scene captured by real cameras, capable of generating visually
accurate images at interactive rendering rates. Depth images, i.e. color images
along with their dense depth maps, are used as the sole input of our algorithm.
Results demonstrate the efficiency and quality of the proposed system.
In short, our method has the following characteristics:
– Real-time performance for virtual view synthesis.
– Visually accurate synthesized views: rendered images have quality
comparable to the input photos, with few visible artifacts.
– Smooth transitions between views: visible changes when moving from
one viewpoint to another are not easily noticeable.
The main contribution of this work is an IBR method running entirely
on the GPU. That not only guarantees good performance but also leaves the
CPU free to perform other tasks, such as input video decoding. An
additional contribution is that our method depends solely on depth images as
input, without any pre-processing stage.
This document is organized as follows. Chapter 2 presents a review of
related research in IBR, delineating reasons for our design choices. Chapters 3
and 4 present the basics on depth image representation and how images can
be composited for virtual synthesis. Chapter 5 describes how the proposed
method modifies existing techniques to better suit rendering at the GPU.
Results and performance numbers are depicted in Chapter 6, where publicly
available datasets are used to test our algorithm. Finally, in Chapter 7 we
conclude and present future work directions to further improve and extend
our architecture.
2
Related Work
In this chapter we present a brief history of IBR and review relevant
research results for free viewpoint both in static and dynamic scenes. These
works present different strategies for model acquisition, representation and
rendering. However, the review focuses especially on model representations
and corresponding rendering techniques, since our main goal is visually
accurate rendering at interactive frame rates. We start with real applications
that greatly helped spur IBR research.
2.1
IBR real applications
Perhaps one of the most popular uses of IBR was the bullet time effect in the
1999 movie The Matrix by Warner Bros [1]. The technique used still cameras
surrounding an object in a predefined array, forming a complex curve in space,
triggered sequentially or simultaneously. Then, individual frames taken from
each of the still cameras were arranged and displayed consecutively to produce
an orbiting viewpoint of an action frozen in time or in hyper-slow motion.
Although the technique used in The Matrix, in theory, allowed for
limitless perspectives and variable display frame rates with a virtual camera,
those perspectives were limited to the predefined camera paths. Besides, many
input cameras and man-hours were necessary to make the virtual camera fly-
through smooth and realistic.
But it was more than a decade before The Matrix, in the early 1980s,
that the freeze-frame effect was first demonstrated by Tim Macmillan's Time-
slice [20]. An early version consisted of 360 pinhole film cameras arranged in
a circle, looking towards its center, where the subject was positioned.
Filming was done in the dark, using a flash. A later version reduced the number
of cameras to 120, covering 90◦.
Another similar approach was used by Dayton Taylor’s Timetrack system
to produce commercials in 1995 [34]: the illusion of moving through a frozen
slice of time was produced by rapidly jumping between different still cameras
arranged along a path, just as would be done some years later in The Matrix.

2.1(a): Input cameras arranged in a predefined array.
2.1(b): Frozen-time frame in a fly-through around the character.
Figure 2.1: Bullet-time effect shows the necessity of counterbalancing the number of input cameras and quality of rendered images.

Also in 1995, Michel Gondry's "Like a Rolling Stone" music video
innovated by using morphing between adjacent cameras rather than just
jumping from one to another.
The freeze frame/bullet time effect attracted interest from the research
community. But the earliest works in IBR unsurprisingly focused on dealing
with static scenes. Pioneering works include Chen and Williams' View
Interpolation [8], Chen’s QuickTime VR [7], McMillan and Bishop’s Plenoptic
Modeling [23], Levoy and Hanrahan’s Light Field Rendering [19], Gortler et
al’s Lumigraph [14].
An even more promising application of the method is Free Viewpoint
TV (FTV) [33]: multi-view video and multi-view depth would be broadcast,
allowing a free-viewpoint experience for the final spectator. Since December
2001, MPEG has been working on the exploration of 3D Audio-Visual (3DAV)
technology, and has since received strong support from TV industry
organizations for FTV standardization.
Those works may differ in the number of image samples necessary for
obtaining good rendering results, in their representation of the scene, and in
the rendering algorithm itself. However, all of them share the general goals
of IBR depicted in Figure 2.2: create a representation linked to images of the
acquired scene, and composite views to create a new one.
Although early image-based representations based solely on image samples,
like Light Field Rendering and panoramas, require very simple rendering
techniques, a great number of input samples is necessary. Later on, more
sophisticated representations were proposed to deal with the trade-off
between images and geometry, and rendering techniques changed accordingly.
Figure 2.2: IBR goals: establish mapping between representation and image screen, and blend.
2.2
Static scenes
By removing time t and light wavelength λ, in 1995 McMillan and Bishop
[23] introduced the concept of Plenoptic Modeling, with the 5D version of the
plenoptic function P5(Vx, Vy, Vz, θ, φ). An even simpler representation is the
2D panorama, where the viewpoint is fixed (P2(θ, φ)). It can be cylindrical, as
in 1995 Chen’s Quicktime VR [7], or spherical, as in 1997 Szeliski and Shum’s
work [32].
Levoy and Hanrahan's 1996 Light Field Rendering system [19] constrains
the plenoptic function to a bounding box, thus representing it as a
4-dimensional function. Rays are interpolated assuming that the scene surface
is close to a focal plane. Object surfaces located far away from the focal
plane appear blurred in interpolated views.
The Lumigraph system [14], proposed in 1996, uses a similar rendering
method, also restricted to a bounding box. However, rather than Light Field's
single focal plane, it uses an approximation of the 3D object surface to
reduce the blur problem. Still, a huge number of input images is necessary
for high-quality rendering.
Chen and Williams’ 1993 View Interpolation method [8] makes use of
implicit geometry to reconstruct arbitrary viewpoints given two input images
and dense optical flow between them. The method works well when input views
are close by. Otherwise, the overlapping parts may become too small, impairing
the dense optical flow computation.
Also using implicit geometry, Seitz and Dyer's 1996 View Morphing
technique [29] reconstructs any viewpoint on the line linking the two optical
centers of the original cameras. Intermediate views are exactly linear
combinations of the two views, given that the camera motion is perpendicular
to the camera viewing direction.
The aforementioned works either require a large number of images for
rendering (methods that do not rely on geometry) or require very accurate
image registration (methods that use implicit geometry) for high-quality
virtual synthesis. Those limitations can be overcome through the use of
explicit 3D information, encoded either in the form of 3D coordinates or of
depth along lines of sight.

Figure 2.3: Layered Depth Images [30]. Input images (left) used to generate the layered representation of a scene (top right). It allows for reconstruction of views free from disocclusion problems (bottom).
In 1999 McMillan [22] argued that 3D warping techniques can be used to
render new viewpoints when depth information is available for every point in
the images. This is accomplished by unprojecting pixels of the original images
to their proper 3D locations, and subsequently reprojecting them onto the new
viewpoint. A side effect of this method is the appearance of holes in the
warped image.
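A minimal sketch of this unproject/reproject step, assuming a pinhole model with hypothetical intrinsics and, for brevity, a purely translational virtual camera (a full 3D warp would also apply the virtual camera's rotation):

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Pixel (u, v) with depth -> 3D point in the reference camera frame
    (pinhole model, radial distortion ignored)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def project(point, fx, fy, cx, cy):
    """3D point in a camera frame -> pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)

def warp_pixel(u, v, depth, K_ref, K_virt, t):
    """Warp one pixel from the reference view into a virtual view that is
    a pure translation t of the reference camera. K_ref and K_virt are
    (fx, fy, cx, cy) tuples; all values here are illustrative."""
    X = unproject(u, v, depth, *K_ref)
    X_virt = (X[0] - t[0], X[1] - t[1], X[2] - t[2])
    return project(X_virt, *K_virt)
```

Moving the virtual camera to the right (positive tx) shifts unoccluded content to the left in the warped image; destination pixels that no source pixel maps to are exactly the holes discussed above.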
Differences in sampling resolution (as in the case of zooming in) and
disocclusions, i.e. depth discontinuities, are the causes of hole generation.
Splatting [15] has proved to be enough to fill holes introduced by sampling
differences, but it cannot deal with disocclusions.
Shade et al's 1998 Layered Depth Images (LDIs) [30] proposed storing
depth information not only for what is visible in the input image, but also
for everything behind the visible surface. In other words, each pixel in the
input image contains a list of depth and color values. The correct position
in that list can be retrieved and used depending on the new viewpoint's
position. This layered representation can be seen in Figure 2.3.
Another use of explicit geometry in IBR is view-dependent texture
mapping (VDTM), proposed in 1996 by Debevec et al [11] and depicted in
Figure 2.4. It consists of texture-mapping 3D models of a reconstructed
architectural environment by warping and blending several input images of
that environment. The technique was later improved by Debevec et al [10],
in 1998, to reduce computational cost and to allow for smooth blending. The
main advantage of that approach is the usage of projective texture mapping,
which boosts performance through graphics hardware.
Regarding the composition process, the Unstructured Lumigraph [5],
proposed in 2001 by Buehler et al, presents a very detailed analysis of how
textures can be blended based on relative angular position, resolution, and
field of view. It is a valuable reference for more principled and visually
accurate composition.

Figure 2.4: View-dependent texture mapping [11]. Input images are projected onto reconstructed architectural model, and assembled to form a composite rendering. Top two pictures show images projected onto model, lower left shows results of blending those two renderings, and lower right shows final result of blending a total of 12 original images.
Finally, Kang and Szeliski [18] introduced in 2004 the idea of using not
only view-dependent textures but also view-dependent geometries to deal with
non-Lambertian surface properties. Warped depth images are blended to
produce new views that reproduce the original non-rigid effects very
effectively.
Further research has explored how to handle such non-rigid effects, and
the works presented in this section have been successfully adapted to the
more intriguing task of rendering dynamic scenes with IBR.
2.3
Dynamic scenes
As mentioned in the previous section, the bullet-time/freeze-frame effect
is a very popular application of IBR to dynamic scenes, and its popularity
helped spur IBR research in the pursuit of free viewpoint, in what is called
video-based rendering (VBR) [21].
Extending IBR techniques to dynamic scenes with arbitrary viewpoint
selection while the scene is changing is not trivial, although its application
is extremely attractive. The associated problems are twofold. First, there are
hardware-related issues such as camera synchronization, calibration, and image
acquisition and storage. Decreasing hardware costs and technology improvements
helped make the capture and subsequent processing of dynamic scenes more
practical. Second, it is difficult to achieve automatic generation of seamless
interpolation between views for arbitrary scenes. Proposed techniques must
deal with those difficulties to achieve high-quality rendering in reasonable
time.

Figure 2.5: Kanade et al's Virtualized Reality geodesic dome [17].
One of the earliest VBR systems is Kanade et al's 1997 Virtualized
Reality [17]. Their architecture involved 51 cameras arranged around a 5-meter
geodesic dome, as shown in Figure 2.5. Cameras captured 640x480 video at
30 fps. An important aspect of their work is the two-step video acquisition:
real-time recording and an offline digitization step. Virtualized Reality
computed a dense stereo depth map for each camera, used as view-dependent
geometry for view synthesis. A first version of the system used the closest
reference view as a basis and two other neighboring cameras for hole filling,
while a second version merged the depth maps into a single model to be
textured with multiple reference views. A version named Eyevision was
successfully used commercially by CBS Television at Super Bowl XXXV in 2001,
with more than 30 cameras involved.
Vedula et al [36] extended Virtualized Reality in 2005 by employing
spatio-temporal view interpolation. It explicitly recovered 3D scene shape at
every time frame and also 3D scene flow (local instantaneous 3D non-rigid
temporal deformation). A voxelization algorithm was used for both 3D shape
extraction and rendering. For novel view generation, ray-casting along with
blending weights were used. Weights were a combination of temporal and
spatial proximity to the novel viewpoint.
The Stanford Light Field Camera was proposed initially in 2002 with 6 input
cameras [37]. It was later extended in 2004 to a system with 128 CMOS
cameras [35], designed around the IEEE 1394 high-speed serial bus (FireWire).
The cameras are capable of acquiring 640x480 video at 30 fps, with 8:1 MPEG
compression.
Goldlucke et al [13] in 2002 used a subset of the Stanford Light Field Camera
for acquiring and displaying dynamic scenes. In their work, camera calibration
is performed to estimate extrinsic and intrinsic parameters, to reduce radial
distortion and also to reduce color and brightness variation across cameras.
Depth maps are obtained through depth from stereo.

Figure 2.6: Goldlucke results [13]. The regular triangular mesh causes inaccurate appearance in the vicinity of depth discontinuities.

After depth estimation for all images and timeframes, interactive rendering is achieved by
employing 3D warping. A regular, downsampled triangle mesh is created
covering each of the input depth images. A vertex program is used for warping
to the novel view, and the composition of 4 different reference views is done
through weights based on proximity to the novel view: the closer the input
image, the higher its weight.
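A sketch of such proximity-based weighting, here measured as the angle between the rays from a scene point to each reference camera center and to the virtual camera center (a simplification of the cited schemes; all names are ours):

```python
import math

def angular_distance(c_ref, c_virt, point):
    """Angle between the rays from a scene point to the reference and
    virtual camera centers."""
    def direction(c):
        v = [c[i] - point[i] for i in range(3)]
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]
    a, b = direction(c_ref), direction(c_virt)
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    return math.acos(dot)

def blend_weights(ref_centers, c_virt, point, eps=1e-6):
    """Normalized contribution weights: the smaller a reference view's
    angular distance to the virtual view, the larger its weight; eps
    avoids division by zero when the views coincide."""
    inv = [1.0 / (angular_distance(c, c_virt, point) + eps)
           for c in ref_centers]
    s = sum(inv)
    return [w / s for w in inv]
```

When the virtual camera coincides with one reference camera, that camera's weight approaches 1, so the synthesized view degrades gracefully to the captured photo.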
They report a frame rate of 11 fps with a mesh resolution of 160x120.
Figure 2.6 shows the triangular mesh superimposed on a reference view's
depth map. Since the triangular mesh is continuous and regular, the appearance
in the vicinity of large depth discontinuities is usually incorrect. In fact,
the mesh downsampling generates an unpleasant blurring effect at object
boundaries.

A great improvement in rendering quality is presented by Zitnick et al [38].
Although their system is quite modest in size, with only 8 cameras,
higher-resolution images (1024x768) are captured at 15 fps.
Photorealism is achieved using a two-layer representation inspired by Layered
Depth Images [30], mentioned in the previous section.
Their system calculates a dense depth map for each input color image
with their proposed algorithm. After that, they divide the scene
representation into two layers: a boundary layer B, around depth
discontinuities, and a main layer M. To generate this representation, a
variant of Bayesian matting [9] is used to automatically estimate foreground
and background colors, depths and opacities around depth discontinuities.
System configuration is shown in Figure 2.7.

Figure 2.7: Camera setup in Zitnick et al [38]. Eight cameras are used to capture 1024x768 images, synchronized with commissioned PtGrey concentrator units.

Figure 2.8: Rendering results for Zitnick et al [38]: (a) main layer M from one view rendered, with depth discontinuities erased; (b) boundary layer B rendered; (c) main layer M for the other view rendered; (d) final blended result.

At rendering time, the two reference views nearest to the novel view are chosen,
warped using a custom vertex shader into separate buffers, and finally blended
by a custom fragment shader that computes contribution weights based on the
angular proximity of each reference view to the novel view. Their system involves
both offline and real-time phases. Computation of depth maps, identification of
boundaries and matting in those areas, compression and storage are offline
processes. Decoding and rendering are done in real time, with a reported
performance of 5 fps for 1024x768 images. It yields the best results among all
mentioned VBR systems, with examples of generated views depicted in Figure
2.8.
Our rendering method also relies on 3D warping and blending of a pair
of reference views, but assumes that depth maps are previously calculated: we
focus on the rendering stage, not dealing with depth map estimation.
The objective of this work is to completely avoid offline processes such as
matting, while still yielding high-quality rendering of virtual views. We intend to
use solely depth images (color image + depth map) as input for our algorithm.
Our contribution is a set of techniques for warping and blending views which
run entirely on the GPU.
3
Rendering depth images
Before describing the proposed rendering method, we present the basic
concepts involved in using depth images for 3D warping and novel view
synthesis. We start with a review on image acquisition process in Section
3.1. In Section 3.2 we detail depth images as a scene’s explicit geometry
representation. In Section 3.3 we explain the 3D warping process in the view-
dependent texture-mapping (VDTM) technique context. Finally, we conclude
in Section 3.4 by listing some problems associated with 3D warping and how they
can be identified for later treatment.
3.1
Image acquisition process
When a physical device, i.e. an input camera, is used to acquire a
photograph of a scene, we can simplify the image acquisition process using
a pinhole camera model [16]. In this model, the imaging process is a sequence
Figure 3.1: Imaging process in the pinhole camera model: image formation is a sequence of transformations between coordinate systems. We ignore radial distortion here.
of transforms between different coordinate systems, as depicted in Figure 3.1.
The global coordinate system (or GCS) relates objects in 3D space,
defining their relative positions and poses. Points Pw(Xw, Yw, Zw, 1) in the GCS
are defined using homogeneous coordinates.
Each input camera has associated calibration data, which identifies
the camera's pose and intrinsic properties. It consists of two matrices: the view
matrix V4,4 and the calibration matrix K3,4 [16].
Matrix V gives the camera’s position (given by a translation vector t)
and pose (given by a rotation matrix R3,3) according to GCS:
V = \begin{pmatrix}
r_{11} & r_{12} & r_{13} & t_1 \\
r_{21} & r_{22} & r_{23} & t_2 \\
r_{31} & r_{32} & r_{33} & t_3 \\
0 & 0 & 0 & 1
\end{pmatrix}
Matrix K contains the camera's intrinsic properties used for perspective
projection [16]. It consists of the focal length (f), aspect ratio (a), skew factor
(s) and image optical center (x0, y0):
K = \begin{pmatrix}
f & s & x_0 & 0 \\
0 & af & y_0 & 0 \\
0 & 0 & 1 & 0
\end{pmatrix}
So, to express a point Pw relative to the camera coordinate system (or
CCS), i.e. to obtain Pc(Xc, Yc, Zc, Wc), a transform between two orthogonal
references in tridimensional space must be done, using matrix V:

P_c^T = \begin{pmatrix} X_c & Y_c & Z_c & W_c \end{pmatrix}^T = V \begin{pmatrix} X_w & Y_w & Z_w & 1 \end{pmatrix}^T \quad (3-1)
To finally obtain point Pi(wXi, wYi, w) in image coordinate system (or
ICS), a perspective projection is done using calibration matrix K:
P_i^T = \begin{pmatrix} wX_i & wY_i & w \end{pmatrix}^T = K \begin{pmatrix} X_c & Y_c & Z_c & W_c \end{pmatrix}^T \quad (3-2)
Equation 3-3 summarizes the process of converting a 3D point in global
coordinate system into image coordinate system:
P_i^T = \begin{pmatrix} wX_i \\ wY_i \\ w \end{pmatrix} =
\begin{pmatrix}
f & s & x_0 & 0 \\
0 & af & y_0 & 0 \\
0 & 0 & 1 & 0
\end{pmatrix}
\begin{pmatrix}
r_{11} & r_{12} & r_{13} & t_1 \\
r_{21} & r_{22} & r_{23} & t_2 \\
r_{31} & r_{32} & r_{33} & t_3 \\
0 & 0 & 0 & 1
\end{pmatrix}
\begin{pmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{pmatrix} \quad (3-3)
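As a minimal numeric sketch of this chain in plain Python, the matrices below are purely illustrative (identity pose, focal length 500, unit aspect ratio, zero skew, optical center (320, 240)), not taken from any real calibration:

```python
def mat_vec(M, v):
    # Multiply matrix M (list of rows) by column vector v.
    return [sum(M[r][c] * v[c] for c in range(len(v))) for r in range(len(M))]

def project(K, V, Pw):
    # GCS -> CCS with view matrix V (Equation 3-1), then CCS -> ICS with K
    # (Equation 3-2), finishing with the perspective divide by w.
    Pc = mat_vec(V, Pw)           # camera coordinates
    wXi, wYi, w = mat_vec(K, Pc)  # homogeneous image coordinates
    return (wXi / w, wYi / w)

# Hypothetical calibration: identity pose, f = 500, a = 1, s = 0, center (320, 240).
V = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
K = [[500, 0, 320, 0], [0, 500, 240, 0], [0, 0, 1, 0]]
print(project(K, V, [0.0, 0.0, 2.0, 1.0]))  # point on the optical axis -> (320.0, 240.0)
```

A point on the optical axis lands exactly on the optical center, a quick sanity check of the matrices.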
Figure 3.2: Example of a depth image: color image + dense depth map. Darker pixels in the depth map mean greater depth. Courtesy of Zitnick et al [38].
3.2
Depth image representation
When a conventional digital camera is used to take a photograph of a
scene, only the color information is acquired and stored. Depth can be acquired
by special range sensing devices or commercial depth cameras like ZCam [2],
or alternatively through algorithms for depth recovery from stereo. A good
review on depth estimation techniques is presented by Szeliski [28].
This representation composed of a color image and a dense depth map,
with one associated depth value for each pixel in the image, is called depth
image. An example is shown in Figure 3.2.
Depth stored in the depth map can be relative to any selected coordinate
system. Let us assume that depth is written in global coordinate system. It
means that each pixel p(Xi, Yi) in a depth map stores an associated Zw value.
One detail to notice is that depth maps are usually gray-level images,
commonly in 8-bit format. For that reason, a linearization of the actual depth
Zw into a depth level d is usually applied before storage. Equation 3-4 shows
how depth can be linearized to generate a value d in the range [0, 255], using
the minimum and maximum Zw values in each depth map, minZw and maxZw
respectively.
d = 255 \, \frac{\frac{1}{maxZ_w} - \frac{1}{Z_w}}{\frac{1}{maxZ_w} - \frac{1}{minZ_w}} \quad (3-4)
After fetching pixel p(Xi, Yi) in a depth map, with depth level d in range
[0, 255], actual depth Zw can be retrieved using the inverse of equation 3-4:
Z_w = \frac{1}{\frac{d}{255}\left(\frac{1}{minZ_w} - \frac{1}{maxZ_w}\right) + \frac{1}{maxZ_w}} \quad (3-5)
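The pair of equations can be checked with a small round-trip in plain Python; the near and far planes below are illustrative values, not taken from any dataset:

```python
def pack_depth(Zw, minZw, maxZw):
    # Equation 3-4: linearize actual depth Zw into an 8-bit level d in [0, 255].
    d = 255.0 * (1.0 / maxZw - 1.0 / Zw) / (1.0 / maxZw - 1.0 / minZw)
    return round(d)

def unpack_depth(d, minZw, maxZw):
    # Equation 3-5: recover actual depth Zw from the stored level d.
    return 1.0 / ((d / 255.0) * (1.0 / minZw - 1.0 / maxZw) + 1.0 / maxZw)

# Round-trip with illustrative near/far depths.
minZw, maxZw = 1.0, 50.0
print(pack_depth(1.0, minZw, maxZw))   # nearest depth  -> 255 (brightest)
print(pack_depth(50.0, minZw, maxZw))  # farthest depth -> 0 (darkest)
print(unpack_depth(pack_depth(10.0, minZw, maxZw), minZw, maxZw))  # close to 10, up to 8-bit quantization error
```

Note that greater depth maps to a smaller d, which matches the convention of Figure 3.2 (darker pixels mean greater depth).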
Figure 3.3: 3D warping process. First, the 3D mesh generated using the depth map from reference camera Ci is unprojected to the global coordinate system. Then, the mesh is projected into a virtual view using camera Cvirtual's projection matrix.
3.3
3D warping
3D warping, suggested by McMillan [22], consists of constructing a
mesh for a scene with known geometry and warping that mesh from a reference
view into a desired virtual view. This choice is interesting not only for
its suitability to graphics hardware implementation, but also because a depth
map suggests a straightforward regular, dense structure for mesh construction.
Besides, the 3D warping technique does not suffer from side effects that impair
rendering quality in other methods, such as the blurring in the splatting
method.
As explained in detail by Debevec et al [10], the technique of view-
dependent texture-mapping (VDTM) can take advantage of projective texture-
mapping, a built-in feature of graphics hardware. In this framework, each view
gives rise to a view-dependent mesh, which is warped to the new viewpoint
and colored with the use of projective texture-mapping. Finally those warped
views are blended to generate the final image. In this section we focus on how
the 3D warping phase is performed.
In our work, we assume that input cameras’ calibration data (in the
form of matrices V and K) and depth images (color + depth) are provided as
input. With that information, it is possible to reconstruct the scene’s geometry
and apply the 3D warping process for view synthesis. This process is depicted
in Figure 3.3. Firstly, the depth map for reference view Ci is used to generate a
regular, triangular 3D mesh with the same resolution as the depth map. We
prefer not to downsample the depth map when constructing the 3D mesh, so
as to avoid blurring effects at objects' boundaries, as reported in [13]. In fact,
with current increases in GPU memory and bandwidth, such a dense mesh does
not present a problem for real implementations.
Initially, each mesh vertex contains values (Xi, Yi, d). The actual depth Zw
can be retrieved using Equation 3-5. The next step is to retrieve the vertex's global
coordinates (Xw, Yw, Zw), which corresponds to the unprojection step in Figure
3.3.
First, for brevity, we introduce matrix P into Equation 3-3, with P =
K × V:

\begin{pmatrix} wX_i \\ wY_i \\ w \end{pmatrix} =
\begin{pmatrix}
P_{11} & P_{12} & P_{13} & P_{14} \\
P_{21} & P_{22} & P_{23} & P_{24} \\
P_{31} & P_{32} & P_{33} & P_{34}
\end{pmatrix}
\begin{pmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{pmatrix} \quad (3-6)
That determines the following system of linear equations:
wXi = P11Xw + P12Yw + P13Zw + P14
wYi = P21Xw + P22Yw + P23Zw + P24
w = P31Xw + P32Yw + P33Zw + P34
In that system of equations, Xi, Yi, Zw and matrix P are known. So,
after some algebraic manipulation, we can solve the system for Xw and Yw:
Y_w = \frac{X_i(c_1P_{31} - c_2P_{21}) + Y_i(c_2P_{11} - c_0P_{31}) + c_0P_{21} - c_1P_{11}}{Y_i(P_{31}P_{12} - P_{32}P_{11}) + X_i(P_{21}P_{32} - P_{22}P_{31}) + P_{11}P_{22} - P_{21}P_{12}}

X_w = \frac{Y_w(P_{12} - P_{32}X_i) + c_0 - c_2X_i}{P_{31}X_i - P_{11}} \quad (3-7)
where:
c_0 = Z_wP_{13} + P_{14}
c_1 = Z_wP_{23} + P_{24}
c_2 = Z_wP_{33} + P_{34} \quad (3-8)
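These expressions translate directly into code. The plain-Python sketch below uses a hypothetical projection matrix P (identity pose, focal length 500, optical center (320, 240)); the round-trip recovers a world point from its pixel coordinates and known depth:

```python
def unproject(P, Xi, Yi, Zw):
    # Recover (Xw, Yw) from image coordinates and known depth Zw, solving the
    # linear system derived from Equation 3-6 (Equations 3-7 and 3-8).
    c0 = Zw * P[0][2] + P[0][3]
    c1 = Zw * P[1][2] + P[1][3]
    c2 = Zw * P[2][2] + P[2][3]
    num = (Xi * (c1 * P[2][0] - c2 * P[1][0])
           + Yi * (c2 * P[0][0] - c0 * P[2][0])
           + c0 * P[1][0] - c1 * P[0][0])
    den = (Yi * (P[2][0] * P[0][1] - P[2][1] * P[0][0])
           + Xi * (P[1][0] * P[2][1] - P[1][1] * P[2][0])
           + P[0][0] * P[1][1] - P[1][0] * P[0][1])
    Yw = num / den
    Xw = (Yw * (P[0][1] - P[2][1] * Xi) + c0 - c2 * Xi) / (P[2][0] * Xi - P[0][0])
    return Xw, Yw

# Hypothetical P = K x V: identity pose, f = 500, center (320, 240).
P = [[500.0, 0.0, 320.0, 0.0],
     [0.0, 500.0, 240.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
# The world point (1, 2, 5) projects to pixel (420, 440); unprojection recovers it.
print(unproject(P, 420.0, 440.0, 5.0))  # -> (1.0, 2.0)
```

In the actual implementation this computation runs in the vertex shader, with P and Zw provided per camera and per vertex.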
Finally, the 3D mesh must be projected to the new viewpoint Cvirtual,
using its calibration matrices V and K (Section 3.1) and Equation 3-3.
Figure 3.4 illustrates the result of the described 3D warping process for
an input depth image.
3.4
Artifacts inherent to 3D warping
Figure 3.4 shows that the 3D warping process generally succeeds in
generating high-quality synthesized views, since color is linearly interpolated
between the transformed sample locations. However, it assumes continuity
between neighboring samples, which does not hold at objects' boundaries. As
a result, "rubber sheets" appear between foreground and background objects
Figure 3.4: 3D warping result. A depth image (pair of images to the left) from a reference camera is projected into a virtual view (right image) using the described 3D warping process. In this case, virtual camera Cvirtual was placed slightly to the right of reference camera Ci.
Figure 3.5: 3D warping artifacts due to discontinuity in depth. The continuity assumption in the depth map does not hold at objects' boundaries, which causes undesirable artifacts in the form of stretched triangles in the warped view. When the depth map (left) is warped to a new viewpoint, the region to the right of the ballet dancer reveals occluded areas.
in the warped view. A close-up on this undesirable behavior is shown in Figure
3.5. Those regions with "rubber sheets" coincide with occlusion areas, i.e.
regions which have not been captured by reference camera Ci and that become
exposed when warping to the new viewpoint Cvirtual. To identify them, we label
vertices with a method similar to the one described by Zitnick et al [38].
Basically, we test a vertex against its four neighbors to label it as occluded
or not (actually the "occluded" label should be understood as not trustable,
meaning that it may belong to an occlusion region). We use a threshold on the
depth difference to label vertices as occluded (not trustable) or not occluded
(belonging to a region with continuous depth). Although a more sophisticated
border detection algorithm would yield better results, we found that the
proposed method gives good results and is very fast to perform on graphics
cards due to local caching.
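A CPU-side sketch of this labeling test, in plain Python (in the actual implementation it runs in the vertex shader; the depth values and threshold below are illustrative, not from any dataset):

```python
def label_occluded(depth, tau):
    # Label a vertex as not trustable (possibly occluded) when the depth
    # difference to any of its four neighbors exceeds the threshold tau.
    h, w = len(depth), len(depth[0])
    labels = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and abs(depth[y][x] - depth[ny][nx]) > tau:
                    labels[y][x] = 1  # may belong to an occlusion region
    return labels

# Toy depth map: a foreground strip (depth 2) over background (depth 10).
depth = [[10, 10, 2, 2],
         [10, 10, 2, 2]]
print(label_occluded(depth, tau=3))  # only the vertices along the discontinuity are flagged
```

Only the two columns straddling the depth step get the not-trustable label; the interiors of both regions remain trusted.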
Figure 3.6 shows a warped view with occlusion regions drawn in black,
thanks to our labeling approach. The missing information in occlusion regions
can be filled with data obtained from other reference cameras, through a
compositing technique. Notice in Figure 3.6 that our algorithm labeled some
regions of the floor as occluded, due to depth discontinuities above the used
Figure 3.6: 3D warping to a new viewpoint, with occluded regions drawn in black thanks to our labeling scheme.
threshold.
4
Compositing
As shown in the previous chapter, a single reference view generally does
not have enough information for synthesis of virtual views free from artifacts,
like those caused by occlusions. That is the motivation for using multiple views
for completing the missing information.
The original View-Dependent Texture-Mapping method suggested by
Debevec et al [10] uses many views to build up a virtual one. Their method
computes which polygons are visible in each image, and from which direction.
With that information, view maps are built for each polygon, and during
rendering the three closest viewing directions are chosen and their relative
blending weights computed.
In a simpler but still effective manner, Zitnick et al [38] showed that
good rendering results can be achieved using only two reference cameras for
compositing, provided the baseline of input cameras is not too wide. With that
approach, compositing involves only the trivial determination of which pair
of cameras to use for rendering, and the computation and usage of blending
weights for each pixel.
In this chapter we explain the process of compositing two reference views
for virtual view synthesis. In Section 4.1 we describe how the virtual camera
navigation in such a blending system works. In Section 4.2 we explain our
blending algorithm, and finally present the limitations of our compositing method
in Section 4.3.
4.1
Virtual camera navigation
We assume that the camera setup used to generate the input for our
method is similar to the one used by Zitnick et al [38], depicted in
Figure 4.1. The arrangement of input cameras along a 1D arc leads to a simple
way of interpolating cameras: the virtual camera can have its movement restricted
to the lines linking each pair of adjacent cameras, as shown in Figure 4.2.
When virtual camera navigation follows that restriction, its matrices
Kvirtual and Vvirtual can be determined by linear interpolation of the parameters
Figure 4.1: Example of camera setup: input cameras arranged along a 1D arc.
Figure 4.2: Navigation of virtual camera Cvirtual restricted to the lines linking the centers of adjacent pairs of input cameras.
Figure 4.3: The virtual camera's parameters can be determined as a linear interpolation of the adjacent pair of cameras.
for the adjacent pair of input cameras i and i + 1, namely Ci and Ci+1. This
interpolation process is illustrated in Figure 4.3, with t representing the
interpolation factor (0 ≤ t ≤ 1). We apply this interpolation process in
the proposed method. The virtual camera's calibration matrix Kvirtual can be
determined using calibration matrices Ki and Ki+1:
Kvirtual = (1 − t)Ki + tKi+1 (4-1)
A similar approach can be used for determining view matrix Vvirtual,
but with the additional previous step of decomposing view matrices Vi and
Vi+1 into eye positions, represented by vectors eyei and eyei+1, and rotations,
represented by quaternions Qi and Qi+1. The resulting interpolated vector
eyevirtual and quaternion Qvirtual can be converted back into Vvirtual.
Equation 4-2 describes the linear interpolation of eye position, and 4-3
represents the spherical linear interpolation of quaternions [4].
eyevirtual = (1 − t)eyei + t(eyei+1) (4-2)
Figure 4.4: Blending algorithm. Two reference views Ci and Ci+1, adjacent to virtual viewpoint Cvirtual, are used in the composition stage.
Qvirtual = slerp(t, Qi, Qi+1) (4-3)
Equations for decomposing view matrix V into eye position eye and
rotation quaternion Q, and building up V from eye and Q can be found in [4].
A more sophisticated camera navigation could certainly be employed using
similar interpolation principles. For instance, one could use a spline rather
than line segments for the virtual camera path, which would result in smoother
navigation.
Also, the viewpoint could move freely, without the constraint of paths
between cameras, but that would cause sampling issues with extreme
zooming in or out: holes might appear. We preferred instead to keep
the camera movement simple, because it is reasonable to assume that the input
camera arrangement can be planned taking the desired virtual paths into
consideration.
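The interpolation of Section 4.1 can be sketched in plain Python; the eye positions and quaternions below are illustrative stand-ins for decomposed Vi and Vi+1, and the slerp follows the standard formulation referenced in [4]:

```python
import math

def lerp(a, b, t):
    # Component-wise linear interpolation (used for eye positions, Equation 4-2).
    return [(1 - t) * x + t * y for x, y in zip(a, b)]

def slerp(t, q0, q1):
    # Spherical linear interpolation of unit quaternions (w, x, y, z), Equation 4-3.
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:                    # take the shorter arc
        q1, dot = [-c for c in q1], -dot
    if dot > 0.9995:                 # nearly parallel: fall back to lerp
        q = lerp(q0, q1, t)
    else:
        theta = math.acos(dot)
        s0 = math.sin((1 - t) * theta) / math.sin(theta)
        s1 = math.sin(t * theta) / math.sin(theta)
        q = [s0 * a + s1 * b for a, b in zip(q0, q1)]
    n = math.sqrt(sum(c * c for c in q))
    return [c / n for c in q]        # renormalize

# Illustrative adjacent cameras: eyes 1 unit apart, rotations of 0 and 90 degrees about Y.
eye_i, eye_i1 = [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]
q_i = [1.0, 0.0, 0.0, 0.0]                                    # identity rotation
q_i1 = [math.cos(math.pi / 4), 0.0, math.sin(math.pi / 4), 0.0]  # 90 degrees about Y
print(lerp(eye_i, eye_i1, 0.5))  # midpoint eye position
print(slerp(0.5, q_i, q_i1))     # 45-degree rotation about Y
```

At t = 0.5 the interpolated pose is exactly halfway between the two cameras, both in position and in rotation angle.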
4.2
Blending algorithm
Figure 4.4 illustrates the compositing process. Two reference cameras
have their views warped and occlusion areas identified as described in Chapter
3, and are then blended to generate a final image.

It is desirable that the compositing algorithm have the following
characteristics, which are justified below:
1. Angular distances between the reference cameras and the virtual camera
influence the weighting.

2. A per-pixel visibility test is performed.

3. Pixels marked as occluded are treated differently.
The first characteristic follows the suggestion of Buehler et al’s Unstruc-
tured Lumigraph Rendering [6]: when blending multiple views, angular dis-
tances between the desired viewpoint and the reference cameras’ positions
should be used to produce consistent blending. The closer the viewpoint is to
a reference camera Ci (in terms of angular distance), the more that reference
camera should affect the final color in the rendered image.
Figure 4.5(a) illustrates that concept. Angular distances θi and θi+1 are
used to measure the influence of each reference view in the final pixel color.
For smooth color interpolation [6], weight for view i can be calculated using a
cosine function adapted so that wAngi ∈ [0, 1]:
wAngi = 0.5(1 + cosπθi
θi + θi+1
) (4-4)
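Equation 4-4 can be exercised at its boundary cases with a few lines of plain Python:

```python
import math

def w_ang(theta_i, theta_i1):
    # Cosine-falloff weight in [0, 1] for reference view i (Equation 4-4).
    return 0.5 * (1.0 + math.cos(math.pi * theta_i / (theta_i + theta_i1)))

print(w_ang(0.0, 1.0))  # viewpoint coincides with camera i   -> 1.0
print(w_ang(1.0, 0.0))  # viewpoint coincides with camera i+1 -> 0.0
print(w_ang(0.5, 0.5))  # viewpoint halfway                   -> 0.5
```

The weight varies smoothly and sums to 1 with the symmetric weight of the other camera, which is what makes viewpoint transitions free of popping.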
Besides, as suggested by Porquet et al [26], a per-pixel visibility test
should be done; that justifies the second desired characteristic in the list.
However, to compensate for errors in depth estimates and camera
registration, that visibility test should be a soft-Z comparison, similar to the
method proposed by Zitnick et al [38].
When the difference between the depth Zwi for a pixel pi in view Ci and the
depth Zwi+1 for the equivalent pixel pi+1 in view Ci+1 is below a threshold value,
the color values from pi and pi+1 are blended. Otherwise, the color of the pixel
closest to the virtual camera Cvirtual is used. That scenario is depicted in Figure
4.5(b): in that case, the pixel from Ci (lying on the orange object) is much closer
to the virtual camera Cvirtual than is the pixel from Ci+1 (lying on the farther
green object), and therefore the former's color alone is used to define pv.
Finally, the third characteristic in the list suggests that pixels marked as
occluded be treated differently from the others. As already mentioned, those pixels
may produce rubber sheets or reveal unsampled areas in the warped image.
That need is exemplified in Figure 4.5(c), in which the pixel from Ci is marked
as occluded, and in fact that pixel belongs to a rubber sheet in the warped
view from Ci. Therefore, pv gets its color from pixel pi+1, from camera Ci+1.
Finally, having calculated weights for each reference camera (normalized
weights), the final pixel color color(pv) is computed:
color(pv) = color(pi)wi + color(pi+1)wi+1 (4-5)
4.3
Limitations
The compositing algorithm described in the previous section smoothly
interpolates the contribution from each reference view in a pixel-based approach.

4.5(a): Weights based on angular distances. 4.5(b): Per-pixel visibility test. 4.5(c): Pixels marked as occluded are ignored.

Figure 4.5: Cases considered in the compositing process.

Figure 4.6: The frontier r between regions A and B may become undesirably visible, due to the stepped behavior of our compositing algorithm when there are photometric (e.g. gain) differences between the cameras used for capture.

Even though this approach for smooth transitions behaves well in most
cases, it may cause an undesirable side effect.
Refer to Figure 4.6. Consider two neighboring regions (groups of connected
pixels) A and B in the final image. Say A contains pixels which were blended,
while B represents an occlusion area, so its pixels have color contribution from
only one reference view. In that situation, the frontier r between A and B may
become clearly visible, as a result of photometric differences between reference
views. Figure 4.7 exemplifies the appearance of seams. Applying a Gaussian blur
at the frontiers of A and B would be a simple solution to the problem, but
that would also soften the objects' boundaries. To solve the problem correctly,
an alternative approach should be used to blend those areas seamlessly, such as
gradient compositing [27].
Figure 4.7: Visible seams between occlusion and blended areas, to the right and to the left of the woman's head.
5
IBR on the GPU
In this chapter we describe our proposed implementation for IBR view
synthesis, adapting the methods highlighted in the previous chapters. We begin
by summarizing a basic conceptual algorithm and then develop it into a more
efficient version by exploiting the programmability of current GPUs.
5.1
Conceptual algorithm
Techniques explained in Chapters 3 and 4 lead to a complete solution
for rendering novel views from depth images, whose conceptual algorithm is
depicted in Figure 5.1. The first step is to read input depth images from the
disk into main memory, along with cameras’ matrices: calibration (K) and
view (V ). Then, view-dependent geometry is built for each input camera using
provided depth maps, as described in sections 3.2 and 3.3.
Afterwards, the virtual camera is configured using the adjacent pair of cameras'
Figure 5.1: Conceptual algorithm for novel view synthesis.
data, using an interpolation process such as the one mentioned in Section 4.1.
Later, the meshes for the adjacent cameras can be warped into the virtual view
as highlighted in Section 3.3, and occlusion areas can be identified as shown
in Section 3.4. Finally, the partial results are composited to generate the final
rendered image as described in Section 4.2.
The following sections detail these steps and introduce modifications to the
process so as to improve performance.
5.2
Creation of view-dependent geometry
The basic algorithm in the previous section suggests that view-dependent
geometry be created for all cameras on a per-frame basis. At every new frame, each
of the n input cameras' depth maps, with resolution WxH (assuming that all images
have the same resolution), needs to be processed to yield a 3D mesh with WxH
vertices, through the method mentioned in Section 3.3. Later in the process, two
out of those n meshes are sent to the GPU to be warped.
A key observation leads to an economy both in memory footprint
and in CPU-GPU transfer time. All generated meshes are regular and
share the same X, Y coordinates for corresponding vertices, provided the image
resolution is the same for all input cameras. Only the Z coordinate, which
comes from the depth map, changes from mesh to mesh.
So, a straightforward optimization is to generate a single vertex buffer
with a 2D mesh, containing only the Xi and Yi coordinates (following the notation
defined in Section 3.3). Assuming the image resolution is the same for every frame
and every input camera, the mesh can be created only once and stored in GPU
memory.
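The CPU-side construction of this shared mesh can be sketched as follows (plain Python; in the actual implementation the two arrays go into the vertex and index buffers, and the depth is fetched later in the vertex shader):

```python
def build_grid_mesh(w, h):
    # Shared 2D vertex grid (Xi, Yi) covering a w x h depth map, plus triangle
    # indices: two triangles per grid cell.
    vertices = [(x, y) for y in range(h) for x in range(w)]
    indices = []
    for y in range(h - 1):
        for x in range(w - 1):
            i = y * w + x
            indices += [i, i + 1, i + w,          # upper-left triangle of the cell
                        i + 1, i + w + 1, i + w]  # lower-right triangle of the cell
    return vertices, indices

v, idx = build_grid_mesh(3, 2)
print(len(v), len(idx) // 3)  # 6 vertices, 4 triangles
```

For a full 1024x768 depth map this yields one vertex per depth sample, which is the dense, non-downsampled mesh argued for in Section 3.3.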
At each new frame, the color images and depth maps for the pair of neighboring
cameras are sent as textures to the GPU, and the mesh defined in the vertex buffer
is rendered once for each camera. A vertex shader is responsible for fetching
the depth map at texture coordinates (Xi, Yi) and determining the coordinate
Zw, through the equations mentioned in Sections 3.2 and 3.3.
In our implementation, we create a static OpenGL vertex buffer object
(VBO) [31] to store the 2D mesh. With that simple modification, the
mesh is transferred to the GPU only once, in an initialization phase, yielding
a great improvement in rendering performance and transfer time. Figure 5.2
summarizes the described process. In the initialization phase we also create a
pair of render textures (frame buffer objects, or FBOs, in OpenGL), which are
used as temporary render targets for the pair of input cameras' warping results,
which are blended later.
Figure 5.2: Vertex buffer object used to improve performance during mesh generation.
5.3
Warping to novel viewpoint and occlusions identification
In our implementation, we use the same vertex shader mentioned in the
previous section to also perform the 3D warping and the identification of occlusion
areas. Inside the vertex shader, after determining the Zw coordinate, we retrieve
the vertex's global coordinates (Xw, Yw, Zw) from (Xi, Yi, Zw) using
expressions 3-7.

Finally, the vertex is projected into the novel viewpoint using the virtual
camera's calibration and view matrices, through Equation 3-3, carrying the
color information read from the color texture.

Besides, the same vertex shader is responsible for sampling the vertex's
neighbors to determine whether it belongs to an occlusion region or not,
with the method defined in Section 3.4.
Therefore the input of the vertex shader is:
– Minimum and maximum depth values for Zw unpacking (Section 3.2)
– Input camera projection matrix (K * V)
– Texture with color image
– Texture with depth map
– Virtual camera calibration and view matrices
It unpacks depth Zw, unprojects the vertex into the global coordinate system,
warps it into the virtual view and labels the vertex as belonging to an occlusion
area or not. In conclusion, the vertex shader output is:
– Occlusion label
Figure 5.3: Reference view rendering into an FBO, using vertex and fragment shaders.
– Depth Zw
– Texture coordinates for color texture, which will be interpolated by the
hardware rasterizer.
A fragment shader in our implementation is responsible for rendering the
final warped image into a frame buffer object, with both color and depth data. It
fetches the color texture with the input texture coordinates, stores the occlusion
label in the color's alpha channel and outputs depth Zw, in global coordinates.
Figure 5.3 summarizes the process of rendering a single reference view
into a temporary frame buffer.
5.4
Compositing on the GPU
After rendering the reference views into separate buffers, we perform
compositing: a full-screen quadrilateral is used to trigger a fragment shader, which
in turn performs the necessary blending computations mentioned in Chapter
4, and outputs the pixel values directly to the screen.
The input of this blending fragment shader is:
– Virtual and reference cameras' positions: used for angular distance
computation (Section 4.2)
– Textures with color and depth data for warping results of reference
cameras
Compositing is done in a similar fashion to the one described in Section
4.2. To meet the characteristics mentioned in that section, here we propose
and detail a penalty-based calculation of contribution weights for the reference
cameras.
The first desired characteristic for the compositing weights, namely the
influence of angular distance on a reference camera's weight, can be achieved by
Figure 5.4: Influence of angular distance on a reference camera's contribution during compositing.
using Equation 4-4 (Section 4.2). The behavior of that equation is depicted in
Figure 5.4. The cosine-based equation guarantees a smooth variation of weights
when changing the viewpoint, and also gives greater weight as the viewpoint
gets closer to a reference camera.
Next we add a per-pixel visibility test, using a threshold τ to account
for errors in calibration and depth estimation (refer to Section 4.2 for more
details). By comparing the depths of pixels pi and pi+1 from cameras Ci and Ci+1
(distances relative to the virtual camera), we can define visibility factors for
both reference cameras:
closer(p_i) = \begin{cases} 1, & \text{if } Z_i - Z_{i+1} < -\tau \\ 0, & \text{otherwise} \end{cases}

closer(p_{i+1}) = \begin{cases} 1, & \text{if } Z_{i+1} - Z_i < -\tau \\ 0, & \text{otherwise} \end{cases}
The interpretation of those factors is: when pixel pi is much closer to
the virtual camera than pi+1, wi should be increased, and decreased when pi+1 is
much closer. Those observations yield the modified equation for the weights (refer
to Equation 4-4 for the wAngi derivation):
wi = clamp(wAngi + closer(pi) − closer(pi+1), 0.0, 1.0) (5-1)
The missing part is the treatment of pixels marked as occluded. For that
we derive a penalty considering the following ideas:

– when pixel pi is marked occluded, weight wi should be penalized

– the penalty should not be applied when a reference camera almost
coincides with the virtual camera; otherwise it should strongly affect
that camera's weight
We propose a method based on angular distances to build a penalty term
for pixels marked as occluded. When the virtual camera is located very close to a
Figure 5.5: Penalty for pixels marked occluded based on angular distance.
reference camera, that camera's occlusion areas should not be heavily penalized,
since they barely reveal any occluded areas. On the other hand, when the virtual
camera is far from a reference camera, areas marked as occluded can certainly
expose unsampled areas, and therefore should be penalized in favor of the
other camera's data.
Equation 5-2 shows how this penalty is calculated in our implementation,
using the same notation defined in Section 4.2:
\phi_i = occ_i \left( \frac{a\,\theta_i}{\theta_i + \theta_{i+1}} \right)^4 \quad (5-2)
Its behavior is depicted in Figure 5.5. Coefficient a controls the increase
due to angular distance (we used a = 30 in our implementation for the datasets
used). occi corresponds to the occlusion label defined in previous sections.
We incorporate this penalty into the weight equation to get the final weight
value wi for camera Ci, and Equation 5-1 turns into:

w_i = \min(1.0,\; \phi_{i+1} + (1 - \phi_i)\,clamp(w_{Ang_i} + \phi_i\,closer(p_i) - \phi_{i+1}\,closer(p_{i+1}),\; 0.0,\; 1.0)) \quad (5-3)

Basically, the penalty for pixels marked occluded attenuates both the angular
distance weight and the visibility test results. That equation summarizes
the mentioned desired characteristics:
– when virtual camera is very close to a reference camera, the virtual image
is almost identical to the reference one
– transitions are smooth and based on angular distances
– pixels marked occluded are treated differently to avoid rubber sheets on
unsampled regions
– visibility ordering is enforced
Those weights are used to determine the final composited color as defined
in equation 4-5 from Section 4.2 (since wi is normalized, wi+1 = 1 − wi).
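A literal transcription of Equations 4-4, 5-2 and 5-3 into plain Python reads as follows (τ and a are the tunable parameters discussed above; the example values are illustrative, not from any dataset):

```python
import math

def blend_weight(theta_i, theta_i1, Zi, Zi1, occ_i, occ_i1, tau=0.1, a=30.0):
    # Contribution weight w_i of camera C_i (Equations 4-4, 5-2 and 5-3).
    w_ang = 0.5 * (1.0 + math.cos(math.pi * theta_i / (theta_i + theta_i1)))
    closer_i = 1.0 if Zi - Zi1 < -tau else 0.0    # soft-Z visibility test
    closer_i1 = 1.0 if Zi1 - Zi < -tau else 0.0
    phi_i = occ_i * (a * theta_i / (theta_i + theta_i1)) ** 4      # occlusion penalty
    phi_i1 = occ_i1 * (a * theta_i1 / (theta_i + theta_i1)) ** 4
    clamped = max(0.0, min(1.0, w_ang + phi_i * closer_i - phi_i1 * closer_i1))
    return min(1.0, phi_i1 + (1.0 - phi_i) * clamped)

# With no occlusions and matching depths, the weight reduces to w_ang.
print(blend_weight(0.5, 0.5, Zi=5.0, Zi1=5.0, occ_i=0, occ_i1=0))  # reduces to w_ang = 0.5
print(blend_weight(0.0, 1.0, Zi=5.0, Zi1=5.0, occ_i=0, occ_i1=0))  # coincident with C_i -> 1.0
```

When both occlusion labels are zero, the penalties vanish and the expression collapses to the pure cosine weight, which is the behavior stated in the first two characteristics above.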
Finally, we can summarize our proposed technique considering all the
optimizations mentioned in the previous sections. The following algorithm
overviews the steps involved in rendering a novel view.
Phase 1: Initialization
1.1 Create and transfer vertex buffer object for a 2D mesh with the
same resolution as input images
1.2 Create two frame buffer objects for temporary storage
1.3 Create textures, vertex shaders and fragment shaders
Phase 2: Rendering reference views separately
2.1 [CPU] Read input images and cameras’ data
2.2 [CPU] Update virtual camera’s position and orientation
2.3 [CPU] Determine neighbor cameras i and i + 1
2.4 [CPU] Activate FBO1 as render target
2.5 [GPU] Render camera i (3D warping and occlusion labeling)
2.6 [CPU] Activate FBO2 as render target
2.7 [GPU] Render camera i+1 (3D warping and occlusion labeling)
Phase 3: Compositing
3.1 [CPU] Use screen as render target
3.2 [CPU] Draw a full-screen quadrilateral to trigger compositing
fragment shader
3.3 [GPU] Compute angular distances, penalties and weights
3.4 [GPU] Composite color from reference views using final com-
puted contribution weights
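Step 2.3 above selects the two reference cameras that bracket the virtual viewpoint. As a minimal sketch, assuming the cameras are ordered along the capture arc and selection is by angular distance about a scene center point (the function names and the selection rule are illustrative, not the thesis code):

```python
import math

def angular_distance(cam_pos, virt_pos, center):
    """Angle between the rays from the scene center to the two camera positions."""
    def unit(v):
        length = math.sqrt(sum(c * c for c in v))
        return tuple(c / length for c in v)
    a = unit(tuple(p - c for p, c in zip(cam_pos, center)))
    b = unit(tuple(p - c for p, c in zip(virt_pos, center)))
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    return math.acos(dot)

def pick_neighbors(cam_positions, virt_pos, center):
    """Return indices (i, i+1) of the adjacent pair of reference cameras
    closest in angle to the virtual camera (cameras assumed ordered on the arc)."""
    dists = [angular_distance(p, virt_pos, center) for p in cam_positions]
    i = min(range(len(cam_positions) - 1), key=lambda k: dists[k] + dists[k + 1])
    return i, i + 1
```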
6
Results
We tested our rendering framework with frames from the breakdancing
and ballet scenes provided by the Interactive Visual Media Group of Microsoft
Research [38]. Each dataset consists of a sequence of 100 frames of a dynamic
scene. Their system captured the dynamic scene with 8 high-resolution
(1024x768) video cameras arranged along a 1D arc, spanning about 30° from
one end to the other. Each frame of each sequence has two images: one BMP
color image of the real scene and one BMP image for the associated depth map,
which was estimated with their specific segmentation-based stereo algorithm.
Along with the frames for the dynamic scenes, they provide calibration
data for each of the 8 cameras used for capture, consisting of the calibration
and rotation matrices. Since our framework uses OpenGL for rendering,
these matrices are converted to the analogous OpenGL matrices, namely the
projection matrix and the modelview matrix, respectively.
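This conversion can be sketched as follows. This is a hedged illustration, not the thesis code: it assumes the usual computer-vision conventions, with intrinsics K = [[fx, s, cx], [0, fy, cy], [0, 0, 1]] and extrinsics [R | t] mapping world to camera coordinates, and flips the y and z axes because OpenGL's camera looks down −z while the vision convention looks down +z; real datasets may differ in image-origin convention.

```python
import numpy as np

def modelview_from_extrinsics(R, t):
    """Build an OpenGL modelview matrix from rotation R and translation t
    (world -> camera). The diagonal flip converts the vision camera frame
    (+z forward, y down) to the OpenGL frame (-z forward, y up)."""
    M = np.eye(4)
    M[:3, :3] = R
    M[:3, 3] = t
    flip = np.diag([1.0, -1.0, -1.0, 1.0])
    return flip @ M

def projection_from_intrinsics(K, width, height, near, far):
    """Build an OpenGL projection matrix from intrinsics K, assuming the
    principal point (cx, cy) is measured in pixels from the image corner."""
    fx, fy = K[0][0], K[1][1]
    cx, cy = K[0][2], K[1][2]
    return np.array([
        [2 * fx / width, 0.0,             1.0 - 2 * cx / width,       0.0],
        [0.0,            2 * fy / height, 2 * cy / height - 1.0,      0.0],
        [0.0,            0.0,             -(far + near) / (far - near),
                                          -2 * far * near / (far - near)],
        [0.0,            0.0,             -1.0,                       0.0],
    ])
```

As a sanity check, a point on the optical axis at the near-plane distance should map to NDC depth −1 after the perspective divide.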
Even though the depth maps provided in the datasets cannot be considered
ground truth, their quality is acceptable for testing the effectiveness of
our proposed rendering method. Figure 6.1 shows sample frames from both
scenes.
6.1
Rendering quality
In the first test, devised to evaluate rendering quality, we used a single
frame from the ballet and breakdancers sequences as input and placed the
viewpoint midway between two cameras. Figure 6.2 shows the result
of this test for frame 48 of the ballet sequence, using depth images from cameras 4
and 5 as input, while Figure 6.3 depicts the result for frame 88 of the breakdancers
sequence, using cameras 2 and 3 as input. Note how each view
consistently fills in the missing parts of its counterpart's occlusion areas,
generating no visible artifacts.
The second test consisted of placing the virtual camera coincident with
an input camera's position, to check the amount of artifacts
introduced by the proposed method. Figures 6.4 and 6.5 show that although
Figure 6.1: Sample data from Ballet (left) and Breakdancers(right) sequences:color images on top, and corresponding depth images below [38].
Figure 6.2: Synthesized images for one frame of ballet sequence. Left andmiddle columns respectively correspond to cameras’ 5 and 4 warping results,and right column is the final result. First row: occlusion areas not identified andrubber sheets appear. Second row: we apply the proposed labeling approach.
Figure 6.3: Synthesized images for one frame of breakdancers sequence. Leftand middle columns respectively correspond to cameras’ 3 and 2 warpingresults, and right column is the final result. First row: occlusion areas notidentified and rubber sheets appear. Second row: we apply the proposedlabeling approach.
Figure 6.4: Virtual camera positioned coincidently with camera’s 5 position inballet sequence, and cameras 5 and 6 used as reference cameras. Left column:original image. Middle column: synthetic image. Right column: differencesbetween real and synthetic images. Negated for ease of printing: equal pixelsin white.
Figure 6.5: Virtual camera positioned coincidently with camera’s 5 position inballet sequence, and cameras 5 and 6 used as reference cameras. Left column:original image. Middle column: synthetic image. Right column: differencesbetween real and synthetic images. Negated for ease of printing: equal pixelsin white.
some differences exist between the real and synthetic images, they are barely
noticeable to the human eye. However, these differences occur mainly at object
boundaries, which can become a problem if the method is applied to videos:
cracks may appear during frame transitions [38].
We also tested the proposed method on videos. We built videos from the
image sequences provided by Zitnick et al. [38]: for each input camera, two
videos were generated, one for color and one for depth, keeping the acquisition
frame rate of 15 FPS. These videos then served as input to our method, and
some results can be seen on the supporting website [24].
The proposed method performs reasonably well on videos, producing
smooth camera interpolation in real time, despite the appearance of some
artifacts between consecutive frames. This behavior is expected, since we do
not enforce temporal coherence during rendering.
Figure 6.6: Close-up of rendering result for frame 48 in ballet sequence. Leftcolumn: original photo from camera 7. Middle column: estimated depth map.Right column: visible seams below dancer caused by wrong depth estimates.
Figure 6.7: Close-up of rendering result for frame 88 in breakdancers sequence.Left column: original photo from camera 7. Middle column: estimated depthmap. Right column: visible seams close to dancer’s foot caused by wrong depthestimates.
6.2
Limitations
Errors in depth-map estimates may cause artifacts, as can be seen in
Figures 6.6 and 6.7.
In the case of Figure 6.6, the depth Z estimated for the dancer's shorts
equals the background's depth. Since the occlusion areas identified by the
proposed algorithm are based on the erroneous depth map, they do not coincide
with the real object boundaries, and seams become noticeable.
Figure 6.7 depicts the same problem for the breakdancers sequence, where
wrong depth estimates also result in visible seams.
Another limitation of the proposed algorithm is the 'shadowing' effect
mentioned in Section 4.3, which happens on the frontiers between occlusion
areas and blended regions. Figures 6.8(a) and 6.8(b) are examples of this issue,
which can be solved by prior color calibration of the cameras. Finally, to achieve
real-time performance, the proposed method does not handle the matting
problem directly. Zitnick et al. [38] deal with the matting problem effectively,
but calculate mattes during an offline step. Our simplification yields a
speed-up compared to their approach, avoiding the need for a
6.8(a): Shadowing effect for ballet se-quence.
6.8(b): Shadowing effect for break-dancers sequence.
Figure 6.8: Close-ups of rendering artifacts (shadowing) for frame 48 in balletsequence and frame 88 in breakdancers sequence.
6.9(a): Matting absence causes arti-facts for ballet sequence.
6.9(b): Matting absence causes arti-facts for breakdancers sequence.
Figure 6.9: Close-ups of rendering artifacts (due to matting absence) for frame48 in ballet sequence and frame 88 in breakdancers sequence.
pre-processing step, but also creates some minor artifacts, as shown in Figures
6.9(a) and 6.9(b).
6.3
Time-performance analysis
It is desirable for any IBR method that its time performance depend
solely, or at least mainly, on the input images' resolution. In other words, the
complexity of the captured scene should not severely affect the time needed
for view synthesis.
To verify this characteristic in our method, we used a test set consisting
of five sets of depth images, with dimensions 320x240, 640x480, 1024x768,
1600x1200 and 1920x1440. The computer used was a workstation with an
Intel® Core 2 Quad 2.4 GHz CPU, 2 GB of RAM, and an NVidia® GeForce
9800 GTX graphics card. The graph in Figure 6.10 depicts the results of the
test on that machine: render time is linear in the total number of input pixels,
as desired.
Moreover, we can notice that the method can be effectively applied for rendering
Figure 6.10: Render time vs. total number of input pixels.
high-definition (HD) images at interactive rates: a 30 FPS rate was achieved
using the mentioned workstation.
6.4
Summary
In this chapter we analyzed our method. We verified that our algorithm
generates few noticeable artifacts in rendered images, but also that it has some
limitations.
We also analyzed the general time performance expected for our system,
and concluded that it can be applied to render HD images at interactive rates.
7
Conclusion and Future Work
This work described a method for rendering virtual views of a scene in
real time, using a collection of depth images and camera calibration data as
input. We proposed modifications to existing techniques to speed up the
rendering process, which was accomplished by exploiting the graphics
hardware.
Even though the images generated by our method present an acceptable
level of realism, they contain some artifacts, especially at object borders,
which coincide with areas of discontinuity in the input depth images.
Improving the proposed method for identifying depth discontinuities, or
devising alternatives to the proposed compositing scheme, could alleviate this
deficiency. Related works apply matting to avoid artifacts and cracking at
object borders, but in an offline stage. A real-time alternative would be
preferable, so it deserves further investigation.
Another limitation of our work is that seams appear when blending color-
flat areas captured by cameras that have not been color-calibrated. Related
works do not mention how they deal with this problem, but apparently some
pre- or post-processing is done to account for it. A more desirable approach
would be to modify the proposed method to remove those seams.
Overall, the application developed for this work can be effectively used
to manipulate the virtual viewpoint and to evaluate the rendering resulting
quality in real-time, even for full HD images. It can be further improved to
integrate new rendering methods or new means of interaction.
We plan to continue this work in four directions. First, address the
weaknesses of the proposed blending method by using gradient-domain
compositing to avoid artifacts due to gain differences between cameras. Second,
develop a robust, real-time module for estimating a dense depth map
for each input image. Third, extend this framework to work with videos,
taking spatio-temporal coherence into account both in the geometry estimation
and in the rendering method. Finally, we intend to build a real-time
framework integrating these ideas, capable of rendering high-quality videos
with a free viewpoint at interactive rates.
Bibliography
[1] Warner bros. http://www.warnerbros.com/.
[2] Zcam. http://www.3dvsystems.com/.
[3] ADELSON, E. H.; BERGEN, J. R. The plenoptic function and
the elements of early vision. In: COMPUTATIONAL MODELS OF
VISUAL PROCESSING, p. 3–20. MIT Press, 1991.
[4] AKENINE-MOLLER, T.; HAINES, E. ; HOFFMAN, N. Real-Time
Rendering 3rd Edition. A. K. Peters, Ltd., Natick, MA, USA, 2008.
[5] BUEHLER, C.; BOSSE, M.; MCMILLAN, L.; GORTLER, S. ; CO-
HEN, M. Unstructured lumigraph rendering. In: SIGGRAPH ’01:
PROCEEDINGS OF THE 28TH ANNUAL CONFERENCE ON COM-
PUTER GRAPHICS AND INTERACTIVE TECHNIQUES, p. 425–432,
New York, NY, USA, 2001. ACM.
[6] BUEHLER, C.; BOSSE, M.; MCMILLAN, L.; GORTLER, S. ; CO-
HEN, M. Unstructured lumigraph rendering. In: IN COMPUTER
GRAPHICS, SIGGRAPH 2001 PROCEEDINGS, p. 425–432, 2001.
[7] CHEN, S. E. Quicktime vr: an image-based approach to virtual
environment navigation. In: SIGGRAPH ’95: PROCEEDINGS OF
THE 22ND ANNUAL CONFERENCE ON COMPUTER GRAPHICS
AND INTERACTIVE TECHNIQUES, p. 29–38, New York, NY, USA,
1995. ACM.
[8] CHEN, S. E.; WILLIAMS, L. View interpolation for image syn-
thesis. In: SIGGRAPH ’93: PROCEEDINGS OF THE 20TH ANNUAL
CONFERENCE ON COMPUTER GRAPHICS AND INTERACTIVE
TECHNIQUES, p. 279–288, New York, NY, USA, 1993. ACM.
[9] CHUANG, Y.-Y.; CURLESS, B.; SALESIN, D. H. ; SZELISKI, R. A
bayesian approach to digital matting. In: PROCEEDINGS OF IEEE
CVPR 2001, volume 2, p. 264–271. IEEE Computer Society, December
2001.
[10] DEBEVEC, P.; YU, Y. ; BOSHOKOV, G. Efficient view-dependent
image-based rendering with projective texture-mapping. Techni-
cal Report UCB/CSD-98-1003, EECS Department, University of Califor-
nia, Berkeley, 1998.
[11] DEBEVEC, P. E. Modeling and Rendering Architecture from
Photographs. PhD thesis, University of California at Berkeley, Com-
puter Science Division, Berkeley CA, 1996.
[12] DONNER, C.; JENSEN, H. W. Light diffusion in multi-layered
translucent materials. ACM Trans. Graph., 24(3):1032–1039, 2005.
[13] GOLDLUCKE, B.; MAGNOR, M. ; WILBURN, B. Hardware-
accelerated dynamic light field rendering. In: Greiner, G.; Niemann,
H.; Ertl, T.; Girod, B. ; Seidel, H.-P., editors, PROCEEDINGS VISION,
MODELING AND VISUALIZATION VMV 2002, p. 455–462, Erlangen,
Germany, November 2002. aka.
[14] GORTLER, S. J.; GRZESZCZUK, R.; SZELISKI, R. ; COHEN, M. F.
The lumigraph. In: SIGGRAPH ’96: PROCEEDINGS OF THE 23RD
ANNUAL CONFERENCE ON COMPUTER GRAPHICS AND INTER-
ACTIVE TECHNIQUES, p. 43–54, New York, NY, USA, 1996. ACM.
[15] GREENE, N.; HECKBERT, P. S. Creating raster omnimax images
from multiple perspective views using the elliptical weighted
average filter. IEEE Comput. Graph. Appl., 6(6):21–27, 1986.
[16] HARTLEY, R. I.; ZISSERMAN, A. Multiple View Geometry in
Computer Vision. Cambridge University Press, ISBN: 0521623049,
2000.
[17] KANADE, T.; RANDER, P. ; NARAYANAN, P. J. Virtualized reality:
Constructing virtual worlds from real scenes. IEEE MultiMedia,
4(1):34–47, 1997.
[18] KANG, S. B.; SZELISKI, R. Extracting view-dependent depth
maps from a collection of images. Int. J. Comput. Vision, 58(2):139–
163, 2004.
[19] LEVOY, M.; HANRAHAN, P. Light field rendering. In: SIGGRAPH
’96: PROCEEDINGS OF THE 23RD ANNUAL CONFERENCE ON
COMPUTER GRAPHICS AND INTERACTIVE TECHNIQUES, p. 31–
42, New York, NY, USA, 1996. ACM.
[20] MACMILLAN, T. The wizard of the toric camera, 1986.
[21] MAGNOR, M. A. Video-Based Rendering. AK Peters Ltd, 2005.
[22] MCMILLAN, L. An image-based approach to three-dimensional
computer graphics. Technical Report UNC/TR97-013, UNC Computer
Science, University of North Carolina, 1997.
[23] MCMILLAN, L.; BISHOP, G. Plenoptic modeling: an image-based
rendering system. In: SIGGRAPH ’95: PROCEEDINGS OF THE
22ND ANNUAL CONFERENCE ON COMPUTER GRAPHICS AND
INTERACTIVE TECHNIQUES, p. 39–46, New York, NY, USA, 1995.
ACM.
[24] PALOMO, C. M. Master’s thesis supporting site.
http://www.tecgraf.puc-rio.br/˜cpalomo/thesis/, 2009.
[25] PHARR, M.; FERNANDO, R. GPU Gems 2: Programming Tech-
niques for High-Performance Graphics and General-Purpose
Computation (Gpu Gems). Addison-Wesley Professional, 2005.
[26] PORQUET, D.; DISCHLER, J.-M. ; GHAZANFARPOUR, D. Real-
time high-quality view-dependent texture mapping using per-
pixel visibility. In: GRAPHITE ’05: PROCEEDINGS OF THE
3RD INTERNATIONAL CONFERENCE ON COMPUTER GRAPHICS
AND INTERACTIVE TECHNIQUES IN AUSTRALASIA AND SOUTH
EAST ASIA, p. 213–220, New York, NY, USA, 2005. ACM.
[27] PÉREZ, P.; GANGNET, M. ; BLAKE, A. Poisson image editing. ACM
Transactions on Graphics (SIGGRAPH'03), 22(3):313–318, 2003.
[28] SCHARSTEIN, D.; SZELISKI, R. ; ZABIH, R. A taxonomy and
evaluation of dense two-frame stereo correspondence algorithms.
International Journal of Computer Vision, 47:7–42, 2002.
[29] SEITZ, S. M.; DYER, C. R. View morphing. In: SIGGRAPH
’96: PROCEEDINGS OF THE 23RD ANNUAL CONFERENCE ON
COMPUTER GRAPHICS AND INTERACTIVE TECHNIQUES, p. 21–
30, New York, NY, USA, 1996. ACM.
[30] SHADE, J.; GORTLER, S.; HE, L.-W. ; SZELISKI, R. Layered depth
images. In: SIGGRAPH ’98: PROCEEDINGS OF THE 25TH ANNUAL
CONFERENCE ON COMPUTER GRAPHICS AND INTERACTIVE
TECHNIQUES, p. 231–242, New York, NY, USA, 1998. ACM.
[31] SHREINER, D.; WOO, M. ; NEIDER, J. OpenGL(R) Programming
Guide: The Official Guide to Learning OpenGL, Version 1.2.
Addison-Wesley Longman, Amsterdam, ISBN: 0521623049, 3rd edition,
2000.
[32] SZELISKI, R.; SHUM, H.-Y. Creating full view panoramic image
mosaics and environment maps. In: SIGGRAPH ’97: PROCEED-
INGS OF THE 24TH ANNUAL CONFERENCE ON COMPUTER
GRAPHICS AND INTERACTIVE TECHNIQUES, p. 251–258, New
York, NY, USA, 1997. ACM Press/Addison-Wesley Publishing Co.
[33] TANIMOTO, M. Ftv (free viewpoint television) for 3d scene
reproduction and creation. In: CVPRW ’06: PROCEEDINGS OF
THE 2006 CONFERENCE ON COMPUTER VISION AND PATTERN
RECOGNITION WORKSHOP, p. 172, Washington, DC, USA, 2006.
IEEE Computer Society.
[34] TAYLOR, D. Timetrack. http://www.timetrack.com/.
[35] VAISH, V.; WILBURN, B.; JOSHI, N. ; LEVOY, M. Using plane +
parallax for calibrating dense camera arrays. Computer Vision and
Pattern Recognition, IEEE Computer Society Conference on, 1:2–9, 2004.
[36] VEDULA, S.; BAKER, S. ; KANADE, T. Image-based spatio-
temporal modeling and view interpolation of dynamic events.
ACM Trans. Graph., 24(2):240–261, 2005.
[37] WILBURN, B.; SMULSKI, M.; LEE, K. ; HOROWITZ, M. A. The light
field video camera. In: IN MEDIA PROCESSORS 2002, p. 29–36, 2002.
[38] ZITNICK, L. C.; KANG, S. B.; UYTTENDAELE, M.; WINDER, S.
; SZELISKI, R. High-quality video view interpolation using a
layered representation. In: SIGGRAPH ’04: ACM SIGGRAPH 2004
PAPERS, p. 600–608, New York, NY, USA, 2004. ACM.