
Signal Processing: Image Communication 14 (1999) 817–840

3D object articulation and motion estimation in model-based stereoscopic videoconference image sequence analysis and coding¹

Dimitrios Tzovaras*, Ioannis Kompatsiaris, Michael G. Strintzis

Information Processing Laboratory, Electrical and Computer Engineering Department, Aristotle University of Thessaloniki, Thessaloniki 54006, Greece

Received 29 November 1996

* Corresponding author. Tel.: +30 31 996 359; fax: +30 31 996 398; e-mail: tzovaras@dion.ee.auth.gr
¹ This work was supported by the EU CEC Project ACTS PANORAMA (Package for New Autostereoscopic Multiview Systems and Applications, ACTS project 092).

Abstract

This paper describes a procedure for model-based analysis and coding of both left and right channels of a stereoscopic image sequence. The proposed scheme starts with a hierarchical dynamic programming technique for matching across the epipolar line for efficient disparity/depth estimation. Foreground/background segmentation is initially based on depth estimation and is improved using motion and luminance information. The model is initialised by the adaptation of a wireframe model to the consistent depth information. Robust classification techniques are then used to obtain an articulated description of the foreground of the scene (head, neck, shoulders). The object articulation procedure is based on a novel scheme for the segmentation of the rigid 3D motion fields of the triangle patches of the 3D model object. Spatial neighbourhood constraints are used to improve the reliability of the original triangle motion estimation. The motion estimation and motion field segmentation procedures are repeated iteratively until a satisfactory object articulation emerges. The rigid 3D motion is then re-computed for each sub-object and finally, a novel technique is used to estimate flexible motion of the nodes of the wireframe from the rigid 3D motion vectors computed for the wireframe triangles containing each specific node. The performance of the resulting analysis and compression method is evaluated experimentally. © 1999 Elsevier Science B.V. All rights reserved.

Keywords: Stereoscopic image sequence analysis; Model-based coding; Object articulation; Non-rigid 3D motion estimation

1. Introduction

The transmission of full-motion video through limited capacity channels is critically dependent on the ability of the compression schemes to reach target bit-rates while still maintaining acceptable visual quality [15]. In order to achieve this, motion estimation and motion compensated prediction are frequently used, so as to reduce temporal redundancy in image sequences [22]. Similarly, in the coding of stereo and multiview images, prediction may be based on disparity compensation [33] or on the best of motion and disparity compensation [34].

Stereoscopic video processing has recently been the focus of considerable attention in the literature [3,6,10,13,16,24,26,28,32]. A stereoscopic pair of image sequences, recorded with a difference in the view angle, allows the three-dimensional (3D) perception of the scene by the human observer, by exposing to each eye the respective image sequence. This creates an enhanced 3D feeling and increased 'telepresence' in teleconferencing and several other (medical, entertainment, etc.) applications.

In both monoscopic and stereoscopic vision, the ability of model-based techniques to describe a scene in a structural way has opened new areas of applications. Video production, realistic computer graphics, multimedia interfaces and medical visualisation are some of the applications that may benefit by exploiting the potential of model-based schemes.

Object-based techniques have been extensively investigated for monoscopic image sequence coding [4,11,12,21]. Several object-oriented coding schemes have also been proposed for stereoscopic image sequence coding [7,24,25,29,31,32]. The advantages of using model-based techniques for stereo image sequence coding were reviewed in [25], where a feature-based 3D motion estimation scheme was presented. In [24], disparity is estimated using a dynamic programming scheme and is subsequently used for object segmentation. The segmentation algorithm is based on region growing, and the criterion used for the definition of each object is based on the homogeneity of the respective disparity fields. In [10], the objects in the scene are identified using a segmentation method based on the homogeneity of the 2D motion field computed by a block matching procedure. Then the 3D motion of each object is modeled using the approach presented in [1], with depth estimated from disparity. Finally, an interframe coding scheme based on 3D motion compensation is evaluated. A disadvantage of the segmentation technique used in this procedure is its failure to guarantee high performance of the resulting 3D motion compensation method.

Alternatively, 3D models of objects may be derived from stereo images. This usually requires estimation of dense disparity fields, postprocessing to remove erroneous estimates and fitting of a parametrised surface model to the calculated depth map [14]. In [17] an algorithm was presented which optimally models the scene using a hierarchically structured wire-frame model derived directly from intensity images. The wire-frame model consists of adjacent triangles that may be split into smaller ones over areas that need to be represented in higher detail. The motion of the model surface using both rigid and non-rigid body assumptions is estimated concurrently with depth parameters. Knowledge-based image sequence coding has also attracted much interest recently, especially for the coding of facial image sequences in videophone applications. In [2], one such method is based on the generation of a generic face model and the use of efficient techniques for rigid and flexible 3D motion estimation.

In the present paper, a procedure for model-based analysis and coding of both left and right channels of a stereoscopic image sequence is proposed. The methodology used overcomes a major obstacle in stereoscopic video coding, caused by the difficult problem of determining and handling coherently corresponding objects in the left and right images. This is achieved in this paper by defining segmentation and object articulation in the 3D space, thus ensuring that all ensuing operations remain coherent for both the left and the right aspects of the scene. Each object is described by a mesh consisting of a set of interconnected triangles. The 3D motion of each triangle is estimated using a robust algorithm for the minimisation of the least median of squares error and by imposing neighbourhood constraints, such as introduced in [18,19], to guarantee the smoothness of the resulting vector field. A novel iterative object articulation technique for stereoscopic image sequences is then used to segment the 3D vector field and thus to derive a foreground object articulation. Triangle motion estimation and classification are repeated iteratively until satisfactory object articulation is achieved. Rigid 3D motion estimation is performed next for each resulting sub-object, using motion information from both left and right cameras. Finally, a procedure is proposed for the estimation of the non-rigid motion of each wireframe node based on the 3D motion of the neighbouring wireframe triangles.

Fig. 1. Stereoscopic camera geometry.

The paper is organised as follows. In Section 2 the camera geometry of the stereoscopic system is described. Next, in Section 3 an overview of the proposed stereoscopic image sequence analysis system is presented. Section 4 presents the techniques used for disparity/depth estimation, foreground/background segmentation and model initialisation, and Section 5 describes the initial adaptation of the 3D model to the foreground object. The technique used for object articulation is examined in Section 6, while the 3D motion estimation procedure used is presented in Section 7. The rigid 3D motion estimation procedure for each articulated 3D object is discussed in Section 7.1. Finally, in Section 7.2, an approach is considered for non-rigid motion estimation based on the rigid 3D motion vectors of small surface patches, computed during the object articulation procedure. Experimental results given in Section 8 demonstrate the performance of the proposed methods. Conclusions are drawn in Section 9.

2. Camera geometry

The geometry of the stereoscopic camera arrangement used is shown in Fig. 1, where three reference coordinate frames are defined:

• World reference frame, attached to the imaged scene.
• Camera reference frame, attached to the camera system. Notice that the Z-axis is the optical axis, while the $\bar X_c$ and $\bar Y_c$ axes are parallel to the image plane. Here c refers to the respective camera, i.e. c = l, r for the left and right cameras, respectively.
• Image reference frame, where the $X^f_c$ and $Y^f_c$ axes, respectively, define the horizontal and vertical directions on the digital image, where again c = l, r refers to the images produced by the left and right cameras, respectively.


Fig. 2. The proposed stereoscopic image sequence coding scheme.

The camera geometry is described by the following set of equations mapping the 3D world-coordinates (x_w, y_w, z_w) of a generic point P_w into the 2D coordinates $(X^f_c, Y^f_c)$ of its projection on the image planes:

• Change of reference frame from world-coordinates to camera-coordinates:

$$P_c = \begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} = R_c \begin{bmatrix} x_w \\ y_w \\ z_w \end{bmatrix} + T_c, \qquad (1)$$

where c = l, c = r for the left and right cameras, respectively, and R_c and T_c are, respectively, the rotation matrix and the translation vector.

• Perspective projection of a scene point to the image plane (the centre of projection is the centre of the lens and the projection plane is the camera CCD sensor):

$$\bar P = \begin{bmatrix} \bar X_c \\ \bar Y_c \end{bmatrix} = \frac{f_c}{z_c} \begin{bmatrix} x_c \\ y_c \end{bmatrix}, \qquad (2)$$

• Change of coordinate frame from camera-coordinates $(\bar X_c, \bar Y_c)$ to image coordinates $(X^f_c, Y^f_c)$. This operation simply consists of a 2D translation and scale change:

$$X^f_c = C_{x_c} + \frac{\bar X_c}{d_x}, \qquad Y^f_c = C_{y_c} + \frac{\bar Y_c}{d_y}, \qquad (3)$$

where d_x and d_y are the horizontal and vertical size of an image pixel, respectively, and $(C_{x_c}, C_{y_c})$ are the image coordinates of the optical centre OC in camera c.

As seen from the above description, the camera geometry is completely specified by a small set of parameters estimated during camera calibration.
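For concreteness, the three mappings can be chained in a few lines of Python; this is a minimal sketch, and the calibration values below (rotation, translation, focal length, pixel size, optical centre) are illustrative assumptions, not parameters from the paper:

```python
import numpy as np

def project_to_image(P_w, R_c, T_c, f_c, d_x, d_y, C_x, C_y):
    """Map a 3D world point to image coordinates via Eqs. (1)-(3)."""
    # Eq. (1): world frame -> camera frame
    x_c, y_c, z_c = R_c @ P_w + T_c
    # Eq. (2): perspective projection onto the camera plane
    X_bar, Y_bar = f_c * x_c / z_c, f_c * y_c / z_c
    # Eq. (3): 2D translation and scale change to pixel coordinates
    return C_x + X_bar / d_x, C_y + Y_bar / d_y

# Illustrative calibration values (assumed for the example)
R = np.eye(3)
T = np.array([0.0, 0.0, 0.0])
print(project_to_image(np.array([0.1, 0.05, 2.0]), R, T,
                       f_c=0.012, d_x=1e-5, d_y=1e-5, C_x=180.0, C_y=144.0))
```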

3. Overview of the stereoscopic image sequence analysis and coding scheme

In the proposed model-based stereoscopic image sequence analysis and coding scheme (see Fig. 2), both left and right channels are coded using 3D rigid and non-rigid motion compensation. The approach taken is to define fully 3D models of objects composed of interconnecting wire-mesh triangles. In this way, complete left-to-right object correspondence is intrinsically established.


The left-to-right (LR) and right-to-left (RL) disparity fields are estimated first, using a hierarchical dynamic programming disparity estimation procedure. The consistency of the computed disparity fields is then checked for the 'points of interest', which are provided by the model initialisation procedure. A reliable disparity estimate is obtained for those points of interest with inconsistent left–right disparities. An initial foreground/background segmentation procedure follows, leading to a 3D wireframe model adapted to the foreground object using the reliable disparity estimates.

In order to improve rigid 3D motion estimation, the foreground object is articulated, producing sub-objects defined by the homogeneity of their 3D motion. This rigid 3D motion is estimated using least median of squares minimisation of a cost function taking into account the reliability of the projected rigid 3D motion in both the left and right channels. Neighbourhood constraints are also imposed to improve the reliability of the motion estimation procedure.

Following articulation, the rigid 3D motion of each sub-object produced is estimated using the same motion estimation procedure, this time without imposing neighbourhood constraints. Finally, the non-rigid motion of each node of the wireframe is estimated from the rigid 3D motion of the wireframe triangles containing this node as a vertex.

A block diagram of the proposed encoder is shown in Fig. 3. Its constituent components are described in detail in the ensuing sections.

4. Depth estimation, scene segmentation and model initialisation

4.1. Disparity/depth estimation

Since the stereo camera configuration is known, the depth estimation problem reduces to that of disparity estimation [3,28,30]. A dynamic programming algorithm, minimising a combined cost function for two corresponding lines of the stereoscopic image pair, is used for disparity estimation. The basic algorithm adapts the results of [5,24] using blocks rather than pixels. Furthermore, a novel hierarchical version of this algorithm is implemented so as to speed up its execution. The cost function takes into consideration the displaced frame difference (DFD) as well as the smoothness of the resulting vector field, in the following way.

Due to the epipolar line constraint [3] the search area for each pixel p_r = (i_r, j_r) of the right image is an interval S_{p_r} on the epipolar line in the left image determined by a minimum and maximum allowed disparity. If p_l = (i_l, j_l) ∈ S_{p_r} is the pixel in the left image matching with pixel p_r of the right image and if d_{p_r} is the disparity vector corresponding to this match, the following cumulative cost function is minimised with respect to d_{p_r} for the path ending at the pixel p_r in each line i_r of the right image:

$$C(i_r) = \min_{d_{p_r}} \{ C(i_r - 1) + c(p_r, d_{p_r}) \}. \qquad (4)$$

The cost function c(p_r, d_{p_r}) is determined by

$$c(p_r, d_{p_r}) = R(p_r)\,\mathrm{DFD}(p_r, d_{p_r}) + \mathrm{SMF}(p_r, d_{p_r}). \qquad (5)$$

The first term in Eq. (5) contains the absolute difference of two corresponding image intensity blocks, centered at the working pixels (k, i) and (l, j) in the right and left images, respectively,

$$\mathrm{DFD}(p_r, d_{p_r}) = \sum_{(X,Y)\in W} \| I_r(i_r + X, j_r + Y) - I_l(i_l + X, j_l + Y) \|, \qquad (6)$$

where W is a rectangular window. Multiplication with the reliability function R(d) relaxes the DFD weight, keeping only the second term active in homogeneous regions where the matching reliability is small. The disparity vector is considered reliable whenever it corresponds to a pixel on an edge or in a highly textured area. For the detection of edges and textured areas a variant of the technique in [8] was used, based on the observation that highly textured areas exhibit high local intensity variance in all directions, while on edges the intensity variance is higher across the direction of the edge. The second term in Eq. (5) is the smoothing function,

$$\mathrm{SMF}(d_{p_r}) = \sum_{n=1}^{N} \| d_{p_r} - d_n \|\, R(d_n), \qquad (7)$$

where d_n, n = 1, ..., N, are vectors neighbouring d_{p_r}. Multiplication by the factor R(d_n) aims to attenuate the contribution of unreliable vectors to the smoothing function. Finally, the dynamic programming algorithm selects as the best path up to that stage the one with the minimum cumulative cost (Eq. (4)).

Fig. 3. A block diagram of the proposed encoder.

A hierarchical version of this approach was utilised in order to speed up the estimation process and to produce a smooth disparity field without discontinuities. In this version, the dynamic programming algorithm is applied at the coarse resolution level and an initial estimate for the disparity vectors is produced. The disparity information is then propagated to the next resolution level, where it is corrected so that the cost function is further minimised. This process is iterated until full resolution is achieved.

Along with the dense disparity field, the variance of the disparity estimate for each pixel of the image is also computed, using

$$\sigma_e(p_r, d_{p_r}) = \frac{1}{N^2} \sum_{k=-N}^{N} \sum_{l=-N}^{N} \left( I_r(i_r + k, j_r + l) - I_l(i_l + k, j_l + l) \right)^2, \qquad (8)$$

where (2N+1) × (2N+1) is the dimension of the rectangular window W. Finally, depth is estimated from disparity, using the camera geometry as in [32].
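The following Python sketch illustrates the flavour of the per-scanline dynamic programming matcher of Eqs. (4)–(6). It is a toy version under simplifying assumptions: single-pixel absolute differences and a simple |Δd| transition penalty stand in for the block-based DFD, the reliability weighting R(·) and the hierarchy described above:

```python
import numpy as np

def scanline_disparity(right_line, left_line, d_min, d_max, smooth=0.1):
    """Toy dynamic-programming matcher for one epipolar line pair:
    accumulate a matching + smoothness cost (cf. Eqs. (4)-(5)) and
    backtrack the cheapest disparity path."""
    n = len(right_line)
    disparities = np.arange(d_min, d_max + 1)
    D = len(disparities)
    cost = np.full((n, D), np.inf)
    back = np.zeros((n, D), dtype=int)

    def match_cost(i, d):
        j = i + d
        if 0 <= j < len(left_line):
            return abs(float(right_line[i]) - float(left_line[j]))
        return np.inf  # candidate falls outside the left image

    for k, d in enumerate(disparities):
        cost[0, k] = match_cost(0, d)
    for i in range(1, n):
        for k, d in enumerate(disparities):
            # transition penalty approximates the smoothing term SMF
            trans = cost[i - 1] + smooth * np.abs(disparities - d)
            back[i, k] = int(np.argmin(trans))
            cost[i, k] = trans[back[i, k]] + match_cost(i, d)

    # backtrack the minimum cumulative-cost path
    path = np.zeros(n, dtype=int)
    path[-1] = int(np.argmin(cost[-1]))
    for i in range(n - 2, -1, -1):
        path[i] = back[i + 1, path[i + 1]]
    return disparities[path]
```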

4.2. Foreground/background segmentation

The model is initialised by separating the body in the videoconference scene from the background using an initial foreground/background segmentation procedure. The depth map produced by the method in Section 4.1 applied to the full resolution image may be used for this purpose.

However, to reduce as much as possible the effects of errors in depth estimation, we propose instead the use of a hierarchical foreground/background segmentation, focused on the determination of only the largest disparity vectors. These vectors correspond to foreground objects (objects that lie very close to the camera). This information is propagated to the higher resolution level, where it is corrected in a coarse-to-fine manner. Thus, by carefully selecting the search area of the disparity estimator at each resolution, an initial foreground/background segmentation mask is formed.

The resulting segmentation map is then post-processed using a motion detection mask and the luminance edge information. The motion detection mask is defined by simple subtraction of consecutive frames of the same channel of the image sequence. Note that in this phase the aim is not to calculate motion accurately, but rather to identify regions with very high or very low motion. The motion detection mask contains important information for both inner and boundary areas of the foreground object, while luminance edge information carries important information about errors that occur mainly on the silhouette (border) of the foreground object. The foreground object boundary is found as the part of the image where both the depth gradient and the luminance gradient are high.

Summarising, the following algorithm is used for foreground/background separation, as shown in Fig. 4 (a sketch of the post-processing test follows the list):

• The disparity information at level l of the algorithm is segmented using a histogram-based segmentation algorithm, and areas corresponding to large disparity values are identified as objects close to the camera.
• The segmentation information is propagated to the finer resolution level, where it is corrected appropriately.
• At the full resolution level, the resulting segmentation mask is post-processed using motion and luminance information as follows: each portion of the scene designated as background by the disparity segmentation procedure is reexamined in view of its motion u and its depth and luminance gradients g (Fig. 4). If all these parameters exceed preselected thresholds, this portion of the scene is confirmed as being part of the foreground. Otherwise, it is relegated to the background.
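A minimal sketch of the post-processing test in the last step, assuming the inputs are NumPy arrays; the threshold values and array names are illustrative assumptions:

```python
def refine_background(mask, motion, depth_grad, lum_grad,
                      th_motion=8.0, th_depth=4.0, th_lum=20.0):
    """Post-process an initial foreground mask (True = foreground).
    Background pixels whose motion and depth/luminance gradients all
    exceed preselected thresholds are promoted to the foreground;
    everything else keeps its label. Thresholds are assumed values."""
    promote = ((~mask) & (motion > th_motion)
               & (depth_grad > th_depth) & (lum_grad > th_lum))
    return mask | promote
```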


Fig. 4. The proposed foreground/background estimation scheme.

A 3D wireframe is adapted to the foreground object produced by the described procedure. Then, using the reliable depth estimates as described in the following sections, the final 3D model is created.

4.3. Consistency checking and disparity evaluation for the points of interest

A set F of points of interest is first defined, composed of points in the 3D space with left or right image projections located on depth and luminance edges. The latter are extracted using the edge detection algorithm presented in Section 4.1. For each of these points, the disparity estimation algorithm produces left-to-right (LR) and right-to-left (RL) disparity fields. However, the LR and RL disparity fields may be inconsistent because of occlusions and errors in the disparity estimation procedure. Thus, a consistency checking algorithm is used to indicate the correct matches, followed by an averaging procedure (Kalman estimate) which assigns a depth value to pixels with inconsistent matches.

More specifically, the correspondence between pixels p_r = (i_r, j_r) of the right image and p_l = (i_l, j_l) of the left image is considered consistent if

$$d^{(rl)}(p_r) = -d^{(lr)}(p_r + d^{(rl)}(p_r)).$$

If the above relation is not valid, a more reliable depth estimate must be assigned to that pixel. The method in [9] is applied to this effect, using the reliability of the disparity estimates as a weighting function. Specifically, the disparity d^{(rl)}(p_r) and the disparity d^{(lr)}(p_l) satisfying

$$p_r = p_l + d^{(lr)}(p_l), \qquad (9)$$

are averaged with respect to their disparity error variances as follows:

$$\hat d^{(rl)} = \frac{d^{(rl)}\,\sigma^2_{lr} - d^{(lr)}\,\sigma^2_{rl}}{\sigma^2_{lr} + \sigma^2_{rl}}, \qquad \sigma^2_{\hat d} = \frac{\sigma^2_{lr}\,\sigma^2_{rl}}{\sigma^2_{lr} + \sigma^2_{rl}},$$

where σ²_{lr} and σ²_{rl} are, respectively, the variances of the disparity estimates d^{(lr)} and d^{(rl)}, computed at the disparity/depth estimation stage using Eq. (8), and σ²_{d̂} is the variance of the averaged disparity. If more than one disparity vector d^{(lr)}(p_l) satisfies Eq. (9), the one with the minimum estimation variance σ²_{lr} is selected. The consistency checking algorithm is applied to the set of all points of interest, selected as above so as to have projections on depth and luminance edges, and reliable depth estimates for pixels with either consistent or corrected disparity are obtained to be used for model initialisation. The result of this procedure is a set F of points of interest (x̂_i, ŷ_i, ẑ_i) whose projections are located on the foreground depth map and luminance edges of either the left or right camera, where ẑ_i is their estimated depth.
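The variance-weighted fusion above amounts to a few lines; this sketch assumes the per-pixel variances of Eq. (8) are available:

```python
def fuse_disparities(d_rl, var_lr, d_lr, var_rl):
    """Kalman-style fusion of inconsistent LR/RL disparity estimates,
    following the averaging rule above. Works on scalars or NumPy
    arrays; returns the fused disparity and its variance."""
    d_hat = (d_rl * var_lr - d_lr * var_rl) / (var_lr + var_rl)
    var_hat = (var_lr * var_rl) / (var_lr + var_rl)
    return d_hat, var_hat
```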

5. Initial 3D model adaptation

For the generation of the 3D model object, depth information must be modeled using a wire mesh. We shall generate a surface model of the form [17]

$$z = S(x, y, \hat P), \qquad (10)$$

where P̂ is a set of 3D 'control' points or 'nodes' P̂ = {(x_i, y_i, z_i), i = 1, ..., N} that determine the shape of the surface and (x, y, z) are the coordinates of a 3D point. An initial choice for P̂ is the regular tessellation shown in Fig. 5(a). The consistency checking algorithm, described in the previous section, is applied to all control points to assign corrected depth values to every node of the 3D model. Automatic adaptation of the 3D model (Fig. 5(c) and (d)) to the foreground object is sought by forcing the 3D model to meet the boundary of the foreground/background segmentation mask (Fig. 5(b)).

A set of reference image points G = {(x̃_i, ỹ_i, z̃_i), i = 1, ..., Q} is defined as the aggregate of F and P̂:

$$G = F \cup \hat P, \qquad (11)$$

where F is the set of points of interest defined in the preceding section and P̂ are the nodes of the 3D model with the corrected depth values. Then, S can be modelled by a piecewise linear surface, consisting of adjoint triangular patches, which may be written in the form

$$z = z_1 g_1(x, y) + z_2 g_2(x, y) + z_3 g_3(x, y), \qquad (12)$$

Fig. 5. (a) Initial triangulation of the image plane. (b) Foreground/background segmentation. (c) Part of the initial triangulation corresponding to the foreground object. (d) Expanded wireframe adapted to the foreground object. (e) Barycentric coordinates.

if (x, y, z) is a point on the triangular patch P_1 P_2 P_3 = {(x_1, y_1, z_1), (x_2, y_2, z_2), (x_3, y_3, z_3)}. The functions g_i(x, y) are the barycentric coordinates of (x, y, z) relative to the triangle, and they are given by g_i(x, y) = Area(A_i)/Area(P_1 P_2 P_3) (Fig. 5(e)).

The reconstruction of a surface from consistent sparse depth measurements may be effected by minimising a functional of the form

$$E_0(\hat P) = \sum_{i=1}^{Q} \left( S(\tilde x_i, \tilde y_i, \hat P) - \tilde z_i \right)^2. \qquad (13)$$

The value of the sum (13) expresses confidence in the reference points (x̃_i, ỹ_i, z̃_i) ∈ G, i = 1, ..., Q.

Note that no smoothness constraint is imposed on the surface of the 3D model, since the depth estimates for these points are considered very reliable. Replacing Eq. (12) in Eq. (13) yields

$$E_0(\hat P) = \| A \hat P - B \|^2, \qquad (14)$$

where A is a Q × N matrix and B a Q × 1 vector given by

$$A_{ij} = \begin{cases} g_j(\tilde x_i, \tilde y_i), & \text{if } (\tilde x_i, \tilde y_i) \text{ is inside the triangle } \{(x_j, y_j), (x_k, y_k), (x_l, y_l)\}, \\ 0, & \text{otherwise}, \end{cases} \quad i = 1, \ldots, Q,\ j = 1, \ldots, N,$$

$$B_i = \tilde z_i, \quad i = 1, \ldots, Q,$$

where i indexes the reference points and j the nodes of the wireframe. The vector P̂ minimising Eq. (14) is

$$\hat P = (A^T A)^{-1} A^T B, \qquad (15)$$

which defines the nodes of the wire-mesh surface. Using Eq. (12), the depth z of any point on a patch can be expressed in terms of the depth information of the nodes of the wireframe and the X and Y coordinates of that point. Hence, full depth information will be available if only the depths of the nodes of the wireframe are transmitted.

6. Object articulation

A novel subdivision method based on the rigid 3D motion parameters of each triangle and the error variance of the rigid 3D motion estimation is proposed for the articulation of the foreground object (separation of the head and shoulders).

The model initialisation procedure described above results in a set of interconnected triangles in the 3D space, {T_k, k = 1, ..., K}, where K is the number of triangles of the 3D model. In the following, S^{(i)} will denote an articulation of the 3D model at iteration i of the articulation algorithm, consisting of {s_k^{(i)}, k = 1, ..., M^{(i)}} sub-objects. The proposed iterative object articulation procedure is composed of the following steps:

Step 1. Set i"0. Let an initial segmentation S(0)"Ms(0)k

, k"1,2, KN, with s(0)k"¹

k. Let also the initial

neighbourhood for each triangle to be empty, i.e. ¹S(0)k"M N.

Step 2. Apply the 3D rigid motion estimation algorithm to each triangle ¹k, taking into account the

neighbourhood constraint imposed by the neighbourhood ¹S(i)k. This constraint is described in

detail in the Section 6.1 that follows.Step 3. Set i"i#1. Execute the object segmentation procedure that subdivides the initial object into

M(i) sub-objects, i.e. S(i)"Ms(i)k, k"1,2,M(i)N.

Step 4. Use the segmentation map S(i) to de"ne the new neighbourhood ¹S(i)k

of each triangle ¹k.

Step 5. If S(i)"S(i~1) then stop. Else go to step 2.
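The control flow of Steps 1–5 can be summarised in the following Python skeleton; estimate_motion and segment stand in for the procedures of Sections 6.1 and 6.3 and are assumptions of this sketch, not code from the paper:

```python
def articulate(n_tri, adj, estimate_motion, segment):
    """Skeleton of the iterative articulation loop (Steps 1-5).
    adj[k] is the set of triangles sharing a vertex with triangle k;
    estimate_motion(k, neigh) returns rigid motion parameters for
    triangle k under a neighbourhood constraint, and segment(motions)
    returns a sub-object label per triangle (both assumed callables)."""
    labels = list(range(n_tri))                  # Step 1: one triangle per sub-object
    neigh = [set() for _ in range(n_tri)]        # initial neighbourhoods are empty
    while True:
        motions = [estimate_motion(k, neigh[k])  # Step 2: constrained motion estimation
                   for k in range(n_tri)]
        new_labels = segment(motions)            # Step 3: cluster similar rigid motions
        # Step 4: neighbourhood = adjacent triangles in the same sub-object
        neigh = [{j for j in adj[k] if new_labels[j] == new_labels[k]}
                 for k in range(n_tri)]
        if new_labels == labels:                 # Step 5: converged
            return labels, motions
        labels = new_labels
```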

The proposed algorithm can also be explained by the example of Fig. 6(a)–(c). Fig. 6(a) illustrates the initial phase of the algorithm, where each triangle is treated as an object. The estimated rigid 3D motion vectors of each triangle, computed at the second step of the proposed algorithm, are shown in Fig. 6(b), and the output of the object segmentation procedure of Step 3 is shown in Fig. 6(c). Based on this new object segmentation map, the rigid 3D motion estimation and object segmentation procedures are then further refined iteratively.

Fig. 6. (a) Initial phase of the object articulation algorithm. (b) The output of the rigid 3D motion estimation procedure for each triangle. (c) The output of the object segmentation procedure. (d) Non-rigid 3D motion estimation example. The light grey vector represents the rigid motion of the working node, while the black vectors represent estimates for the motion of the same node using the 3D motion parameters corresponding to each triangle containing the working node.

The 3D motion estimation of each triangle and the object segmentation procedure are described in more detail below.

6.1. Rigid 3D motion estimation of small surface patches

The foreground object of a typical videophone scene is composed of more than one sub-object (head, neck, shoulders, etc.), each of which exhibits different rigid 3D motion. Thus, object articulation has to be completed and the rigid motion of each sub-object must be estimated.

For rigid 3D motion estimation of each triangle T_k we use least median of squares minimisation. This procedure removes the outliers from the initial data set and finds the estimate that minimises the median of the square error. More specifically, the rigid motion of each triangle T_k, k = 1, ..., K, where K is the number of triangles in the foreground object, is modeled using a linear 3D model, with three rotation and three translation parameters [1]:

$$\begin{bmatrix} x(t+1) \\ y(t+1) \\ z(t+1) \end{bmatrix} = \begin{bmatrix} 1 & -w_z^{(k)} & w_y^{(k)} \\ w_z^{(k)} & 1 & -w_x^{(k)} \\ -w_y^{(k)} & w_x^{(k)} & 1 \end{bmatrix} \begin{bmatrix} x(t) \\ y(t) \\ z(t) \end{bmatrix} + \begin{bmatrix} t_x^{(k)} \\ t_y^{(k)} \\ t_z^{(k)} \end{bmatrix}, \qquad (16)$$

where (x(t), y(t), z(t)) is a point on the plane defined by the coordinates of the vertices of triangle T_k.
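Applying the linearized model of Eq. (16) to a point is immediate; a small helper as a sketch, with w and t holding the six parameters:

```python
import numpy as np

def rigid_step(p, w, t):
    """Apply Eq. (16): p(t+1) = R p(t) + T, with R built from the
    small rotation angles w = (w_x, w_y, w_z) and translation t."""
    wx, wy, wz = w
    R = np.array([[1.0, -wz,  wy],
                  [ wz, 1.0, -wx],
                  [-wy,  wx, 1.0]])
    return R @ np.asarray(p) + np.asarray(t)
```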

Since the triangle motion is to be used for object articulation, neighbourhood constraints are needed for the estimation of the model parameter vector a^{(k)} = (w_x^{(k)}, w_y^{(k)}, w_z^{(k)}, t_x^{(k)}, t_y^{(k)}, t_z^{(k)}), in order to guarantee a smooth estimated triangle motion vector field that can be successfully segmented.

Let N_k be the ensemble of the triangles neighbouring the triangle T_k. If triangle T_k belongs to the region s_l^{(i)} of S^{(i)} at iteration i of the object articulation algorithm, we define as neighbourhood TS_k^{(i)} of each triangle T_k the set of triangles T_j in

$$TS_k^{(i)} = \{T_j \in N_k\} \cap \{T_j \in s_l^{(i)}\}.$$

For example, in order to define the neighbourhood of triangle A in Fig. 6(c) we first consider all triangles that share at least one common vertex with triangle A (i.e. N_k = {B, C, D, E, H, I, J, K}). From the set N_k, only the triangles belonging to the same object as triangle A are finally defined as the neighbourhood of triangle A (i.e. TS_k^{(i)} = {B, C, D, E}). Then for each triangle T_k the set of points belonging to TS_k^{(i)} is input to the 3D rigid motion estimation procedure so as to smooth the motion field produced.

6.2. The 3D motion estimation algorithm

For the estimation of the model parameter vector a^{(k)} = (w_x^{(k)}, w_y^{(k)}, w_z^{(k)}, t_x^{(k)}, t_y^{(k)}, t_z^{(k)}) for each neighbourhood TS_k^{(i)} at iteration i of the object articulation procedure, the MLMS iterative algorithm [27] was used. The MLMS algorithm is based on median filtering and is very efficient in suppressing noise with a large amount of outliers (i.e. in situations where conventional least-squares techniques usually fail).

As noted in the previous sections, the 3D motion of each extended neighbourhood TS_k^{(i)} of a triangle T_k is modelled in the global coordinate system by

$$P(t+1) = R_m^{(k)} P(t) + T_m^{(k)}, \qquad (17)$$


where the matrix R_m^{(k)} and the vector T_m^{(k)} are defined from Eq. (16). Since initial motion estimates are available in the left and right camera images, the rigid 3D motion must be projected on the left and right coordinate systems. Using Eqs. (17) and (1),

$$P_c(t+1) = R_{mc}^{(k)} P_c(t) + T_{mc}^{(k)}, \qquad (18)$$

where

$$R_{mc}^{(k)} = R_c R_m^{(k)} R_c^T \qquad (19)$$

and

$$T_{mc}^{(k)} = -R_c R_m^{(k)} R_c^T T_c + R_c T_m^{(k)} + T_c, \qquad (20)$$

where R_{mc}^{(k)} and T_{mc}^{(k)} are the 3D motion rotation and translation matrices corresponding to camera c and triangle k. Using the fact that the matrices R_{mc}^{(k)} and T_{mc}^{(k)} are of the form

$$R_{mc}^{(k)} = \begin{bmatrix} 1 & -w_{z_c}^{(k)} & w_{y_c}^{(k)} \\ w_{z_c}^{(k)} & 1 & -w_{x_c}^{(k)} \\ -w_{y_c}^{(k)} & w_{x_c}^{(k)} & 1 \end{bmatrix}, \qquad T_{mc}^{(k)} = \begin{bmatrix} t_{x_c}^{(k)} \\ t_{y_c}^{(k)} \\ t_{z_c}^{(k)} \end{bmatrix}, \qquad (21)$$

and also using Eqs. (2) and (3), the projected 2D motion vector in camera c, d_c(X, Y), is given by

$$d_{x_c}(X(t), Y(t)) = f\,\frac{-w_{x_c}^{(k)} x_c(t) y_c(t) + w_{y_c}^{(k)} (x_c(t)^2 + z_c(t)^2) - w_{z_c}^{(k)} y_c(t) z_c(t) + t_{x_c}^{(k)} z_c(t) - t_{z_c}^{(k)} x_c(t)}{\left(-w_{y_c}^{(k)} x_c(t) + w_{x_c}^{(k)} y_c(t) + z_c(t) + t_{z_c}^{(k)}\right) z_c(t)\, d_x}, \qquad (22)$$

$$d_{y_c}(X(t), Y(t)) = f\,\frac{w_{x_c}^{(k)} (y_c(t)^2 + z_c(t)^2) - w_{y_c}^{(k)} x_c(t) y_c(t) - w_{z_c}^{(k)} x_c(t) z_c(t) - t_{y_c}^{(k)} z_c(t) + t_{z_c}^{(k)} y_c(t)}{\left(-w_{y_c}^{(k)} x_c(t) + w_{x_c}^{(k)} y_c(t) + z_c(t) + t_{z_c}^{(k)}\right) z_c(t)\, d_y}, \qquad (23)$$

where d_c(X, Y) = (d_{x_c}(X(t), Y(t)), d_{y_c}(X(t), Y(t))).

Using the initially estimated 2D motion vectors corresponding to the left and right cameras and Eqs. (22) and (23), along with Eqs. (19) and (20) evaluated for c = l and c = r, a linear system for the global motion parameter vector a^{(k)} of triangle T_k is formed. Note that the parameters of a^{(k)} are implicitly contained in Eqs. (22) and (23), since a_c^{(k)} = (w_{x_c}^{(k)}, w_{y_c}^{(k)}, w_{z_c}^{(k)}, t_{x_c}^{(k)}, t_{y_c}^{(k)}, t_{z_c}^{(k)}) and a^{(k)} are related by Eqs. (19) and (20). This is a system of 2(L_l + L_r) equations with six unknowns, where L_l and L_r are the numbers of reference points of the set G of Eq. (11) contained in the plane defined by the coordinates of the vertices of triangle k, in the left and right image planes, respectively. If L_l + L_r ≥ 2 this is overdetermined and can be solved using least-squares methods or, alternately, by the robust least median of squares motion estimation algorithm described in detail in [27]. The reference points initially chosen should be enough to guarantee L_l + L_r ≥ 2 for each triangle. As explained in Section 5, this is ensured by choosing in Eq. (11) as reference points all triangle vertices plus the points of interest on depth and luminance edges.
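As an illustration of the robust alternative to least squares, the following is a generic random-sampling least-median-of-squares solver for an overdetermined system A a = b (rows: motion-vector equations; unknowns: the six parameters). It is a standard LMedS sketch, not the specific MLMS algorithm of [27]:

```python
import numpy as np

def lmeds_fit(A, b, n_trials=200, seed=0):
    """Least-median-of-squares fit: repeatedly solve minimal subsets
    and keep the solution whose squared residuals have the smallest
    median, which suppresses outlying equations."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    best, best_med = None, np.inf
    for _ in range(n_trials):
        idx = rng.choice(m, size=n, replace=False)  # minimal subset
        try:
            a = np.linalg.solve(A[idx], b[idx])
        except np.linalg.LinAlgError:
            continue                                 # degenerate subset
        med = np.median((A @ a - b) ** 2)            # median of squared residuals
        if med < best_med:
            best, best_med = a, med
    return best
```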

6.3. Object segmentation

At each iteration of the object articulation method, the rigidity constraint imposed on each rigid object component is exploited. This constraint requires that the distance between any pair of points of a rigid object component must remain constant at all times and configurations. Thus, the motion of a rigid model object component represented by a mesh of triangles can be completely described by using the same 6 motion parameters. Therefore, to achieve object articulation, neighbouring triangles which exhibit similar 3D motion parameters are clustered into patches. In an ideal case, these patches will represent the complete visible surface of the moving object components of the articulated object.


More speci"cally, the following iterative algorithm is proposed:

Step 1. Set j"0. Set M(i)0"K. Set S(i)

j"S(0).

Step 2. For each patch s(i)k

for k"1,2,M(i)j

execute the following clustering algorithm.Step 3. Set s@(i)

k"s(i)

kXs(i)

l. For all the patches s(i)

lthat belong to the neighbourhood of s(i)

k, if

Kp2s l(i)

a(sk(i))#p2sk(i)

a(s l(i))

p2sk(i)#p2

s l(i)!a(sk{(i))K)th,

cluster s(i)l

to s(i)k

and set s(i)k"s@(i)

kand M(i)

j"M(i)

j!1.

In the above a(sm(i)), m"k, l are the motion parameters, p2sm(i)

, m"k, l is the variance of the 3D motionestimate, i.e. the Displaced Frame Di!erence (DFD) of patch s(i)

mcomputed by compensating the projected 3D

motion in the left and right cameras and th is a threshold. Also,

p2sm(i)"

1

N2sm(i)

+P(t)|sm(i)

(I(t)-(P(t))!I(t`1)

-(P(t#1)))2#

1

N2sm(i)

+P(t)|sm(i)

(I(t)3(P(t))!I(t`1)

3(P(t#1)))2,

where Nsm(i)

is the number of points contained in patch s(i)m

and P(t) and P(t#1) are two corresponding pointsin time instants t and t#1, respectively.

Step 4. Set j"j#1 and M(i)j"M(i)

j~1. Set S(i)

j"Ms(i)

k, k"1,2,M(i)

jN. If S(i)

j"S(i)

j~1stop. Else go to step 2.
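One pass of the merging test of Step 3 might look as follows. For brevity this sketch compares the inverse-variance-weighted fusion against the current parameters a(s_k) rather than re-estimating a(s'_k) for the merged patch, so it is a simplification of the criterion above:

```python
import numpy as np

def merge_step(params, var, neighbours, th):
    """One clustering pass over the patches: fuse the 6-parameter
    motion vectors of adjacent patches by inverse-variance weighting
    and merge their labels when the fused vector stays within th of
    the current one. params[k] is a length-6 array, var[k] a scalar
    DFD-based variance, neighbours[k] a set of adjacent patch ids."""
    labels = list(range(len(params)))
    for k in range(len(params)):
        for l in neighbours[k]:
            fused = (var[l] * params[k] + var[k] * params[l]) / (var[k] + var[l])
            if np.linalg.norm(fused - params[k]) <= th:
                labels[l] = labels[k]   # cluster s_l into s_k
    return labels
```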

7. 3D motion estimation of each sub-object

7.1. Rigid 3D motion estimation of each sub-object

The object articulation procedure identifies a number of sub-objects of the 3D model object, as areas with homogeneous motion. A sub-object s_k represents a surface patch of the 3D model object consisting of N_{s_k} control points and q(s_k) triangles. A sub-object may consist of q(s_k) = 1 triangle only. The motion of an arbitrary point P(t) on the sub-object s_k to its new position P(t+1) is described by

$$P(t+1) = R^{(s_k)} P(t) + T^{(s_k)}, \qquad (24)$$

where k = 1, ..., M, and M is the number of sub-objects, where as before [1]:

$$R^{(s_k)} = \begin{bmatrix} 1 & -w_z^{(s_k)} & w_y^{(s_k)} \\ w_z^{(s_k)} & 1 & -w_x^{(s_k)} \\ -w_y^{(s_k)} & w_x^{(s_k)} & 1 \end{bmatrix}, \qquad T^{(s_k)} = \begin{bmatrix} t_x^{(s_k)} \\ t_y^{(s_k)} \\ t_z^{(s_k)} \end{bmatrix}.$$

For the estimation of the model parameter vector a^{(s_k)} = (w_x^{(s_k)}, w_y^{(s_k)}, w_z^{(s_k)}, t_x^{(s_k)}, t_y^{(s_k)}, t_z^{(s_k)}), the MLMS iterative algorithm described earlier is used, this time without imposing neighbourhood constraints.

7.2. Non-rigid 3D motion estimation

The rigid motion of the articulated objects cannot compensate errors occurring due to local motion (such as movement of the eyes and lips). These errors can only be compensated by deforming the nodes of the wireframe appropriately, in order to also follow the local motion. An analysis-by-synthesis approach is proposed for the computation of the non-rigid motion Δ̃_i at node i, which minimises the DFD between the image frame at time t+1 and the 3D non-rigid motion compensated estimate of frame t+1 from frame t.


More speci"cally, the 3D motion ri, i"1,2,N, of the wire-mesh nodes is governed by (24). Alternative

estimates of the motion of the same node are provided by applying to (16) the 3D motion parametersoriginally estimated for a triangle containing node i (see Fig. 6(d)). Since the motion of each triangle re#ectsboth global rigid motion and local deformations, the di!erence of these two estimates of the motion of eachnode may be assumed to approximate the non-rigid motion component of the node. If r

iis the rigid 3D

motion of node i and *(k)i

, k"1,2, Ni, are the estimates for the motion of node i produced by rotating and

translating this node with the rotation and translation parameters corresponding to each triangle¹

kcontaining node i, we de"ne as candidates for the minimisation of the DFD of the reconstruction error

*J (k)i"*(k)

i!r

i, k"1,2,N

i, (25)

where Niis the number of neighbourhood triangles of node i. The "nal non-rigid motion vector *J

iis chosen to

be

*Ji"arg min

k|1,2,Ni

(DFD-(*I (k)

i)#DFD

3(*J (k)

i)),

where

DFDc(*J (k)

i)"

1

NRi

+Ri

(I(t)c(P(t))!I(t`1)

c(P(t)#r

i#*J (k)

i))2.

In the above equation, P(t) are the 3D coordinates of node i at time instance t and P(t)#ri#*J (k)

iare the

corresponding corrected coordinates at time instance t#1 corresponding to the *J (k)i

non-rigid motion vector.The intensities I at time instances t and t#1 are calculated for cameras c"l, r over a region R

ide"ned as the

aggregate of the planes of all triangles containing node i, and NRi

is the number of points contained in region Ri.
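The candidate selection can be written as a short loop; in this sketch, proposals holds the per-triangle motion estimates Δ_i^{(k)} and dfd is assumed to evaluate DFD_l + DFD_r for the compensated node position (both names are assumptions):

```python
import numpy as np

def nonrigid_correction(r_i, proposals, dfd):
    """Analysis-by-synthesis choice of the non-rigid vector for one
    node: each incident triangle proposes a node motion; the candidate
    Delta = proposal - r_i (Eq. (25)) with the smallest summed
    left+right DFD wins."""
    best, best_err = None, float("inf")
    for prop in proposals:
        delta = np.asarray(prop) - np.asarray(r_i)  # Eq. (25)
        err = dfd(delta)
        if err < best_err:
            best, best_err = delta, err
    return best
```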

8. Experimental results

The proposed model-based analysis and coding method was evaluated for the right and left channels of a stereoscopic image sequence. The first frame of both channels is transmitted using intra-frame coding techniques, as in H.263 [20]. The performance of the proposed methods was investigated in application to the compression of the interlaced stereoscopic videoconference sequence 'Claude' of size 360×288.²

² This sequence was prepared by THOMPSON BROADCASTING SYSTEMS for use in the DISTIMA RACE project.

The hierarchical dynamic programming procedure for matching across the epipolar line, described in Section 4.1, with 2 levels of hierarchy, was used for LR and RL disparity/depth estimation. The search area for disparity was chosen to be ±62 and ±2 half pixels for the x and y coordinates, respectively. Fig. 7(b) and (d) show the computed left and right channel depth maps using the hierarchical dynamic programming approach. The depth map has the same resolution as the original image (since it is computed from a dense disparity field). Depth information is quantised to 256 levels. Darker areas represent objects closer to the cameras. The smoothing properties of the dynamic programming method are seen to result in more realistic depth-map estimates.

Foreground/background separation is performed next, using the coarse-to-fine technique described in Section 4.2. The motion detection mask along with the luminance edge information are then used to improve the results of the initial segmentation. The resulting foreground/background mask of 'Claude' is shown in Fig. 5(b).


Fig. 7. (a) Original left channel image 'Claude' (frame 2). (b) Corresponding depth map estimated using dynamic programming. (c) Original right channel image 'Claude' (frame 2). (d) Corresponding depth map estimated using dynamic programming.

The LR and RL disparity estimates are then subjected to the consistency checking procedure, and inconsistent matches are corrected by reliably fusing LR and RL information as described in Section 4.3. On the basis of the consistent and the corrected depth information at all reference points, the wireframe model is adapted to the foreground object (Fig. 5(c) and (d)).

Fig. 8. (a) Visualization of the rotation parameters of the rigid 3D motion for each triangle of the 3D model of 'Claude'. (b) Visualization of the translation parameters of the rigid 3D motion for each triangle of the 3D model of 'Claude'.

The rigid 3D motion of each triangle of the foreground object is computed next, using the technique described in Section 6.1. The output of the proposed local 3D motion estimator was a set of 3D motion parameters assigned to each triangle of the wireframe 3D model. In order to show the resulting local 3D motion we have produced a visualization of the rotation and translation parameters of the homogeneous motion matrix. For the rotation parameters, the direction of the vector assigned to each triangle shows the rotation axis, and the size of the vector as well as the color of the triangle show the magnitude of the angle of the rotation. For the translation parameters, the direction of the vector at each triangle shows the direction of the local 3D motion, while the size of the vector as well as the color of the triangle show the magnitude of translation. The visualisation of the rotation and the translation parameters of the rigid 3D motion of each wireframe triangle for 'Claude' are shown in Fig. 8(a) and (b), respectively. As demonstrated there, the head and the shoulders undergo different motion. The resulting articulation of the foreground object achieved by the methods of Section 6 is shown in Fig. 9(a). The accuracy of this object articulation is remarkable and is due to the fact that the foreground/background segmentation and the object articulation procedures are defined in the 3D space in terms of triangle rather than pixel motion. In this way, complete correspondence is achieved between objects in the right and left channel images.

Following object articulation, the algorithm presented in Section 7.1 is used for rigid 3D motion estimation. The computed motion parameter vectors, between frames 1 and 2, for the head and shoulders sub-objects are shown in Fig. 9(b). As seen, the 3D motion of the 'shoulders' sub-object is negligible, while the 3D motion of the 'head' has significant rotation and translation parameters (this can also be observed by examining the original frames 1 and 2 of 'Claude'). The performance of the algorithm in terms of PSNR is evaluated in Tables 1 and 2, where the quality of the reconstruction of the whole image as well as of only the 'head' or 'shoulders' sub-objects is presented.

Fig. 10(a) and (c) show the reconstructed left and right images, respectively, using rigid 3D motion compensation, while Fig. 10(b) and (d) show the corresponding prediction errors. The performance of the algorithm in terms of PSNR is shown in Tables 1 and 2, where the quality of the reconstruction of the whole image as well as of only the 'head' or 'shoulders' sub-objects is presented. As seen, rigid 3D motion is not sufficient for very accurate reconstruction of the 'head' sub-object, and thus non-rigid 3D motion must be used to improve the performance of the algorithm.

The analysis-by-synthesis technique presented in Section 7.2 is then used for non-rigid 3D motion estimation. The quality of the reconstruction in terms of PSNR is described in Tables 1 and 2, where an improvement of about 1 dB is seen to be achieved by non-rigid 3D motion compensation. Fig. 11(a) and (c) show details of the reconstruction error using 3D rigid motion compensation of the left and right images, respectively, while Fig. 11(b) and (d) show the corresponding prediction errors using 3D non-rigid motion compensation.

Fig. 9. (a) Articulation of the foreground object. (b) The 3D motion parameter vectors corresponding to the head and shoulders sub-objects.

Table 1
PSNR (dB) of the reconstruction of the left channel frame 2 of 'Claude' using rigid and non-rigid 3D motion compensation

Object        Rigid       Non-rigid
Whole image   36.664870   37.664178
Head          33.947095   35.675244
Shoulders     39.055562   39.084458

Table 2
PSNR (dB) of the reconstruction of the right channel frame 2 of 'Claude' using rigid and non-rigid 3D motion compensation

Object        Rigid       Non-rigid
Whole image   36.138653   36.890786
Head          33.977305   35.419320
Shoulders     37.892319   37.904183

Fig. 10. (a) Reconstructed left channel of 'Claude' using rigid 3D motion compensation. (b) The corresponding prediction error. (c) Reconstructed right channel of 'Claude' using rigid 3D motion compensation. (d) The corresponding prediction error.

Fig. 11. (a) Detail of the reconstruction error of the left channel of 'Claude' using rigid 3D motion compensation. (b) The corresponding prediction error using non-rigid 3D motion compensation. (c) Detail of the reconstruction error of the right channel of 'Claude' using rigid 3D motion compensation. (d) The corresponding prediction error using non-rigid 3D motion compensation.

Fig. 12. PSNR of each frame of the left channel of the proposed algorithm, compared with the block matching scheme with a block size of 16×16 pixels.

Fig. 13. PSNR of each frame of the right channel of the proposed algorithm, compared with the block matching scheme with a block size of 16×16 pixels.

The proposed algorithm was also tested for the coding of a sequence of frames at 10 frames/s. The model adaptation, depth estimation and object articulation procedures were applied only at the beginning of each group of frames. Each group of frames consists of 10 frames. The first frame of each group of frames was transmitted using intra-frame coding techniques. In the intermediate frames the model and articulation information are self-adapted using the rigid and non-rigid 3D motion information. The only parameters that need to be transmitted are the six parameters of the rigid 3D motion and the 3D non-rigid motion vector for each node of the wireframe. The methodology developed in this paper allows both left and right images to be reconstructed using the same 3D rigid motion vectors, thus achieving considerable bit-rate savings. The coding algorithm requires a bit-rate of 24.4 kbps and produces better image quality compared to a correspondingly simple block matching motion estimation algorithm [23], as shown in Figs. 12 and 13. The simple block matching approach is identical to that used in H.263 and consists of exhaustive minimization of the absolute displaced frame difference within a search area of −15, ..., 15 half-pixels in the previous frame, centered at the position of the examined block. In both coders, only the first frame of each group of frames was transmitted using intra-frame coding. It was also assumed that each frame was predicted using the reconstructed previous frame, and that the prediction error was not transmitted. The bit-rate required by this scheme with a 16×16 block size was 24.5 kbps.

9. Conclusions

In this paper we addressed the problem of rigid and non-rigid 3D motion estimation for model-based stereo videoconference image sequence analysis and coding. On the basis of foreground/background segmentation using motion, depth and luminance information, the model was initialised by automatically adapting a wireframe model to the consistent depth information. Object articulation was then performed based on the rigid 3D motion of small surface patches. Spatial constraints were imposed to increase the reliability of the obtained 3D motion estimates for each triangle patch. A novel iterative classification technique was then used to obtain an articulated description of the scene (head, neck, shoulders). Finally, flexible motion of the nodes of the wireframe was estimated using a novel technique based on the rigid 3D motions of the triangles containing the specific node.

The results of the algorithm can be used in a series of applications. For Video Production and Computer Graphics applications, the 3D motion of a specific scene could be used to produce a scene with similar motion but different texture, as when producing a video with a model mimicking the motion of an actor. The method can also have useful applications in Image Analysis, since an analytic representation of the motion of the object is given (either at triangle or at wireframe node level) that can be used for the segmentation or articulation of the object into uniformly moving rigid components. Finally, the method was experimentally shown to be efficient for very low bit-rate coding of stereoscopic image sequences.


References

[1] G. Adiv, Determining three-dimensional motion and structure from optical flow generated by several moving objects, IEEE Trans. Pattern Analysis and Machine Intelligence 7 (July 1985) 384–401.
[2] K. Aizawa, H. Harashima, T. Saito, Model-based analysis synthesis image coding (MBASIC) system for a person's face, Signal Processing: Image Communication 1 (October 1989) 139–152.
[3] S. Barnard, W. Thompson, Disparity analysis of images, IEEE Trans. Pattern Anal. Mach. Intell. 2 (July 1980) 333–340.
[4] G. Bozdagi, A.M. Tekalp, L. Onural, 3-D motion estimation and wireframe adaptation including photometric effects for model-based coding of facial image sequences, IEEE Trans. Circuits Systems Video Technol. (June 1994) 246–256.
[5] I.J. Cox, S. Hingorani, B.M. Maggs, S.B. Rao, Stereo without regularization, Tech. Rep., NEC Research Institute, Princeton, USA, October 1992.
[6] I. Cox, S. Hingorani, S. Rao, B. Maggs, A maximum likelihood stereo algorithm, Comput. Vision Graphics Image Process. (1995), to appear.
[7] J.L. Dugelay, D. Pele, Motion disparity analysis of a stereoscopic sequence. Application to 3DTV coding, EUSIPCO '92, October 1992, pp. 1295–1298.
[8] O. Egger, W. Li, M. Kunt, High compression image coding using an adaptive morphological subband decomposition, Proc. IEEE 83 (February 1995) 272–287.
[9] L. Falkenhagen, 3D object-based depth estimation from stereoscopic image sequences, in: Proc. Internat. Workshop on Stereoscopic and 3D Imaging '95, Santorini, Greece, September 1995, pp. 81–86.
[10] N. Grammalidis, S. Malassiotis, D. Tzovaras, M.G. Strintzis, Stereo image sequence coding based on three-dimensional motion estimation and compensation, Signal Processing: Image Communication 7 (August 1995) 129–145.
[11] M. Hötter, Object-oriented analysis–synthesis coding based on moving two-dimensional objects, Signal Processing: Image Communication 2 (December 1990) 409–428.
[12] M. Hötter, Optimization and efficiency of an object-oriented analysis–synthesis coder, Signal Processing: Image Communication 4 (April 1994) 181–194.
[13] E. Izquierdo, M. Ernst, Motion/disparity analysis for 3D-video-conference applications, in: M.G. Strintzis et al. (Eds.), Proc. Internat. Workshop on Stereoscopic and 3D Imaging, Santorini, Greece, September 1995, pp. 180–186.
[14] R. Koch, Dynamic 3D scene analysis through synthesis feedback control, IEEE Trans. Pattern Anal. Mach. Intell. 15 (June 1993) 556–568.
[15] H. Li, A. Lundmark, R. Forchheimer, Image sequence coding at very low bitrates: a review, IEEE Trans. Image Process. 3 (September 1995) 589–609.
[16] J. Liu, R. Skerjanc, Stereo and motion correspondence in a sequence of stereo images, Signal Processing: Image Communication 5 (October 1993) 305–318.
[17] S. Malassiotis, M.G. Strintzis, Optimal 3D mesh object modeling for depth estimation from stereo images, in: Proc. 4th European Workshop on 3D Television, Rome, October 1993.
[18] G. Martinez, Shape estimation of moving articulated 3D objects for object-based analysis–synthesis coding (OBASC), in: Internat. Workshop on Coding Techniques for Very Low Bit-rate Video, Tokyo, Japan, November 1985.
[19] G. Martínez, Object articulation for model-based facial image coding, Signal Processing: Image Communication (September 1996).
[20] MPEG-2, Generic coding of moving pictures and associated audio information, Tech. Rep. ISO/IEC 13818, 1996.
[21] H.G. Musmann, M. Hötter, J. Ostermann, Object-oriented analysis–synthesis coding of moving images, Signal Processing: Image Communication 1 (October 1989) 117–138.
[22] H.G. Musmann, P. Pirsch, H.J. Grallert, Advances in picture coding, Proc. IEEE 73 (April 1985) 523–548.
[23] A.N. Netravali, B.G. Haskell, Digital Pictures: Representation and Compression, Plenum Press, New York and London, 1988.
[24] S. Panis, M. Ziegler, Object based coding using motion stereo information, in: Proc. Picture Coding Symposium (PCS '94), Sacramento, California, September 1994, pp. 308–312.
[25] D.V. Papadimitriou, Stereo in model-based image coding, in: Internat. Workshop on Coding Techniques for Very Low Bit-rate Video (VLBV '94), Colchester, April 1994, p. 3.7.
[26] L. Robert, R. Deriche, Dense depth map reconstruction using a multiscale regularization approach with discontinuities preserving, in: M.G. Strintzis et al. (Eds.), Proc. Internat. Workshop on Stereoscopic and 3D Imaging, Santorini, Greece, September 1995, pp. 32–39.
[27] S.S. Sinha, B.G. Schunck, A two-stage algorithm for discontinuity-preserving surface reconstruction, IEEE Trans. Pattern Anal. Mach. Intell. 14 (January 1992).
[28] A. Tamtaoui, C. Labit, Constrained disparity and motion estimators for 3DTV image sequence coding, Signal Processing: Image Communication 4 (November 1991) 45–54.
[29] A. Tamtaoui, C. Labit, Symmetrical stereo matching for 3DTV sequence coding, in: Picture Coding Symp. PCS '93, March 1993.
[30] D. Tzovaras, N. Grammalidis, M.G. Strintzis, Depth map coding for stereo and multiview image sequence transmission, in: Internat. Workshop on Stereoscopic and 3D Imaging (IWS3DI '95), Santorini, Greece, September 1995, pp. 75–80.
[31] D. Tzovaras, N. Grammalidis, M.G. Strintzis, 3-D motion/disparity segmentation for object-based image sequence coding, Optical Engineering, special issue on Visual Communications and Image Processing 35 (January 1996) 137–145.
[32] D. Tzovaras, N. Grammalidis, M.G. Strintzis, Object-based coding of stereo image sequences using joint 3-D motion/disparity compensation, IEEE Trans. Circuits Systems Video Technol. 7 (April 1997).
[33] D. Tzovaras, M.G. Strintzis, H. Sahinoglou, Evaluation of multiresolution block matching techniques for motion and disparity estimation, Signal Processing: Image Communication 6 (March 1994) 59–67.
[34] M. Ziegler, Digital stereoscopic imaging and application, a way towards new dimensions, the RACE II project DISTIMA, in: IEE Colloq. on Stereoscopic Television, London, 1992.
