MRF–MAP–MFT visual object segmentation based on motion boundary field
Pattern Recognition Letters 24 (2003) 3125–3139
Jie Wei *, Izidor Gertner
Department of Computer Science, The City College of the City University of New York,
Convent Avenue at 138th Street, New York, NY 10031, USA
Received 31 March 2003; received in revised form 16 July 2003
Abstract
In our earlier work, a two-pass motion estimation algorithm (TPA) was developed to estimate a motion field for two
adjacent frames in an image sequence where contextual constraints are handled by several Markov random fields
(MRFs) and the maximum a posteriori (MAP) configuration is taken to be the resulting motion field. In order to
provide a trade-off between efficiency and effectiveness, the mean field theory (MFT) was selected to carry out the
optimization process to locate the MAP with desirable performance. Given that currently in the disciplines of digital
library [IEEE Trans. PAMI 18 (8) (1996); IEEE Trans. Image Process. 11 (8) (2002) 912] and video processing [IEEE
Trans. Circ. Sys. Video Tech. 7 (1) (1997)] of utmost interest are the extraction and representation of visual objects,
instead of estimating motion field, in this paper we focus on segmenting out visual objects based on spatial and
temporal properties present in two contiguous frames in the same MRF–MAP–MFT framework. To achieve object
segmentation, a ‘‘motion boundary field’’ is introduced which can turn off interactions between different object regions
and in the meantime remove spurious object boundaries. Furthermore, in light of the generally smooth and slow
velocities in-between two contiguous frames, we discover that in the process of calculating matching blocks, assigning
different weights to different locations can result in better object segmentation. Experimental results conducted on both
synthetic and real-world videos demonstrate encouraging performance.
© 2003 Elsevier B.V. All rights reserved.
Keywords: Markov random field; Visual motion; Segmentation; Mean field theory
1. Introduction
In video compression schemes, apart from
taking advantage of spatial redundancies among
adjacent pixels, temporal redundancies between
* Corresponding author. Tel.: +1-212-650-5604; fax: +1-212-
650-6284.
E-mail address: [email protected] (J. Wei).
doi:10.1016/S0167-8655(03)00180-6
neighboring frames are exploited extensively to reduce the bandwidth required to save or transmit
video data. For each macro-block in the target
frame, a macro-block of the same size in the vi-
cinity of the corresponding position of the refer-
ence frame is located based on a certain energy
minimization criterion, e.g. minimal mean square
error (MMSE). A vector indicating the difference
between these two positions is saved in the motion
map as the corresponding motion vector or velocity,
whereas the pixelwise differences between these two
pixels/macro-blocks are stored in the residual map.
The coding result for the target frame thus en-
compasses these two maps. The process of search-
ing for the matching block of the MMSE is denoted by motion estimation and is understandably the
most time-consuming phase within this framework.
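The search loop just described can be sketched as follows; the block size, search radius, and frame layout are illustrative assumptions rather than the parameters of any particular codec:

```python
import numpy as np

def block_match(ref, tgt, top, left, size=8, radius=4):
    """Exhaustively search `ref` around (top, left) for the block that
    best matches the size x size macro-block of `tgt` at (top, left),
    using the minimal mean square error (MMSE) criterion."""
    block = tgt[top:top + size, left:left + size].astype(float)
    best_err, best_vec = np.inf, (0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + size > ref.shape[0] or x + size > ref.shape[1]:
                continue  # candidate block falls outside the reference frame
            cand = ref[y:y + size, x:x + size].astype(float)
            err = np.mean((block - cand) ** 2)  # mean square error
            if err < best_err:
                best_err, best_vec = err, (dy, dx)
    return best_vec, best_err  # motion vector and residual energy
```

For a target frame that is a pure translation of the reference, this routine recovers the shift exactly; the returned vector goes into the motion map and the residual energy feeds the residual map.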
Most existing video compression schemes work
along this track, such as H.263, MPEG-1. There
are basically two avenues to enhance the perfor-
mance of this scheme: one is to make the process of
motion estimation more efficient, denoted by effi-
ciency-oriented schemes, e.g., the three-step algorithm
(Koga et al., 1993), conjugate searching
(Srinivasan and Rao, 1985), and wavelet-based esti-
mation (Zhang and Zafar, 1992). The other is to
generate a motion map which provides lower en-
tropy for both encoding maps, denoted by con-
textual motion schemes. Most methods proposed so
far can be classified into three categories: (1)
Parametric model based techniques are employed to retrieve objects in motion (Wang and Adelson,
1994; Hager and Belhumeur, 1998; Borshukov
et al., 1997), where object shapes are extracted by use
of affine or full-perspective motion models. This
school of motion segmentation has received in-
tensive attention recently among vision resear-
chers due to its solid three-dimensional modeling.
However, the inefficiency and sensitivity to noise have been its major hurdle to being employed widely
in industrial applications. (2) Markov random field
(MRF) is used to model the contextual constraints,
and the maximum a posteriori (MAP) configura-
tion for the MRF corresponding to velocities is
computed to be the sought-after motion field. This
category of techniques is denoted by MRF–MAP
framework and has been extremely successful in achieving motion segmentation (Li, 1995; Zhang
and Hanauer, 1995; Memin and Perez, 1998). This
category of schemes also leaves much to be desired in
efficiency due to the fact that the search for the MAP is
extremely time consuming. (3) Scene mosaic is
created as the representation of a given video (Irani
et al., 1995; Zoghlami et al., 1997). This method
takes a fresh avenue to achieve video compression: efforts are made to find homographies
between overlapping frames in order to generate an
image mosaic with respect to the scene; the resul-
tant scene mosaic, a mere image, is then deemed
as the representation. Extremely impressive com-
pression results have been reported. However, this
scene mosaic idea is only able to achieve the best
results in cases where the scene commanded by a camera is a static background. Furthermore, its
efficacy is heavily dependent on the computed ho-
mographies whose computation is always unreli-
able. Therefore its applicability is greatly limited.
In this paper, we develop a method in
the second category with improved efficiency.
In (Wei and Li, 1999), we developed a two-pass
algorithm (TPA) within the MRF–MAP framework to estimate the motion field with desirable per-
formance. Three novel techniques contribute to
the effectiveness and efficiency achieved by the
TPA:
1. A pre-processing step to partition a target
frame into three groups, namely, unpredictable,
uncertain, and predictable groups, which correspond to macro-blocks which are occluded or
not. Our ensuing estimation process thus needs
only to be performed on uncertain and predict-
able sites.
2. Instead of introducing an explicit line field (LF) to
circumvent the universal interactions among
neighboring sites, a truncation function whose
threshold varies as the optimization proceeds is employed to enforce interactions between
neighboring blocks.
3. The mean absolute difference (MAD) instead of
the generally used mean square error (MSE) is
used as the criterion of energy resemblance/dif-
ference. The MAD criterion is far more resilient
to outliers than the MSE; its efficacy is substan-
tiated by the theory of robust statistics (Rous-
seeuw and Leroy, 1987) and our extensive
experiments in both synthetic and real world
scenes.
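The resilience of the MAD to outliers can be illustrated with a toy comparison (the numbers are invented for illustration): one corrupted pixel dominates the MSE but barely moves the MAD.

```python
import numpy as np

def mse(a, b):
    # mean square error
    return np.mean((a - b) ** 2)

def mad(a, b):
    # mean absolute difference
    return np.mean(np.abs(a - b))

# two identical 64-pixel blocks except for a single outlier pixel
a = np.zeros(64)
b = np.zeros(64)
b[0] = 80.0  # e.g. an occluded or noise-corrupted pixel

# a genuinely different block with small but uniform differences
c = np.full(64, 2.0)

# under the MSE the outlier-corrupted pair looks far worse than the
# uniformly different pair; under the MAD the ranking is reversed
print(mse(a, b), mse(a, c))  # 100.0 4.0
print(mad(a, b), mad(a, c))  # 1.25 2.0
```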
The TPA outperforms other existing MRF–
MAP algorithms in its efficiency without much loss
of estimation precision.
As described in (Woo and Ortega, 1996), aside from achieving a better video or stereo image
compression by providing lower bit rates for
motion maps, contextual motion schemes play
essential roles in ensuing video understanding
tasks due to the fact that they provide important
cues regarding the video's visual contents. In
MPEG-4 (Zhang et al., 1997), a new concept vi-
sual object (VO) emerges as the central theme for representing multimedia applications. Unlike pre-
vious MPEG standardization efforts, where the
fundamental unit upon which the coding process
works is pixels or macro-blocks which bear
no perceptual meaning at all, VO in MPEG-4 is a
semantically meaningful unit a user is allowed to
access and manipulate. The upcoming MPEG-7
purports to provide a content representation standard for visual content-based search in digital
libraries. VO is able not only to facilitate more
effective video compression; more importantly, it
can also find its utility in a broad spectrum of
applications such as video synthesis, virtual reality,
digital libraries (Picard and Pentland, 1996; Wei,
2002), etc. Therefore, VO has an essential role to
play in the development of state-of-the-art techniques; its retrieval and representation have received
more and more attention among academia and
industry (Chang et al., 1997; Pentland et al., 1996;
Flickner et al., 1995).
VO in its own right is a semantically meaningful
concept, or referred to as high-level knowledge in
terms of Marr (1982), which can only be arrived at
by human intelligence, a perceptive process to be more specific (Yarbus, 1967). However, techniques
currently used in video processing and computer
vision belong to the low- or intermediate-level
processing. This is a sensing process. There is a
great conceptual gap between these two processes.
More information or a priori knowledge about the
VO is required to bridge the gap between them. For a still
image, it is impossible to extract VO without cor-
responding prescribed knowledge. Indeed even for
human beings it is virtually impossible to distin-
guish a well-camouflaged fly from the background.
However, for a video, given that contiguous
frames are of similar visual contents, the generally
different velocities of the background and objects
provide strong cues for the extraction of VOs. For
instance, it incurs little difficulty for human beings to find a moving fly no matter how perfect its
camouflage is. In this paper, we attempt to exploit
motion information present in image sequences for
the purpose of extracting VO which is of different
motion from the background. We will work in the
same framework as the TPA, namely, using MRF
to model the contextual constraints and MFT in
search of the MAP configuration; this is referred to as the MRF–MAP–MFT technique in the sequel.
In the TPA, adjacent sites whose interactions
are turned off by a truncation function are likely to
be the boundaries of a VO from the background.
However, the TPA gives no consideration to
interactions between those boundary locations,
which is problematic in arriving at a reasonable
boundary description of VOs. In the seminal paper of Geman and Geman (1984) which addressed
issues regarding image enhancement, a novel
concept LF is introduced to circumvent universal
interactions between spatially adjacent sites. The
LF is a dual field of the original image pixel sites,
i.e., there exists exactly one site in the LF in-
between two adjacent pixels in the original image
along both the horizontal and the vertical directions. The LF is a boolean field: if the intensity
difference of two neighboring sites in the original image
surpasses a given threshold, the corresponding site
on LF is assigned one; zero otherwise. The inter-
actions among neighboring LF sites are embedded
in search of the MAP configuration to provide a
desirable geometrical structure of boundaries of a
spatial region. The introduction of the LF is hailed as a ground-breaking novel technique and its useful-
ness has been substantiated by a wide range of
successful applications in disciplines such as
image/video processing and computer vision (Bar-
nard, 1993; Krishnamachari and Chellappa, 1997;
Konrad and Dubois, 1992). However, as discussed
at length in (Wei and Li, 1999), literal usage of the
LF in the domain of video processing is problematic due to two reasons: (1) detection of object
boundaries solely based on intensity difference is
not appropriate in that intensity discontinuities
generally do not signify object boundaries; (2) the
LF is not well defined for block-based or hierar-
chical estimation which is crucial to enhance the
processing efficiency. Therefore in video process-
ing it is difficult to take advantage of the benefits provided by the LF defined in the spatial domain.
In this paper we propose a novel concept motion
boundary field (MBF) as the analog of the LF in
video domain, which is able to capture both spatial
and temporal contextual constraints in seg-
menting out VOs. In the proposed VO segmenta-
tion algorithm, the MBF is added to enforce the
desirable structure of VO boundaries which can effectively remove spurious boundary sites and
create connected boundaries. One more improve-
ment achieved in this algorithm is the application
of the so-called slow and smooth principle of object
motion as described in the study on cognitive sci-
ence (Weiss and Adelson, 1998). This principle
stipulates that more priorities should be allocated
to sites of near-zero movement and, in the meantime, velocities on adjacent sites should be similar.
The resultant scheme of this principle is an as-
signment of different weights to positions in the
corresponding vicinity of the reference frame. Ex-
perimental results conducted on both synthetic
and real-world videos demonstrate encouraging
performance of the proposed VO segmentation
algorithm.
This paper is organized as follows. The MBF
and slow and smooth principle in the MRF–
MAP–MFT framework are introduced and justi-
fied in the next section. The proposed visual object
segmentation model and algorithm is presented in
Section 3. Section 4 demonstrates the experimental
results of the proposed algorithm for both syn-
thetic and real-world videos. We conclude this paper in Section 5 with more remarks and dis-
cussions.
1 The component of the velocity of a moving edge in the
direction of the edge cannot be determined uniquely.
2. Object segmentation using MRF–MAP–MFT
In this section, first the MRF–MAP–MFT
technique for motion estimation as developed in (Wei and Li, 1999) is outlined. Next the slow and
smooth movement phenomenon as formulated in
(Weiss and Adelson, 1998) is introduced. Finally,
the novel concept of MBF and its potential power
in object segmentation is presented.
2.1. MRF–MAP–MFT technique
In the video compression discipline, efficiency-
oriented schemes are generally employed (Jain and
Jain, 1981). Useful as they are, giving no consid-
eration to contextual constraints existing in the
spatial and temporal domain in a video, they fail
to generate satisfactory results due to the so-called
‘‘aperture problem’’ 1 and noise present in frames.
In order to obtain more refined motion vectors which are able to reflect visual contents of a video,
the MRF has been widely employed to model the
contextual interactions between adjacent sites (Li,
1995). Generally speaking, in the MRF the impact
posed to the value of the random variable on one
site, the pixel in case of images, by those on other
sites is constrained to its close neighbors, where
the ‘‘closeness’’ is determined by the definition of the neighborhood system N_{i,j}. The mathematical
formulation of Markovianity is stipulated as
below:
P(f_{i,j} | f_{S∖(i,j)}) = P(f_{i,j} | f_{N_{i,j}}).   (1)
Eq. (1) indicates the behavior of the random
variable on site (i, j) is only affected by those sites
in N_{i,j}. The maximum a posteriori (MAP) prob-
ability is most commonly used as a statistical cri-
terion for optimality and thus often chosen in
conjunction with the MRF in vision modeling (Li,
1995). The resulting framework, referred to as
MAP–MRF, is obtained.
The MRF lends us a powerful tool in modeling
the local property for images. However, for the
purpose of image processing, the computation
based on the probability density functions (pdfs)
for each site is prohibitively intensive. Due to the
equivalence of the MRF to the Gibbs random field
(GRF) (Geman and Geman, 1984; Hammersley
and Clifford, 1971), the joint probability P(f) of
the configuration f is defined as below:

P(f) = exp[−βU(f)] / Z,   (2)

where Z is a normalizing term called the partition
function:

Z = Σ_f exp[−βU(f)],   (3)

and U(f) is the sum of all possible clique potentials:

U(f) = Σ_{c∈C} V_c(f),   (4)
where a clique c is a set of sites which are neigh-
bors of each other under a neighborhood system
N, and V_c(f) is the clique potential for c. Now it is
evident that P(f) is determined entirely by the
nature of its local characteristics, i.e., the choice of
the neighborhood system and the specific clique po-
tentials which nail down the local interactions
among adjacent sites. Therefore, one can define a
neighborhood system and assign values to the
corresponding clique potentials to reflect one's
prior belief of contextual constraints. The most
probable result obtained under this prior is the MAP,
which is the configuration having the minimal energy
U(f) according to Eq. (2). Thereby a rich arsenal
of optimization methods can be used in search of
the MAP.
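For a toy binary field the MAP can indeed be found by brute force: since P(f) ∝ exp[−βU(f)], the MAP is simply the configuration of minimal energy U(f). The quadratic data term and pairwise potentials below are illustrative choices, not the potentials used later in this paper:

```python
import itertools
import numpy as np

def energy(f, data, lam=1.0):
    """U(f): single-site clique potentials (a data term) plus
    double-site clique potentials on a four-connection grid."""
    h, w = f.shape
    u = float(np.sum((f - data) ** 2))  # single-site cliques
    for i in range(h):
        for j in range(w):
            if i + 1 < h:
                u += lam * abs(f[i, j] - f[i + 1, j])  # vertical pair
            if j + 1 < w:
                u += lam * abs(f[i, j] - f[i, j + 1])  # horizontal pair
    return u

def map_config(data, labels=(0, 1)):
    """Exhaustive MAP search over all label assignments; feasible
    only for tiny fields, which is why the MFT is needed."""
    h, w = data.shape
    best, best_u = None, np.inf
    for vals in itertools.product(labels, repeat=h * w):
        f = np.array(vals).reshape(h, w)
        u = energy(f, data)
        if u < best_u:
            best, best_u = f, u
    return best
```

On a 3 x 3 observation with one flipped pixel, the smoothness potentials outvote the data term and the MAP restores the uniform field.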
There are mainly two categories of techniques
for conducting the optimization process: one is
local methods, the other is global methods (Li,
1995). The common feature shared by all local
methods is that the computation converges rap-
idly, but they are susceptible to getting stuck in
local minima. For global methods, the global
minimum can be guaranteed, however, oftentimes
at extremely high computational cost.
Mean field theory (MFT), originally an optimiza-
tion method developed in statistical mechanics
(Chandler, 1987), is found to be able to provide a
desirable trade-off between performance and effi-
ciency (Zhang, 1992; Wei and Li, 1999). In short,
with the MFT techniques, for one site p, instead of
computing interactions with all its neighboring
particles, one can first compute the mean field
generated by its adjacent sites and then evaluate
the impact exerted on p by this field. Mathemati-
cally the mean field ⟨f_{i,j}⟩ for f_{i,j} is defined as below:

⟨f_{i,j}⟩ = Σ_{f_{i,j}} f_{i,j} P(f).   (5)
The MFT turns the many-body statistical me-
chanics problem into a one-body problem, thus
resulting in reduced computational cost and
relatively desirable performance.
Therefore, the MRF–MAP–MFT technique
consists of three components: (a) use MRF to
model contextual constraints; (b) search for the
MAP as the anticipated result; (c) employ the
MFT technique in the MAP search.
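Schematically, one MFT sweep treats each site as a one-body problem: the local energy of every candidate label is evaluated against the current mean values of the neighbours, and ⟨f_{i,j}⟩ is updated as the expectation under the resulting Gibbs weights, in the spirit of Eq. (5). The labels, potentials, and β below are illustrative assumptions:

```python
import numpy as np

def mft_sweep(mean_f, data, labels=(0.0, 1.0), beta=2.0, lam=1.0):
    """One mean-field sweep over a 2-D field with a four-connection
    neighborhood: neighbours enter only through their mean values."""
    h, w = mean_f.shape
    new = mean_f.copy()
    for i in range(h):
        for j in range(w):
            nbrs = [mean_f[y, x]
                    for y, x in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                    if 0 <= y < h and 0 <= x < w]
            weights = []
            for lab in labels:
                u = (lab - data[i, j]) ** 2                 # data term
                u += lam * sum(abs(lab - m) for m in nbrs)  # smoothness term
                weights.append(np.exp(-beta * u))
            z = sum(weights)                                # local partition function
            new[i, j] = sum(l * w_ for l, w_ in zip(labels, weights)) / z
    return new
```

Iterating the sweep from the observed field, a noisy site surrounded by consistent neighbours is quickly pulled toward the neighbourhood consensus.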
2.2. Slow and smooth motion modeling scheme
As briefed in the first section, with a wealth of
visual cues, video data provides far more informa-
tion, motion especially, than still images which can
be exploited to segment out objects. Visual content
analysis based on spatial properties and motion
have an essential role to play. The behavior of
the human vision system (HVS) has inspired many
seminal theories and techniques in computer vision and image processing, such as active vision (Ballard,
1991; Aloimonos et al., 1988) which purports to
emulate the spatial configuration and eye move-
ments of the HVS to actively involve cameras in the
sensing process, and color image indexing (Swain
and Ballard, 1991; Healey and Slater, 1994; Drew
et al., 1998) which takes advantage of the perceptive
capacity of the HVS to provide drastic data reduction and effective object recognition. To achieve
desirable motion analysis result, it is necessary to
first gain deep insights into the corresponding
workings of the HVS in conducting motion analy-
sis. An excellent study from the perspective of
cognitive science is presented in (Weiss and Adel-
son, 1998), which sheds light on possible avenues
for us to take in order to enhance our motion analysis procedure. It is pointed out that due to the
inherent ambiguity of local motion signals, such as
aperture problem, in the process of motion per-
ception, a vision system must integrate many local
measurements. In the meantime, the system must
segment the local measurements because of the ex-
istence of multiple motions. Accordingly, a Baye-
sian motion perception theory is developed, which can combine different measurements as well as
prior knowledge while taking their degree of cer-
tainty into account. Indeed our MRF–MAP–MFT
technique belongs in the Bayesian strategies.
A prior probability of the Bayesian model de-
veloped in (Weiss and Adelson, 1998) incorporates
two notions: slowness and smoothness. As already
detailed in (Ullman, 1979), the HVS tends to choose the ‘‘shortest path’’ or ‘‘slowest’’ motion
consistent with given video data as the most
reasonable perceptive results, i.e., the normal ve-
locity is more often than not preferred. However,
the bias toward slow velocity is not without prob-
lem: due to the fact that for any given image se-
quence, the slowest velocity field conforming to the
visual data is that in which each point along a curved contour moves in the direction of its nor-
mal, it thus leads to highly non-rigid motion per-
cepts. This phenomenon is demonstrated by a
classical scenario provided in (Hildreth, 1983): when
a perfect circle translates horizontally, as depicted
in Fig. 1, the corresponding slowest velocity field is
highly non-rigid. Therefore, another bias toward
‘‘smooth’’ velocity fields is necessitated, i.e., adjacent locations in an image should have similar ve-
locities. Based on these two preferences, a prior
probability on motion fields can be defined such
that penalties are levied on the magnitude of ve-
locity and on the difference between adjacent motion
vectors, which are used to enforce slow and smooth
velocities, respectively. Indeed the smoothness of
velocities at adjacent locations is handled elegantly by clique potentials in the MRF–MAP–MFT
framework. However, the slow bias is not embed-
ded yet. In the VO segmentation algorithm to be
presented later, by assigning larger weights to slow
motion and smaller weights to rapid movements in
our energy functions, the slowness bias is added
into the MRF–MAP–MFT technique.
2.3. Motion boundary field and its utility in object
segmentation
The smoothness bias in motion perception as
presented in the previous section indicates that
Fig. 1. The illustration of non-rigid velocity fields in the case where
only the ‘‘slowness’’ bias is applied. Left: a horizontally moving
circle; right: the slowest velocity field consistent with the visual
data.
motion velocities at adjacent locations are similar.
However, the smoothness prior is violated when
adjacent pixels belong to regions of different ob-
jects. If this bias is enforced universally, the over-
smoothness artifact is then present. In view of the
problems of the LF in video processing, in (Wei
and Li, 1999) a truncation function g(·, ·) as de-
fined below is introduced to allow for discontinu-
ities between adjacent motion velocities. Suppose
c_d is a threshold whose value can change dynam-
ically in the MFT process, and d₁ and d₂ are the
motion vectors of two adjacent sites:

g(d₁, d₂) = ‖d₁ − d₂‖  if ‖d₁ − d₂‖ ≤ c_d,
g(d₁, d₂) = g  (a constant, g < c_d)  otherwise,   (6)
where c_d is max{8e^{−i/8}, 4} and i is the number of the
current iteration. The rationale behind this defi-
nition is: if the magnitude of the difference of the
motion vectors on two adjacent sites is too large,
the two sites are not considered to be in the same
object, and a small penalty is levied for the ap-
pearance of this discontinuity.
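A minimal transcription of Eq. (6) and its threshold schedule; the value chosen for the constant penalty g is an assumption (any value below c_d fits the definition):

```python
import numpy as np

def c_d(i):
    """Dynamic threshold of Eq. (6): starts near 8 and decays
    toward a floor of 4 as the iteration count i grows."""
    return max(8 * np.exp(-i / 8), 4)

def g(d1, d2, i, penalty=2.0):
    """Truncated difference between two adjacent motion vectors.
    `penalty` plays the role of the constant g < c_d charged once
    the velocity difference exceeds the threshold (its exact value
    is an illustrative assumption)."""
    diff = np.linalg.norm(np.asarray(d1, float) - np.asarray(d2, float))
    return diff if diff <= c_d(i) else penalty
```

Early on (i = 0, c_d = 8) a moderate velocity difference is still smoothed; once c_d has decayed to its floor, the same pair is declared discontinuous and draws only the small constant penalty.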
As derived and substantiated by our experi-
ments (Wei and Li, 1999), the benefit provided
by the introduction of the preceding truncation
function is computational economy, due to the fact that the decision as to the motion dis-
continuity is a mere check on the difference of
adjacent velocities. It is one of the major reasons
for the efficiency achieved by the TPA. However,
the emergence of motion discontinuity itself carries
an especially important information about visual
contents of a video: it indicates the potential lo-
cation of the boundaries of VOs. If these are indeed
boundary locations of a certain VO, then a certain
structure such as connectedness has to be satisfied.
For instance, isolated boundary locations are
more often than not spurious and should be re-
moved. Therefore, it is appropriate to employ
another MRF to reflect these boundary locations
within which the discontinuity of boundary sites
should be penalized. With the aforementioned
feature, we refer to this new MRF as the MBF. The
similarity of this new field to the LF is:
• like the LF, as depicted in Fig. 2, the sites of the
MBF for an image are also duals of the original
Fig. 2. Sites of the MBF B (cross) are the duals of the original
pixel positions (dot). A site in B between two pixels indicated by
(i, j) and (i′, j′) is denoted by b_{(i,j),(i′,j′)}.
Fig. 3. Six-connectedness for MBF sites. (a) and (b): neighbors
for an MBF site between two horizontally and vertically adja-
cent image pixels, respectively.
image pixels since they indicate the relationship
between adjacent pixels;
• the values given to the MBF are also boolean: 1 is
assigned in the presence of a VO boundary, 0
otherwise.
Unlike the LF, where site values are determined by intensities, whether or not a
boundary location is present is determined by the
velocities on adjacent pixels as well as the behavior
of adjacent MBF sites. The prior belief of the MBF
under our Bayesian scheme is:
1. connectedness in terms of the six-connection neighbor-
hood system as depicted in Fig. 3;
2. spatial smoothness due to the fact that the sites form
the possible contour of a VO.
Therefore penalties are levied on violations of
these two prior beliefs, such as discontinuities and
non-smoothness of the MBF. These prior beliefs
can be materialized by assigning different clique
potentials to various structures of the MBF. In
Fig. 4, one clique potential assigning scheme for
Fig. 3(a) is illustrated. The neighborhood for Fig. 3(b)
can be treated in like manner: through a 90° rotation
of Fig. 4. As can be seen, the isolated MBF site (a)
is harshly penalized by giving a large potential,
whereas the smoothest possible boundary, a line as in
(b), is encouraged by assigning the smallest po-
tential magnitude. Given that estimates of motion
vectors are determined by the similarity of two
macro-blocks or pixels together with the bias
of slowness, the MBF is indeed determined by the combination of spatial as well as temporal
characteristics of the video. Furthermore, the as-
signment of clique potentials ensures desired geo-
metrical behavior of the sought-after visual object
boundary.
With the introduction of a boolean MBF, the
smoothness bias is turned off for velocities on
adjacent pixels should the MBF site between them take the value 1. Another important issue
which has to be addressed is the paradox raised
by the MBF, namely the interdependencies between ve-
locities and MBF assignments: the assignment to
a site in the MBF relies on the velocities on the
two image pixels, while in estimating motion
vectors the enforcement of smoothness on these
two velocities is decided by the MBF assignments. To circumvent this problem, we go with
the same strategy as the aforementioned trunca-
tion function: initially fewer 1's are assigned to
the MBF sites in order to allow for more content-
based interactions among adjacent pixels/macro-
sites. As processing goes on, the threshold to
declare a motion boundary is increased to cir-
cumvent the over-smoothness artifacts for the final estimates.
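One half of the alternating scheme, re-estimating horizontal MBF sites from the current velocity field, can be sketched as below; the field shape and the decaying declaration threshold (patterned after the c_d schedule of Eq. (6)) are assumptions:

```python
import numpy as np

def update_mbf_horizontal(d, i, floor=4.0):
    """Set b = 1 between horizontally adjacent pixels whose current
    velocities (d has shape h x w x 2) differ by more than a
    threshold that decays with the iteration count i, so that fewer
    boundary sites are declared in the early iterations."""
    thresh = max(8 * np.exp(-i / 8), floor)
    diff = np.linalg.norm(d[:, 1:] - d[:, :-1], axis=2)  # per-pair velocity gap
    return (diff > thresh).astype(int)
```

With one half of a small field moving at (6, 0) and the other half static, the seam is below threshold at i = 0 but is declared a boundary once the threshold reaches its floor.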
Fig. 4. A clique potential assigning scheme for the MBF, where violations of connectedness and smoothness are penalized. Special care is
taken for three-site clique potential assignments due to their different smoothness: collinear three-site cliques are given the smallest
penalties V_3s; V_3c refers to those three-site boundary cliques close to a line, i.e., 6 0 2, 5 0 1, 5 0 3, 2 0 4; V_3v corresponds to the V-shaped three-site cliques: 1 0 6, 6 0 5, 4 0 3, 3 0 2, 1 0 2; we make no difference among 4-, 5-, 6-, and 7-site clique potentials.
3. Proposed visual object segmentation algorithm
In this section the new visual object segmenta-
tion algorithm is stipulated based on the principles
presented in the preceding section. First we shall
formulate MRF models used by our VO segmen-
tation algorithm, then the algorithm of VO seg-
mentation itself is given.
3.1. MRF model of the VO segmentation algorithm
In video processing, there are two regions pre-
sent in a target frame: one comprises those whose corre-
spondences can be found in the reference frame,
while the other comprises those which cannot find their
correspondences due to occlusion or newly present
scene. The former is referred to as predictable re-
gion S_predictable, and the latter is denoted by the unpre-
dictable region S_unpredictable. For motion estimation,
two MRFs are needed: one is the motion field D,
which assigns one velocity for each pixel or macro-
block; the other is the unpredictable field O, indi-
cating the presence of occlusion or newly present
scenes.
In the TPA, to effect an efficient partition, a
‘‘double threshold’’ preprocessing step is carried
out: for one site or macro-block M_t in a target
frame, use a block matching algorithm to locate a
site/macro-block M_r in the reference frame with
the error energy e which minimizes the MSE. Based on
the magnitude of e, three fields are assigned:

1. If e is greater than a prescribed threshold δ_1 of
larger magnitude, M_t is deemed as belonging to
S_unpredictable.
2. If e is smaller than another threshold δ_2, M_t is
labeled as S_predictable.
3. Otherwise, M_t is categorized as an instance of a
third field referred to as uncertain, S_uncertain, whose
final assignment into predictable or unpredict-
able can only be obtained after the MFT opti-
mization process.
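The double-threshold labelling can be sketched as below; the numeric thresholds stand in for δ_1 and δ_2 and are illustrative:

```python
import numpy as np

UNPREDICTABLE, UNCERTAIN, PREDICTABLE = 0, 1, 2

def classify_block(e, d1=100.0, d2=10.0):
    """Label one macro-block from its minimal matching error e,
    with d1 > d2 playing the roles of the two thresholds."""
    if e > d1:
        return UNPREDICTABLE  # no plausible correspondence (occlusion/new scene)
    if e < d2:
        return PREDICTABLE    # confident correspondence
    return UNCERTAIN          # deferred to the MFT optimization

def partition(errors):
    """Label every macro-block of a frame given its error map."""
    return np.vectorize(classify_block)(errors)
```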
The previous partition results in the following
processing schemes:
1. sites in S_unpredictable are excluded from the ensuing MFT
optimization procedure since no meaningful es-
timates for D or O can be achieved;
2. the random variable corresponding to O is not
defined on sites belonging to S_predictable;
3. only on S_uncertain are both random variables, d for D
and o for O, defined.
Following other vision researchers' lead in
MRF modeling (Chellappa and Jain, 1993; Li,
1995; Besag, 1986), the neighborhood systems N
for both D and O are chosen to be four-
connection or first-order (Sanz, 1996). Two types
of cliques are thus present: single-site and double-site.
No difference is made for horizontal or vertical
double-site clique potentials in our implementa-
tion.
Our MRF based VO segmentation algorithm
therefore consists of three MRFs:
1. The vector motion field D: for each image pixel
(i, j), d(i, j) is an integer vector indicating the
corresponding motion vector. The use of bold
font is to emphasize the fact that d(i, j) is a vec-
tor.
2. The boolean boundary field B: for each MBF
site between two pixels (i, j) and (i′, j′), b_{(i,j),(i′,j′)}
signifies whether a boundary is present between
(i, j) and (i′, j′).
3. The boolean unpredictable field O: for each
pixel (i, j), o(i, j) indicates whether (i, j) is unpredict-
able based on currently available knowledge.
Energy functions corresponding to these three
MRFs are given below.
(1) Energy function of D:

U_{i,j}(d(i, j)) = (1 − o(i, j)) U_{d(i,j)}
    + λ_d (1 − o(i, j)) Σ_{(i′,j′)∈N(i,j)} (1 − o(i′, j′)) (1 − b_{(i,j),(i′,j′)}) ‖d(i, j) − ⟨d(i′, j′)⟩‖,   (7)
On the RHS, the first term is the single-site
clique potential, which measures the pixelwise or
blockwise difference between the pixel/block in the
target frame and the one in the reference frame
with a motion vector d(i, j):

U_{d(i,j)} = |f^t_{i,j} − f^r_{i1,j1}|,   (8)

where (i1, j1) is the position (i, j) + d(i, j), and the
reason to choose the MAD instead of the mean
square energy as a measure of the similarity be-
tween two sites/blocks is that the former is far
more resilient to outliers than the latter, hence
more robust results can be effected (Hampel et al.,
1986). The second term on the RHS of Eq. (7)
corresponds to the smoothness of the velocity,
which is enforced only when the following three
conditions hold:
1. the current pixel/block does not belong to the
unpredictable field ðoði; jÞ ¼ 0Þ; and2. the neighbor is not an unpredictable site
ðoði0; j0Þ ¼ 0Þ; and finally3. the two adjacent sites belong to the same VO
ðbði;jÞ;ði0 ;j0Þ ¼ 0Þ.kd is a parameter which is used to adjust the
impact of smoothness: a larger magnitude in-
duces smoother velocity, and vice versa.
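As a concrete illustration, the single-site MAD potential of Eqs. (7) and (8) can be sketched as below. The block size, array layout, and function name are assumptions of this sketch, not details from the paper's implementation:

```python
import numpy as np

def mad_potential(target, ref, i, j, d, block=4):
    """Single-site clique potential U_d(i,j) of Eqs. (7)/(8): the
    mean absolute difference (MAD) between the block at (i, j) in
    the target frame and the block displaced by d in the reference
    frame.  MAD is used rather than mean squared error because it
    is more resilient to outliers."""
    i1, j1 = i + d[0], j + d[1]
    t = target[i:i + block, j:j + block].astype(float)
    r = ref[i1:i1 + block, j1:j1 + block].astype(float)
    return np.mean(np.abs(t - r))
```

A candidate displacement that aligns the two blocks drives this potential to zero, which is exactly the behavior the first term of Eq. (7) rewards.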
The MFT optimization process can be performed using Eqs. (7), (5) and (2), which is the actual MAP search employed in (Wei and Li, 1999; Zhang, 1992). However, the corresponding prior belief for velocities under this scheme is that they are uniform, which is not in concert with the slowness of human percepts. To enforce the principle of slow velocity as a reasonable prior belief in our Bayesian approach, higher priority should be given to slow velocities. Therefore the following mean field computation is employed, by plugging Eq. (2) into Eq. (5) with a weight-assigning scheme:
$$\langle \vec d(i,j) \rangle = \frac{1}{Z} \sum_{\vec d(i,j)} \vec d(i,j)\, \exp\big[-\beta\, w(i,j)\, U(\vec d(i,j))\big], \quad (9)$$

where w(i, j) is a function I^2 \to I^+ whose value is in inverse proportion to the eccentricity of |\vec d(i,j)| from 0. For instance, it may be defined as:

$$w(i,j) = \exp\big(-\big|\vec d(i,j) \cdot \vec c\,\big|\big), \quad (10)$$
3134 J. Wei, I. Gertner / Pattern Recognition Letters 24 (2003) 3125–3139
where the constant vector \vec c is taken to be (1, 1)^T, so that |\vec d(i,j) \cdot \vec c| is simply the sum of the absolute values of the components of \vec d(i,j). The partition function Z of Eq. (9) is accordingly formulated as:

$$Z = \sum_{\vec d(i,j)} \exp\big[-\beta\, w(i,j)\, U(\vec d(i,j))\big]. \quad (11)$$
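Mechanically, Eqs. (9)-(11) amount to a Boltzmann-weighted average over a finite set of candidate displacements, with each candidate's energy scaled by the weight of Eq. (10). The finite candidate set and per-candidate energy vector are assumptions of this sketch:

```python
import numpy as np

def mean_field_velocity(energies, displacements, beta=1.0, c=(1.0, 1.0)):
    """Mean-field estimate <d(i,j)> of Eqs. (9)-(11): a weighted
    average over candidate displacements, where each candidate's
    energy U(d_k) (energies[k]) is scaled by the weight w of
    Eq. (10) before the Gibbs exponentiation."""
    d = np.asarray(displacements, dtype=float)   # shape (K, 2)
    u = np.asarray(energies, dtype=float)        # shape (K,)
    w = np.exp(-np.abs(d) @ np.asarray(c))       # Eq. (10): sum of |components|
    logits = -beta * w * u
    logits -= logits.max()                       # numerical stability
    p = np.exp(logits)
    p /= p.sum()                                 # divide by partition function Z, Eq. (11)
    return p @ d                                 # expected displacement vector
```

With all energies equal, the estimate reduces to the plain average of the candidates; a candidate with markedly lower energy dominates the expectation.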
(2) Energy function for O:

The energy function for the unpredictable field O is given below:

$$U_{i,j}(o(i,j)) = o(i,j)\,\big[C_o - \lambda_p U_{\langle \vec d(i,j) \rangle}\big] + \lambda_q \sum_{(i',j') \in N(i,j)} (1 - b_{(i,j),(i',j')})\,\big|o(i,j) - \langle o(i',j') \rangle\big|. \quad (12)$$

The second term on the RHS of Eq. (12) is meant to handle the smoothness of the unpredictable field.
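For a boolean site variable such as o(i, j), plugging the two state energies into the Gibbs form of the mean-field update (Eq. (5), given earlier in the paper) reduces to a logistic function of the energy gap. This is a generic sketch of that reduction; the function name is an illustrative assumption:

```python
import math

def mean_field_binary(u0, u1, beta=1.0):
    """Mean-field expectation of a boolean site variable such as
    o(i, j): with two states, <o> = exp(-beta*u1) /
    (exp(-beta*u0) + exp(-beta*u1)), i.e. a logistic function of
    the energy gap u0 - u1 between the states o = 0 and o = 1."""
    return 1.0 / (1.0 + math.exp(-beta * (u0 - u1)))
```

Equal energies give <o> = 0.5, and a much lower energy for one state pushes the expectation toward that state, which is how the energies of Eq. (12) would drive the mean field of O.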
(3) Energy function for B:

$$U_{(i,j),(i',j')}(b_{(i,j),(i',j')}) = (1 - b_{(i,j),(i',j')})\,(1 - o(i,j))\,(1 - o(i',j'))\, g\big(\vec d(i,j), \vec d(i',j')\big) + \lambda_b \sum_c V_c, \quad (13)$$

where g(\cdot, \cdot) is the truncation function, \lambda_b is a control parameter regarding the smoothness of the MBF, and the V_c's are all possible clique potentials depicted in Fig. 4.
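The truncation function g is defined earlier in the paper; a common form, shown here purely as an illustrative assumption, clips the velocity difference between the two adjacent sites so that a single large mismatch cannot dominate the boundary energy:

```python
def truncated_difference(d1, d2, ceiling=4.0):
    """Hedged sketch of a truncation function g(., .) as used in
    Eq. (13): the L1 difference between two neighboring motion
    vectors, clipped at `ceiling`.  Both the exact form of g and
    the ceiling value are assumptions of this sketch."""
    diff = abs(d1[0] - d2[0]) + abs(d1[1] - d2[1])
    return min(diff, ceiling)
```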
3.2. The VO segmentation algorithm
Based on the MRF models formulated previously, we are ready to present the VO segmentation algorithm, which searches for the joint fixed point of D, O, and B along the same lines as the TPA.
1. Carry out the ‘‘double thresholding’’ step to partition sites in the target frame into S_predictable, S_uncertain, and S_unpredictable.

2. (a) Use Eqs. (7) and (9) to compute the mean field of D.
   (b) Use Eqs. (12) and (5) to compute the mean field of O.
   (c) Use Eqs. (13) and (5) to compute the mean field of B.

3. Compute the normalized difference e_k of the three mean fields for iteration k:

$$e_k = \Big[\big\|\langle \vec d^{(k)} \rangle - \langle \vec d^{(k-1)} \rangle\big\|^2 + \big\|O^{(k)} - O^{(k-1)}\big\|^2 + \big\|B^{(k)} - B^{(k-1)}\big\|^2\Big]^{1/2} \Big/ \|S\|. \quad (14)$$

   If e_k is smaller than a prescribed threshold \epsilon_1, or de_k = e_k - e_{k-1} is smaller than another given threshold \epsilon_2 (< \epsilon_1), exit; otherwise go to step 2(a).

4. Carry out the component labeling algorithm (Jain et al., 1995) to remove linked MBFs whose site count is less than 5.
In the preceding algorithm, a joint fixed point for the three involved MRFs is deemed to have been reached when e_k, or the change de_k of e_k between two contiguous iterations, is sufficiently small.
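Steps 3 and 4 of the algorithm can be sketched as follows: the convergence measure of Eq. (14) over the three stacked mean fields, and a 4-connected component labeling pass that drops boundary fragments smaller than five sites. The function names and the stack-based labeling are assumptions of this sketch, not the paper's implementation:

```python
import numpy as np

def normalized_change(prev, curr, n_sites):
    """Convergence measure e_k of Eq. (14): the L2 norm of the
    change in the stacked mean fields (<d>, O, B) between two
    iterations, normalized by the number of sites."""
    diff2 = sum(np.sum((c - p) ** 2) for p, c in zip(prev, curr))
    return np.sqrt(diff2) / n_sites

def remove_small_components(mask, min_size=5):
    """Step 4: label 4-connected components of the boolean MBF
    mask and keep only those with at least `min_size` sites,
    removing spurious boundary fragments."""
    mask = mask.astype(bool)
    out = np.zeros_like(mask)
    seen = np.zeros_like(mask)
    rows, cols = mask.shape
    for si in range(rows):
        for sj in range(cols):
            if mask[si, sj] and not seen[si, sj]:
                stack, comp = [(si, sj)], []
                seen[si, sj] = True
                while stack:                     # flood fill one component
                    i, j = stack.pop()
                    comp.append((i, j))
                    for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                        if 0 <= ni < rows and 0 <= nj < cols \
                                and mask[ni, nj] and not seen[ni, nj]:
                            seen[ni, nj] = True
                            stack.append((ni, nj))
                if len(comp) >= min_size:        # keep only large components
                    for i, j in comp:
                        out[i, j] = True
    return out
```

In practice the outer loop would alternate the three mean-field updates, evaluate `normalized_change` against the thresholds \epsilon_1 and \epsilon_2, and apply `remove_small_components` once at exit.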
4. Implementation details and experimental results
In this section, we report experimental results obtained with the algorithm developed in the preceding section. The control parameters appearing in the formulas of the segmentation algorithm are given in Table 1. These values are mostly obtained by trial and error; Monte Carlo studies are still needed to arrive at optimal parameters. Excellent discussions on this topic can be found in (Celeux et al., 2003).
All frames of our experimental videos are 256-level gray-scale images. Two sets of results will be presented: one on synthetic image sequences, where the reference and target frames are generated with known object shapes and velocities; the other on videos downloaded from the public domain. Our algorithm works equally well on pixels or macro-blocks; in this section all results are based on 4 × 4 macro-blocks.
4.1. Synthetic image sequences
In this section, experiments conducted on two synthetic image sequences are first reported: one contains a single moving object, while the other has two moving objects. We then present statistical studies on more synthetic videos.
Table 1
Values chosen for each parameter in our implementation

Parameter    Chosen value         Parameter    Chosen value   Parameter   Chosen value
\beta        1.0                  \lambda_q    6              \lambda_d   12.8
\lambda_b    4.5                  c_{p1}       40             c_{p2}      10
c_d          max{8e^{-i/8}, 4}    g            c_d/2          C_o         16
\epsilon_1   0.02                 \epsilon_2   0.01
Two 256 × 256 synthetic images are obtained in the following way: an image I is generated as the realization of an independent and identically distributed Gaussian random process with mean \mu = 128 and standard deviation \sigma = 10. Then a 64 × 64 square in the center of I is moved with motion vector (2, 3). The disoccluded area is filled with samples from the same Gaussian process that generated I. The resulting image is referred to as I'. Finally, white Gaussian noise with \sigma = 1 is added to both I and I'. They are viewed as the reference and target frame, respectively, in the object segmentation process. These two images and the resulting MBF are shown in Fig. 5.
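The generation procedure just described can be sketched directly; the function name, seed handling, and default arguments are assumptions of this sketch:

```python
import numpy as np

def make_synthetic_pair(size=256, square=64, motion=(2, 3),
                        mu=128.0, sigma=10.0, noise_sigma=1.0, seed=0):
    """Generate the synthetic reference/target pair of Section 4.1:
    an i.i.d. Gaussian image I (mean mu, std sigma), a central
    square moved by `motion`, the disoccluded area refilled from
    the same Gaussian process, and white Gaussian noise (std
    noise_sigma) added to both frames."""
    rng = np.random.default_rng(seed)
    ref = rng.normal(mu, sigma, (size, size))
    top = (size - square) // 2
    patch = ref[top:top + square, top:top + square].copy()
    tgt = ref.copy()
    # refill the vacated (disoccluded) area from the same process
    tgt[top:top + square, top:top + square] = rng.normal(mu, sigma, (square, square))
    # paste the square at its displaced position
    di, dj = motion
    tgt[top + di:top + di + square, top + dj:top + dj + square] = patch
    ref += rng.normal(0.0, noise_sigma, ref.shape)
    tgt += rng.normal(0.0, noise_sigma, tgt.shape)
    return ref, tgt
```

Because only the square and the disoccluded strip differ between the two frames, the correct MBF for this pair is the displaced square's outline.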
A second synthetic image sequence consists of two objects, generated using the same Gaussian process and patching scheme as the single-object case. Two 64 × 64 squares around the center move with velocities (0, 3) and (0, −3), respectively, and again white noise with \sigma = 1 is added to both images. The reference and target frames, together with the corresponding MBF image, are illustrated in Fig. 5. We arrived at these two estimation results within 15.67 and 15.71 s, respectively. Our program is written in C++ under RedHat Linux 7.3; the computer used is a Pentium III at 850 MHz with 130 MB of RAM, and we used the clock (real) time returned by the system call time for our timing results.

Fig. 5. Synthetic image sequences and their corresponding MBF images. Row 1: one moving object; Row 2: two adjacent moving objects. Column 1: reference frame; Column 2: target frame; Column 3: MBF image.
It can be observed from Fig. 5 that the MBFs estimated from these two synthetic sequences are of satisfactory quality.
In order to gain more insight into the performance of our algorithm, a statistical study is conducted on more synthetic images. We generate 25 synthetic sequences with two moving objects in the same manner as row 2 of Fig. 5, using 25 different standard deviations ranging from 5 to 18 with interval 0.5. In Fig. 6, the number of errors and the corresponding computation times are shown. It can be observed that when the objects have reasonable texture, e.g., \sigma > 6, the estimated boundaries are of high quality; namely, the accuracy ratio is well above 90% (1 − 10/136), since there are fewer than 10 errors. Indeed, when \sigma > 12 it is extremely unlikely in our experiments that there is more than one error. Moreover, the processing time is relatively stable at less than 20 s on our computer which, as aforementioned, has no deep computational power at all. However, when the texture is too weak, e.g., \sigma < 6, the aperture problem dominates the estimation and the performance drops significantly (cf. the sharp increase in errors when \sigma < 6). In cases where \sigma < 5 the estimates are too unstable, and it is thus statistically meaningless to include them in the diagram.
Fig. 6. Statistical results for 25 synthetic sequences with two moving objects. Horizontal axes for both diagrams indicate the std's used to generate the frames. Vertical axis for Column 1: number of MBF errors; Column 2: processing time measured in seconds.
4.2. Real-world videos
Extensive experiments have been conducted on public videos.² In Fig. 7, four sets of experimental videos and their MBF images are depicted. For sequence Salesman, the estimated MBFs essentially detect three regions of apparent motion: the face and the two arms. In sequence Table tennis, the ball, the player's two hands, and the plate are reflected well by the MBFs. The MBFs for sequence Claire largely reveal the talking face; the additional boundaries present in the middle of her face are due to the relatively larger motion of the region around her nose, and, due to the smoothness of the region between her mouth and her chin, her chin tip and the rest of her face are separated. As for the MBFs for Hall monitor, the upper boundary evidently corresponds to the upper part of the person's torso; the internal boundary is caused by the motion of his left arm, since disocclusion occurs between his left arm and his torso, whereas the lower half of his left leg, which has considerable motion, is covered by the other group of MBFs.
5. Conclusion
In this paper, in light of the slow and smooth prior beliefs of human motion percepts, we develop a new visual object segmentation algorithm using the MRF–MAP–MFT framework: the MRF is used to model the contextual constraints of video data, and the MAP configuration, sought using the MFT, is taken as the desirable configuration. Instead of the LF originally proposed in the discipline of image restoration to circumvent over-smoothing artifacts, we formulate a novel MRF, the MBF, to handle the appropriate smoothness and discontinuity of velocities among adjacent pixels/macro-blocks. The prior beliefs about a VO boundary, i.e., continuity and smoothness, can be readily embedded in the clique potentials upon which the MFT optimization is conducted. Experimental results on both synthetic and real-world scenes have demonstrated encouraging performance.

² Videos used in this article are downloaded from http://www.ipl.rpi.edu/resource/sequences/index.html, thanks to the Center for Image Processing Research of Rensselaer Polytechnic Institute.
The visual object boundary extracted by the proposed algorithm is exactly the contour of an alpha-plane as presented in the MPEG-4 proposals (Zhang et al., 1997). It thus achieves the valuable task of object retrieval. Automatic visual object retrieval from raw video data is of vital importance in state-of-the-art research on video processing and digital libraries, due to its ability to provide a semantically meaningful unit, the visual object, for efficient coding, indexing, and manipulation in a broad spectrum of applications such as video compression, content-based video indexing and recognition, virtual reality, and video synthesis. It is thus not a surprise that the concept of the visual object is of utmost importance in the ongoing MPEG standardization effort.
The proposed visual object segmentation algorithm is not without problems: it can only extract moving objects whose texture is distinguishable from the background. This phenomenon is demonstrated by Fig. 6, where our algorithm breaks down when the generating \sigma is less than 6. More refined methods are still needed to alleviate this aperture problem. Possible avenues are to embed into our energy formulation more a priori knowledge about the scene and visual objects, in addition to general knowledge such as the slow and smooth principles. For instance, a bias toward certain specific textures or spatial statistical structures can be effectively incorporated into our MRF model.

Fig. 7. Experimental results conducted on publicly available real-world videos. The first two columns: the two frames our algorithm is applied to; the third column: the estimated MBF. The dimensions and running time of our algorithm for each sequence are also given after the sequence names.
Acknowledgements
This work is sponsored by the Office of Naval Research under contract no. N000140210122.
References
Aloimonos, J., Bandopadhay, A., Weiss, I., 1988. Active vision.
Internat. J. Comput. Vision 1 (4), 333–356.
Ballard, D.H., 1991. Animate vision. Artificial Intell. 48, 57–86.
Barnard, S., 1993. Stereo matching. In: Chellappa, R., Jain,
A.K. (Eds.), Markov Random Fields. Academic Press,
pp. 245–271.
Besag, J., 1986. On the statistical analysis of dirty pictures (with
discussions). J. Roy. Statist. Soc. Ser. B 48, 259–302.
Borshukov, G.D., Bozdagi, G., Altunbasak, Y., Tekalp, A.M.,
1997. Motion segmentation by multi-stage affine classifica-
tion. IEEE Trans. Image Process. 6 (11), 1581–1594.
Celeux, G., Forbes, F., Peyrard, N., 2003. EM procedures using
mean field-like approximations for Markov model-based
image segmentation. Pattern Recognition 36 (1), 131–144.
Chandler, D., 1987. Introduction to Modern Statistical Me-
chanics. Oxford University Press.
Chang, S.-F. et al., 1997. VideoQ: An automated content-
based video search system using visual cues. In: Proc. ACM
Multimedia.
Chellappa, R., Jain, A.K., 1993. Markov Random Fields.
Academic Press.
Drew, M.S., Wei, J., Li, Z.N., 1998. On illumination invariance
in color object recognition. Pattern Recognition 31 (8),
1077–1087.
Flickner, M. et al., 1995. Query by image and video content:
The QBIC system. Proc. IEEE 83 (9), 23–31.
Geman, S., Geman, D., 1984. Stochastic relaxation, Gibbs
distributions, and the Bayesian restoration of images. IEEE
Trans. Pattern Anal. Machine Intell. 6 (6), 721–741.
Hager, G.D., Belhumeur, P.N., 1998. Efficient region tracking
with parametric models of geometry and illumination. IEEE
Trans. Pattern Anal. Machine Intell. 20 (10), 1025–1039.
Hammersley, J.M., Clifford, P., 1971. Markov fields on finite
graphs and lattices. Unpublished.
Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., Stahel, W.A.,
1986. Robust Statistics––The Approach Based on Influence
Functions. John Wiley and Sons.
Healey, G., Slater, D., 1994. Global color constancy: Recog-
nition of objects by use of illumination-invariant properties
of color distribution. J. Opt. Soc. Am. A 11 (11), 3003–3010.
Hildreth, E.C., 1983. The Measurement of Visual Motion. MIT
Press.
Irani, M., Hsu, S., Anandan, P., 1995. Video compression
using mosaic representations. Signal Process.: Image Com-
mun. 7 (4), Special issue on coding techniques for low bit-
rate video.
Jain, J., Jain, A., 1981. Displacement measurement and its
applications in interframe image coding. IEEE Trans.
Commun. 29 (12), 1799–1808.
Jain, R., Kasturi, R., Schunck, B.G., 1995. Machine Vision.
McGraw-Hill.
Koga, T. et al., 1993. Motion compensated interframe coding
for video conferencing. In: Proc. Nat. Telecommun. Conf.
pp. 5.3.1–5.3.5.
Konrad, J., Dubois, E., 1992. Bayesian estimation of motion
vector fields. IEEE Trans. Pattern Anal. Machine Intell. 14,
910–927.
Krishnamachari, S., Chellappa, R., 1997. Multiresolution
Gauss–Markov random field models for texture segmenta-
tion. IEEE Trans. Image Process. 6 (2), 251–267.
Li, S.Z., 1995. Markov Random Field Modeling in Computer
Vision. Springer-Verlag.
Marr, D., 1982. Vision. W.H. Freeman.
Memin, E., Perez, P., 1998. Optical flow estimation and object-
based segmentation with robust techniques. IEEE Trans.
Image Process. 7 (5), 703–719.
Pentland, A., Picard, R.W., Sclaroff, S., 1996. Photobook:
Content-based manipulation of image databases. Internat.
J. Comput. Vision 18 (3), 233–254.
Picard, R.W., Pentland, A.P. (Eds.), 1996. Special issue on
digital library. IEEE Trans. Pattern Anal. Machine Intell. 18 (8).
Rousseeuw, P.J., Leroy, A.M., 1987. Robust Regression and
Outlier Detection. Wiley.
Sanz, J.L.C. (Ed.), 1996. Image Technology. Springer-Verlag.
Srinivasan, R., Rao, K.R., 1985. Predictive coding based on
efficient motion estimation. IEEE Trans. Commun. 33, 888–
896.
Swain, M.J., Ballard, D.H., 1991. Color indexing. Internat. J.
Comput. Vision 7 (1), 11–32.
Ullman, S., 1979. The Interpretation of Visual Motion. The
MIT Press.
Wang, J.Y.A., Adelson, E.H., 1994. Representing moving
images with layers. IEEE Trans. Image Process. 3 (5),
625–638.
Wei, J., 2002. Color object indexing and recognition in
digital libraries. IEEE Trans. Image Process. 11 (8), 912–
922.
Wei, J., Li, Z.N., 1999. An efficient two-pass MAP–MRF
algorithm for motion estimation based on mean field theory.
IEEE Trans. Circ. Sys. Video Tech. 9 (6), 960–972.
Weiss, Y., Adelson, E.H., 1998. Slow and smooth: A Bayesian
theory for the combination of local motion signals in human
vision. Technical Report 1624, MIT AI Lab.
Woo, W., Ortega, A., 1996. Stereo image compression with
disparity compensation using the MRF model. In: Proc.
SPIE Visual Communication and Image Processing
(VCIP'96), vol. 2727, pp. 28–41.
Yarbus, A.L., 1967. Eye Movements and Vision. Plenum, New
York.
Zhang, J., 1992. Mean field theory in EM procedures
for MRF's. IEEE Trans. Signal Process. 40 (10), 2570–
2583.
Zhang, J., Hanauer, G.G., 1995. The application of mean field
theory to image motion estimation. IEEE Trans. Image
Process. 4 (1), 19–32.
Zhang, Y.-Q., Zafar, S., 1992. Motion-compensated wavelet
transform coding for color video compression. IEEE Trans.
Circ. Sys. Video Tech. 2 (3), 285–296.
Zhang, Y.-Q. et al. (Ed.), 1997. Special issues on MPEG-4.
IEEE Trans. Circ. Sys. Video Tech. 7(1).
Zoghlami, I., Faugeras, O., Deriche, R., 1997. Using geometric
corners to build a 2d mosaic from a set of images. In: Proc.
IEEE CVPR'97, pp. 420–425.