MRF–MAP–MFT visual object segmentation based on motion boundary field
Pattern Recognition Letters 24 (2003) 3125–3139
Jie Wei *, Izidor Gertner
Department of Computer Science, The City College of the City University of New York,
Convent Avenue at 138th Street, New York, NY 10031, USA
Received 31 March 2003; received in revised form 16 July 2003
Abstract
In our earlier work, a two-pass motion estimation algorithm (TPA) was developed to estimate a motion field for two
adjacent frames in an image sequence where contextual constraints are handled by several Markov random fields
(MRFs) and the maximum a posteriori (MAP) configuration is taken to be the resulting motion field. In order to
provide a trade-off between efficiency and effectiveness, the mean field theory (MFT) was selected to carry out the
optimization process to locate the MAP with desirable performance. Given that currently in the disciplines of digital
library [IEEE Trans. PAMI 18 (8) (1996); IEEE Trans. Image Process. 11 (8) (2002) 912] and video processing [IEEE
Trans. Circ. Sys. Video Tech. 7 (1) (1997)] of utmost interest are the extraction and representation of visual objects,
instead of estimating motion field, in this paper we focus on segmenting out visual objects based on spatial and
temporal properties present in two contiguous frames in the same MRF–MAP–MFT framework. To achieve object
segmentation, a ‘‘motion boundary field’’ is introduced which can turn off interactions between different object regions
and in the meantime remove spurious object boundaries. Furthermore, in light of the generally smooth and slow
velocities in-between two contiguous frames, we discover that in the process of calculating matching blocks, assigning
different weights to different locations can result in better object segmentation. Experimental results conducted on both
synthetic and real-world videos demonstrate encouraging performance.
© 2003 Elsevier B.V. All rights reserved.
Keywords: Markov random field; Visual motion; Segmentation; Mean field theory
1. Introduction
In video compression schemes, apart from
taking advantage of spatial redundancies among
adjacent pixels, temporal redundancies between
* Corresponding author. Tel.: +1-212-650-5604; fax: +1-212-
650-6284.
E-mail address: [email protected] (J. Wei).
doi:10.1016/S0167-8655(03)00180-6
neighboring frames are exploited extensively to reduce the bandwidth required to save or transmit
video data. For each macro-block in the target
frame, a macro-block of the same size in the vi-
cinity of the corresponding position of the refer-
ence frame is located based on a certain energy
minimization criterion, e.g. minimal mean square
error (MMSE). A vector indicating the difference
between these two positions is saved in the motion
map as the corresponding motion vector or velocity,
whereas the pixelwise differences between these two
pixels/macro-blocks are stored in the residual map.
The coding result for the target frame thus en-
compasses these two maps. The process of search-
ing for the matching block of the MMSE is denoted by motion estimation and is understandably the
most time-consuming phase within this framework.
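The search loop just described can be sketched as follows; the block size, search radius, and frame layout are illustrative assumptions rather than the parameters of any particular codec:

```python
import numpy as np

def block_match(ref, tgt, top, left, size=8, radius=4):
    """Exhaustively search `ref` around (top, left) for the block that
    best matches the size x size macro-block of `tgt` at (top, left),
    using the minimal mean square error (MMSE) criterion."""
    block = tgt[top:top + size, left:left + size].astype(float)
    best_err, best_vec = np.inf, (0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + size > ref.shape[0] or x + size > ref.shape[1]:
                continue  # candidate block falls outside the reference frame
            cand = ref[y:y + size, x:x + size].astype(float)
            err = np.mean((block - cand) ** 2)  # mean square error
            if err < best_err:
                best_err, best_vec = err, (dy, dx)
    return best_vec, best_err  # motion vector and residual energy
```

For a target frame that is a pure translation of the reference, this routine recovers the shift exactly; the returned vector goes into the motion map and the residual energy feeds the residual map.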
Most existing video compression schemes work
along this track, such as H.263, MPEG-1. There
are basically two avenues to enhance the perfor-
mance of this scheme: one is to make the process of
motion estimation more efficient, denoted by effi-
ciency-oriented schemes, e.g., the three-step algorithm
(Koga et al., 1993), conjugate searching
(Srinivasan and Rao, 1985), and wavelet-based esti-
mation (Zhang and Zafar, 1992). The other is to
generate a motion map which provides lower en-
tropy for both encoding maps, denoted by con-
textual motion schemes. Most methods proposed so
far can be classified into three categories: (1)
Parametric model based techniques are employed to retrieve objects in motion (Wang and Adelson,
1994; Hager and Belhumeur, 1998; Borshukov
et al., 1997), where object shapes are extracted by use
of affine or full-perspective motion models. This
school of motion segmentation has received in-
tensive attention recently among vision resear-
chers due to its solid three-dimensional modeling.
However, the inefficiency and sensitivity to noise have been its major hurdle to being employed widely
in industrial applications. (2) Markov random field
(MRF) is used to model the contextual constraints,
and the maximum a posteriori (MAP) configura-
tion for the MRF corresponding to velocities is
computed to be the sought-after motion field. This
category of techniques is denoted by MRF–MAP
framework and has been extremely successful in achieving motion segmentation (Li, 1995; Zhang
and Hanauer, 1995; Memin and Perez, 1998). This
category of schemes also leaves much to be desired in
efficiency due to the fact that the search for the MAP is
extremely time consuming. (3) Scene mosaic is
created as the representation of a given video (Irani
et al., 1995; Zoghlami et al., 1997). This method
takes a fresh avenue to achieve video compression: efforts are made to find homographies
between overlapping frames in order to generate an
image mosaic with respect to the scene; the resul-
tant scene mosaic, a mere image, is then deemed
as the representation. Extremely impressive com-
pression results have been reported. However, this
scene mosaic idea is only able to achieve the best
results in cases where the scene commanded by a camera is a static background. Furthermore, its
efficacy is heavily dependent on the computed ho-
mographies whose computation is always unreli-
able. Therefore its applicability is greatly limited.
In this paper, we develop a method in
the second category with improved efficiency.
In (Wei and Li, 1999), we developed a two-pass
algorithm (TPA) within the MRF–MAP framework to estimate the motion field with desirable per-
formance. Three novel techniques contribute to
the effectiveness and efficiency achieved by the
TPA:
1. A pre-processing step to partition a target
frame into three groups, namely, unpredictable,
uncertain, and predictable groups, which correspond to macro-blocks which are occluded or
not. Our ensuing estimation process thus needs
only to be performed on uncertain and predict-
able sites.
2. Instead of introducing an explicit line field (LF) to
circumvent the universal interactions among
neighboring sites, a truncation function whose
threshold varies as the optimization proceeds is employed to enforce interactions between
neighboring blocks.
3. The mean absolute difference (MAD) instead of
the generally used mean square error (MSE) is
used as the criterion of energy resemblance/dif-
ference. The MAD criterion is far more resilient
to outliers than the MSE; its efficacy is substan-
tiated by the theory of robust statistics (Rous-
seeuw and Leroy, 1987) and our extensive
experiments in both synthetic and real world
scenes.
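The resilience of the MAD to outliers can be illustrated with a toy comparison (the numbers are invented for illustration): one corrupted pixel dominates the MSE but barely moves the MAD.

```python
import numpy as np

def mse(a, b):
    # mean square error
    return np.mean((a - b) ** 2)

def mad(a, b):
    # mean absolute difference
    return np.mean(np.abs(a - b))

# two identical 64-pixel blocks except for a single outlier pixel
a = np.zeros(64)
b = np.zeros(64)
b[0] = 80.0  # e.g. an occluded or noise-corrupted pixel

# a genuinely different block with small but uniform differences
c = np.full(64, 2.0)

# under the MSE the outlier-corrupted pair looks far worse than the
# uniformly different pair; under the MAD the ranking is reversed
print(mse(a, b), mse(a, c))  # 100.0 4.0
print(mad(a, b), mad(a, c))  # 1.25 2.0
```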
The TPA outperforms other existing MRF–
MAP algorithms in its efficiency without much loss
of estimation precision.
As described in (Woo and Ortega, 1996), aside from achieving a better video or stereo image
compression by providing lower bit rates for
motion maps, contextual motion schemes play
essential roles in ensuing video understanding
tasks due to the fact that they provide important
cues regarding the video's visual contents. In
MPEG-4 (Zhang et al., 1997), a new concept vi-
sual object (VO) emerges as the central theme for representing multimedia applications. Unlike pre-
vious MPEG standardization efforts, where the
fundamental unit upon which the coding process
works is pixels or macro-blocks which bear
no perceptual meaning at all, VO in MPEG-4 is a
semantically meaningful unit a user is allowed to
access and manipulate. The upcoming MPEG-7
purports to provide a content representation standard for visual content-based search in digital
libraries. VO is able not only to facilitate more
effective video compression; more importantly, it
can also find its utility in a broad spectrum of
applications such as video synthesis, virtual reality,
digital libraries (Picard and Pentland, 1996; Wei,
2002), etc. Therefore, VO has an essential role to
play in the development of state-of-the-art techniques; its retrieval and representation have received
more and more attention among academia and
industry (Chang et al., 1997; Pentland et al., 1996;
Flickner et al., 1995).
VO in its own right is a semantically meaningful
concept, or referred to as high-level knowledge in
terms of Marr (1982), which can only be arrived at
by human intelligence, a perceptive process to be more specific (Yarbus, 1967). However, techniques
currently used in video processing and computer
vision belong to the low- or intermediate-level
processing. This is a sensing process. There is a
great conceptual gap between these two processes.
More information or a priori knowledge about the
VO is required to bridge the gap between them. For a still
image, it is impossible to extract VO without cor-
responding prescribed knowledge. Indeed even for
human beings it is virtually impossible to distin-
guish a well-camouflaged fly from the background.
However, for a video, given that contiguous
frames are of similar visual contents, the generally
different velocities of the background and objects
provide strong cues for the extraction of VOs. For
instance, it incurs little difficulty for human beings to find a moving fly no matter how perfect its
camouflage is. In this paper, we attempt to exploit
motion information present in image sequences for
the purpose of extracting VO which is of different
motion from the background. We will work in the
same framework as the TPA, namely, using MRF
to model the contextual constraints and MFT in
search of the MAP configuration; this is referred to as the MRF–MAP–MFT technique in the sequel.
In the TPA, adjacent sites whose interactions
are turned off by a truncation function are likely to
be the boundaries of a VO from the background.
However, the TPA gives no consideration to
interactions between those boundary locations,
which is problematic in arriving at a reasonable
boundary description of VOs. In the seminal paper of Geman and Geman (1984) which addressed
issues regarding image enhancement, a novel
concept LF is introduced to circumvent universal
interactions between spatially adjacent sites. The
LF is a dual field of the original image pixel sites,
i.e., there exists exactly one site in the LF in-
between two adjacent pixels in the original image
along both the horizontal and the vertical directions. The LF is a boolean field: if the intensity
difference of two neighboring sites in the original image
surpasses a given threshold, the corresponding site
on LF is assigned one; zero otherwise. The inter-
actions among neighboring LF sites are embedded
in search of the MAP configuration to provide a
desirable geometrical structure of boundaries of a
spatial region. The introduction of the LF is hailed as a ground-breaking novel technique and its useful-
ness has been substantiated by a wide range of
successful applications in disciplines such as
image/video processing and computer vision (Bar-
nard, 1993; Krishnamachari and Chellappa, 1997;
Konrad and Dubois, 1992). However, as discussed
at length in (Wei and Li, 1999), literal usage of the
LF in the domain of video processing is problematic due to two reasons: (1) detection of object
boundaries solely based on intensity difference is
not appropriate in that intensity discontinuities
generally do not signify object boundaries; (2) the
LF is not well defined for block-based or hierar-
chical estimation which is crucial to enhance the
processing efficiency. Therefore in video process-
ing it is difficult to take advantage of the benefits provided by the LF defined in the spatial domain.
In this paper we propose a novel concept motion
boundary field (MBF) as the analog of the LF in
video domain, which is able to capture both spatial
and temporal contextual constraints in seg-
menting out VOs. In the proposed VO segmenta-
tion algorithm, the MBF is added to enforce the
desirable structure of VO boundaries which can effectively remove spurious boundary sites and
create connected boundaries. One more improve-
ment achieved in this algorithm is the application
of the so-called slow and smooth principle of object
motion as described in the study on cognitive sci-
ence (Weiss and Adelson, 1998). This principle
stipulates that more priorities should be allocated
to sites of near-zero movement and, in the meantime, velocities on adjacent sites should be similar.
The resultant scheme of this principle is an as-
signment of different weights to positions in the
corresponding vicinity of the reference frame. Ex-
perimental results conducted on both synthetic
and real-world videos demonstrate encouraging
performance of the proposed VO segmentation
algorithm.
This paper is organized as follows. The MBF
and slow and smooth principle in the MRF–
MAP–MFT framework are introduced and justi-
fied in the next section. The proposed visual object
segmentation model and algorithm is presented in
Section 3. Section 4 demonstrates the experimental
results of the proposed algorithm for both syn-
thetic and real-world videos. We conclude this paper in Section 5 with more remarks and dis-
cussions.
1 The component of the velocity of a moving edge in the
direction of the edge cannot be determined uniquely.
2. Object segmentation using MRF–MAP–MFT
In this section, first the MRF–MAP–MFT
technique for motion estimation as developed in (Wei and Li, 1999) is outlined. Next the slow and
smooth movement phenomenon as formulated in
(Weiss and Adelson, 1998) is introduced. Finally,
the novel concept of MBF and its potential power
in object segmentation is presented.
2.1. MRF–MAP–MFT technique
In the video compression discipline, efficiency-
oriented schemes are generally employed (Jain and
Jain, 1981). Useful as they are, giving no consid-
eration to contextual constraints existing in the
spatial and temporal domain in a video, they fail
to generate satisfactory results due to the so-called
‘‘aperture problem’’ 1 and noise present in frames.
In order to obtain more refined motion vectors which are able to reflect visual contents of a video,
the MRF has been widely employed to model the
contextual interactions between adjacent sites (Li,
1995). Generally speaking, in the MRF the impact
posed to the value of the random variable on one
site, the pixel in case of images, by those on other
sites is constrained to its close neighbors, where
the ‘‘closeness’’ is determined by the definition of the neighborhood system N_{i,j}. The mathematical
formulation of Markovianity is stipulated as
below:
P(f_{i,j} | f_{S∖(i,j)}) = P(f_{i,j} | f_{N_{i,j}}).   (1)
Eq. (1) indicates the behavior of the random
variable on site (i, j) is only affected by those sites
in N_{i,j}. The maximum a posteriori (MAP) prob-
ability is most commonly used as a statistical cri-
terion for optimality and thus often chosen in
conjunction with the MRF in vision modeling (Li,
1995). The resulting framework, referred to as
MAP–MRF, is obtained.
The MRF lends us a powerful tool in modeling
the local property for images. However, for the
purpose of image processing, the computation
based on the probability density functions (pdfs)
for each site is prohibitively intensive. Due to the
equivalence of the MRF to the Gibbs random field
(GRF) (Geman and Geman, 1984; Hammersley
and Clifford, 1971), the joint probability P(f) of
the configuration f is defined as below:

P(f) = exp[−βU(f)] / Z,   (2)

where Z is a normalizing term called the partition
function:

Z = Σ_f exp[−βU(f)],   (3)

and U(f) is the sum of all possible clique potentials:

U(f) = Σ_{c∈C} V_c(f),   (4)
where a clique c is a set of sites which are neigh-
bors of each other under a neighborhood system
N, and V_c(f) is the clique potential for c. Now it is
evident that P(f) is determined entirely by the
nature of its local characteristics, i.e., the choice of
the neighborhood system and the specific clique po-
tentials which nail down the local interactions
among adjacent sites. Therefore, one can define a
neighborhood system and assign values to the
corresponding clique potentials to reflect one's
prior belief of contextual constraints. The most
probable result obtained under this prior is the MAP,
which is the configuration having the minimal energy
U(f) according to Eq. (2). Thereby a rich arsenal
of optimization methods can be used in search of
the MAP.
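For a toy binary field the MAP can indeed be found by brute force: since P(f) ∝ exp[−βU(f)], the MAP is simply the configuration of minimal energy U(f). The quadratic data term and pairwise potentials below are illustrative choices, not the potentials used later in this paper:

```python
import itertools
import numpy as np

def energy(f, data, lam=1.0):
    """U(f): single-site clique potentials (a data term) plus
    double-site clique potentials on a four-connection grid."""
    h, w = f.shape
    u = float(np.sum((f - data) ** 2))  # single-site cliques
    for i in range(h):
        for j in range(w):
            if i + 1 < h:
                u += lam * abs(f[i, j] - f[i + 1, j])  # vertical pair
            if j + 1 < w:
                u += lam * abs(f[i, j] - f[i, j + 1])  # horizontal pair
    return u

def map_config(data, labels=(0, 1)):
    """Exhaustive MAP search over all label assignments; feasible
    only for tiny fields, which is why the MFT is needed."""
    h, w = data.shape
    best, best_u = None, np.inf
    for vals in itertools.product(labels, repeat=h * w):
        f = np.array(vals).reshape(h, w)
        u = energy(f, data)
        if u < best_u:
            best, best_u = f, u
    return best
```

On a 3 x 3 observation with one flipped pixel, the smoothness potentials outvote the data term and the MAP restores the uniform field.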
There are mainly two categories of techniques
for conducting the optimization process: one is
local methods, the other is global methods (Li,
1995). The common feature shared by all local
methods is that the computation converges rap-
idly, but they are susceptible to getting stuck in
local minima. For global methods, the global
minimum can be guaranteed, however, oftentimes
at extremely high computational cost.
Mean field theory (MFT), originally an optimiza-
tion method developed in statistical mechanics
(Chandler, 1987), is found to be able to provide a
desirable trade-off between performance and effi-
ciency (Zhang, 1992; Wei and Li, 1999). In short,
with the MFT techniques, for one site p, instead of
computing interactions with all its neighboring
particles, one can first compute the mean field
generated by its adjacent sites and then evaluate
the impact exerted on p by this field. Mathemati-
cally the mean field ⟨f_{i,j}⟩ for f_{i,j} is defined as below:

⟨f_{i,j}⟩ = Σ_{f_{i,j}} f_{i,j} P(f).   (5)
The MFT turns the many-body statistical me-
chanics problem into a one-body problem, thus
resulting in reduced computational cost and
relatively desirable performance.
Therefore, the MRF–MAP–MFT technique
consists of three components: (a) use MRF to
model contextual constraints; (b) search for the
MAP as the anticipated result; (c) employ the
MFT technique in the MAP search.
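Schematically, one MFT sweep treats each site as a one-body problem: the local energy of every candidate label is evaluated against the current mean values of the neighbours, and ⟨f_{i,j}⟩ is updated as the expectation under the resulting Gibbs weights, in the spirit of Eq. (5). The labels, potentials, and β below are illustrative assumptions:

```python
import numpy as np

def mft_sweep(mean_f, data, labels=(0.0, 1.0), beta=2.0, lam=1.0):
    """One mean-field sweep over a 2-D field with a four-connection
    neighborhood: neighbours enter only through their mean values."""
    h, w = mean_f.shape
    new = mean_f.copy()
    for i in range(h):
        for j in range(w):
            nbrs = [mean_f[y, x]
                    for y, x in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                    if 0 <= y < h and 0 <= x < w]
            weights = []
            for lab in labels:
                u = (lab - data[i, j]) ** 2                 # data term
                u += lam * sum(abs(lab - m) for m in nbrs)  # smoothness term
                weights.append(np.exp(-beta * u))
            z = sum(weights)                                # local partition function
            new[i, j] = sum(l * w_ for l, w_ in zip(labels, weights)) / z
    return new
```

Iterating the sweep from the observed field, a noisy site surrounded by consistent neighbours is quickly pulled toward the neighbourhood consensus.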
2.2. Slow and smooth motion modeling scheme
As briefed in the first section, with a wealth of
visual cues, video data provides far more informa-
tion, motion especially, than still images which can
be exploited to segment out objects. Visual content
analysis based on spatial properties and motion
have an essential role to play. The behavior of
the human vision system (HVS) has inspired many
seminal theories and techniques in computer vision and image processing, such as active vision (Ballard,
1991; Aloimonos et al., 1988) which purports to
emulate the spatial configuration and eye move-
ments of the HVS to actively involve cameras in the
sensing process, and color image indexing (Swain
and Ballard, 1991; Healey and Slater, 1994; Drew
et al., 1998) which takes advantage of the perceptive
capacity of the HVS to provide drastic data reduction and effective object recognition. To achieve
desirable motion analysis result, it is necessary to
first gain deep insights into the corresponding
workings of the HVS in conducting motion analy-
sis. An excellent study from the perspective of
cognitive science is presented in (Weiss and Adel-
son, 1998), which sheds light on possible avenues
for us to take in order to enhance our motion analysis procedure. It is pointed out that due to the
inherent ambiguity of local motion signals, such as
aperture problem, in the process of motion per-
ception, a vision system must integrate many local
measurements. In the meantime, the system must
segment the local measurements because of the ex-
istence of multiple motions. Accordingly, a Baye-
sian motion perception theory is developed, which can combine different measurements as well as
prior knowledge while taking their degree of cer-
tainty into account. Indeed our MRF–MAP–MFT
technique belongs in the Bayesian strategies.
A prior probability of the Bayesian model de-
veloped in (Weiss and Adelson, 1998) incorporates
two notions: slowness and smoothness. As already
detailed in (Ullman, 1979), the HVS tends to choose the ‘‘shortest path’’ or ‘‘slowest’’ motion
consistent with given video data as the most
reasonable perceptive results, i.e., the normal ve-
locity is more often than not preferred. However,
the bias toward slow velocity is not without prob-
lem: due to the fact that for any given image se-
quence, the slowest velocity field conforming to the
visual data is that in which each point along a curved contour moves in the direction of its nor-
mal, it thus leads to highly non-rigid motion per-
cepts. This phenomenon is demonstrated by a
classical scenario provided in (Hildreth, 1983): when
a perfect circle translates horizontally, as depicted
in Fig. 1, the corresponding slowest velocity field is
highly non-rigid. Therefore, another bias toward
‘‘smooth’’ velocity fields is necessitated, i.e., adjacent locations in an image should have similar ve-
locities. Based on these two preferences, a prior
probability on motion fields can be defined such
that penalties are levied on the magnitude of ve-
locity and on the difference between adjacent motion
vectors, which are used to enforce slow and smooth
velocities, respectively. Indeed the smoothness of
velocities at adjacent locations is handled elegantly by clique potentials in the MRF–MAP–MFT
framework. However, the slow bias is not embed-
ded yet. In the VO segmentation algorithm to be
presented later, by assigning larger weights to slow
motion and smaller weights to rapid movements in
our energy functions, the slowness bias is added
into the MRF–MAP–MFT technique.
2.3. Motion boundary field and its utility in object
segmentation
The smoothness bias in motion perception as
presented in the previous section indicates that
Fig. 1. The illustration of non-rigid velocity fields in the case where
only the ‘‘slowness’’ bias is applied. Left: a horizontally moving
circle; right: the slowest velocity field consistent with the visual
data.
motion velocities at adjacent locations are similar.
However, the smoothness prior is violated when
adjacent pixels belong to regions of different ob-
jects. If this bias is enforced universally, the over-
smoothness artifact is then present. In view of the
problems of the LF in video processing, in (Wei
and Li, 1999) a truncation function g(·, ·) as de-
fined below is introduced to allow for discontinu-
ities between adjacent motion velocities. Suppose
c_d is a threshold whose value can change dynam-
ically in the MFT process, and d₁ and d₂ are the
motion vectors of two adjacent sites:

g(d₁, d₂) = ‖d₁ − d₂‖  if ‖d₁ − d₂‖ ≤ c_d,
g(d₁, d₂) = g  (a constant, g < c_d)  otherwise,   (6)
where c_d is max{8e^{−i/8}, 4} and i is the number of the
current iteration. The rationale behind this defi-
nition is: if the magnitude of the difference of the
motion vectors on two adjacent sites is too large,
the two sites are not considered to be in the same
object, and a small penalty is levied for the ap-
pearance of this discontinuity.
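A minimal transcription of Eq. (6) and its threshold schedule; the value chosen for the constant penalty g is an assumption (any value below c_d fits the definition):

```python
import numpy as np

def c_d(i):
    """Dynamic threshold of Eq. (6): starts near 8 and decays
    toward a floor of 4 as the iteration count i grows."""
    return max(8 * np.exp(-i / 8), 4)

def g(d1, d2, i, penalty=2.0):
    """Truncated difference between two adjacent motion vectors.
    `penalty` plays the role of the constant g < c_d charged once
    the velocity difference exceeds the threshold (its exact value
    is an illustrative assumption)."""
    diff = np.linalg.norm(np.asarray(d1, float) - np.asarray(d2, float))
    return diff if diff <= c_d(i) else penalty
```

Early on (i = 0, c_d = 8) a moderate velocity difference is still smoothed; once c_d has decayed to its floor, the same pair is declared discontinuous and draws only the small constant penalty.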
As derived and substantiated by our experi-
ments (Wei and Li, 1999), the benefit provided
by the introduction of the preceding truncation
function is computational economy, due to the fact that the decision as to the motion dis-
continuity is a mere check on the difference of
adjacent velocities. It is one of the major reasons
for the efficiency achieved by the TPA. However,
the emergence of motion discontinuity itself carries
an especially important information about visual
contents of a video: it indicates the potential lo-
cation of the boundaries of VOs. If these are indeed
boundary locations of a certain VO, then a certain
structure such as connectedness has to be satisfied.
For instance, isolated boundary locations are
more often than not spurious and should be re-
moved. Therefore, it is appropriate to employ
another MRF to reflect these boundary locations
within which the discontinuity of boundary sites
should be penalized. With the aforementioned
feature, we refer to this new MRF as the MBF. The
similarity of this new field to the LF is:
• like the LF, as depicted in Fig. 2, the sites of the
MBF for an image are also duals of the original
Fig. 2. Sites of the MBF B (cross) are the duals of the original
pixel positions (dot). A site in B between two pixels indicated by
(i, j) and (i′, j′) is denoted by b_{(i,j),(i′,j′)}.
Fig. 3. Six-connectedness for MBF sites. (a) and (b): neighbors
for an MBF site between two horizontally and vertically adja-
cent image pixels, respectively.
image pixels since they indicate the relationship
between adjacent pixels;
• the values given to the MBF are also boolean: 1 is
assigned in the presence of a VO boundary, 0
otherwise.
Unlike the LF, where site values are determined by intensities, whether or not a
boundary location is present is determined by the
velocities on adjacent pixels as well as the behavior
of adjacent MBF sites. The prior belief of the MBF
under our Bayesian scheme is:
1. connectedness in terms of the six-connection neighbor-
hood system as depicted in Fig. 3;
2. spatial smoothness due to the fact that the sites form
the possible contour of a VO.
Therefore penalties are levied on violations of
these two prior beliefs, such as discontinuities and
non-smoothness of the MBF. These prior beliefs
can be materialized by assigning different clique
potentials to various structures of the MBF. In
Fig. 4, one clique potential assigning scheme for
Fig. 3(a) is illustrated. The neighborhood for Fig. 3(b)
can be treated in like manner: through a 90° rotation
of Fig. 4. As can be seen, the isolated MBF site (a)
is harshly penalized by giving a large potential,
whereas the smoothest possible boundary, a line as in
(b), is encouraged by assigning the smallest po-
tential magnitude. Given that estimates of motion
vectors are determined by the similarity of two
macro-blocks or pixels together with the bias
of slowness, the MBF is indeed determined by the combination of spatial as well as temporal
characteristics of the video. Furthermore, the as-
signment of clique potentials ensures desired geo-
metrical behavior of the sought-after visual object
boundary.
With the introduction of a boolean MBF, the
smoothness bias is turned off for velocities on
adjacent pixels should the MBF site between them take the value 1. Another important issue
which has to be addressed is the paradox raised
by the MBF, namely the interdependencies between ve-
locities and MBF assignments: the assignment to
a site in the MBF relies on the velocities on the
two image pixels, while in estimating motion
vectors the enforcement of smoothness on these
two velocities is decided by the MBF assignments. To circumvent this problem, we go with
the same strategy as the aforementioned trunca-
tion function: initially fewer 1's are assigned to
the MBF sites in order to allow for more content-
based interactions among adjacent pixels/macro-
sites. As processing goes on, the threshold to
declare a motion boundary is increased to cir-
cumvent the over-smoothness artifacts for the final estimates.
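One half of the alternating scheme, re-estimating horizontal MBF sites from the current velocity field, can be sketched as below; the field shape and the decaying declaration threshold (patterned after the c_d schedule of Eq. (6)) are assumptions:

```python
import numpy as np

def update_mbf_horizontal(d, i, floor=4.0):
    """Set b = 1 between horizontally adjacent pixels whose current
    velocities (d has shape h x w x 2) differ by more than a
    threshold that decays with the iteration count i, so that fewer
    boundary sites are declared in the early iterations."""
    thresh = max(8 * np.exp(-i / 8), floor)
    diff = np.linalg.norm(d[:, 1:] - d[:, :-1], axis=2)  # per-pair velocity gap
    return (diff > thresh).astype(int)
```

With one half of a small field moving at (6, 0) and the other half static, the seam is below threshold at i = 0 but is declared a boundary once the threshold reaches its floor.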
Fig. 4. A clique potential assigning scheme for the MBF, where violations of connectedness and smoothness are penalized. Special care is
taken for three-site clique potential assignments due to their different smoothness: collinear three-site cliques are given the smallest
penalties V_3s; V_3c refers to those three-site boundary cliques close to a line, i.e., 6 0 2, 5 0 1, 5 0 3, 2 0 4; V_3v corresponds to the V-shaped three-site cliques: 1 0 6, 6 0 5, 4 0 3, 3 0 2, 1 0 2; we make no difference among 4-, 5-, 6-, and 7-site clique potentials.
3. Proposed visual object segmentation algorithm
In this section the new visual object segmenta-
tion algorithm is stipulated based on the principles
presented in the preceding section. First we shall
formulate MRF models used by our VO segmen-
tation algorithm, then the algorithm of VO seg-
mentation itself is given.
3.1. MRF model of the VO segmentation algorithm
In video processing, there are two regions pre-
sent in a target frame: one comprises those whose corre-
spondences can be found in the reference frame,
while the other comprises those which cannot find their
correspondences due to occlusion or newly present
scene. The former is referred to as predictable re-
gion S_predictable, and the latter is denoted by the unpre-
dictable region S_unpredictable. For motion estimation,
two MRFs are needed: one is the motion field D,
which assigns one velocity for each pixel or macro-
block; the other is the unpredictable field O, indi-
cating the presence of occlusion or newly present
scenes.
In the TPA, to effect an efficient partition, a
‘‘double threshold’’ preprocessing step is carried
out: for one site or macro-block M_t in a target
frame, use a block matching algorithm to locate a
site/macro-block M_r in the reference frame with
the error energy e which minimizes the MSE. Based on
the magnitude of e, three fields are assigned:

1. If e is greater than a prescribed threshold δ_1 of
larger magnitude, M_t is deemed as belonging to
S_unpredictable.
2. If e is smaller than another threshold δ_2, M_t is
labeled as S_predictable.
3. Otherwise, M_t is categorized as an instance of a
third field referred to as uncertain, S_uncertain, whose
final assignment into predictable or unpredict-
able can only be obtained after the MFT opti-
mization process.
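The double-threshold labelling can be sketched as below; the numeric thresholds stand in for δ_1 and δ_2 and are illustrative:

```python
import numpy as np

UNPREDICTABLE, UNCERTAIN, PREDICTABLE = 0, 1, 2

def classify_block(e, d1=100.0, d2=10.0):
    """Label one macro-block from its minimal matching error e,
    with d1 > d2 playing the roles of the two thresholds."""
    if e > d1:
        return UNPREDICTABLE  # no plausible correspondence (occlusion/new scene)
    if e < d2:
        return PREDICTABLE    # confident correspondence
    return UNCERTAIN          # deferred to the MFT optimization

def partition(errors):
    """Label every macro-block of a frame given its error map."""
    return np.vectorize(classify_block)(errors)
```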
The previous partition results in the following
processing schemes:
1. sites in S_unpredictable are excluded from the ensuing MFT
optimization procedure since no meaningful es-
timates for D or O can be achieved;
2. the random variable corresponding to O is not
defined on sites belonging to S_predictable;
3. only on S_uncertain are both random variables, d for D
and o for O, defined.
Following other vision researchers' lead in
MRF modeling (Chellappa and Jain, 1993; Li,
1995; Besag, 1986), the neighborhood systems N
for both D and O are chosen to be four-
connection or first-order (Sanz, 1996). Two types
of cliques are thus present: single-site and double-site.
No difference is made for horizontal or vertical
double-site clique potentials in our implementa-
tion.
Our MRF based VO segmentation algorithm
therefore consists of three MRFs:
1. The vector motion field D: for each image pixel
(i, j), d(i, j) is an integer vector indicating the
corresponding motion vector. The use of bold
font is to emphasize the fact that d(i, j) is a vec-
tor.
2. The boolean boundary field B: for each MBF
site between two pixels (i, j) and (i′, j′), b_{(i,j),(i′,j′)}
signifies whether a boundary is present between
(i, j) and (i′, j′).
3. The boolean unpredictable field O: for each
pixel (i, j), o(i, j) indicates whether (i, j) is unpredict-
able based on currently available knowledge.
Energy functions corresponding to these three
MRFs are given below.
(1) Energy function of D:

U_{i,j}(d(i, j)) = (1 − o(i, j)) U_{d(i,j)}
    + λ_d (1 − o(i, j)) Σ_{(i′,j′)∈N(i,j)} (1 − o(i′, j′)) (1 − b_{(i,j),(i′,j′)}) ‖d(i, j) − ⟨d(i′, j′)⟩‖,   (7)
On the RHS, the first term is the single-site
clique potential, which measures the pixelwise or
blockwise difference between the pixel/block in the
target frame and the one in the reference frame
with a motion vector d(i, j):

U_{d(i,j)} = |f^t_{i,j} − f^r_{i1,j1}|,   (8)

where (i1, j1) is the position (i, j) + d(i, j), and the
reason to choose the MAD instead of the mean
square energy as a measure of the similarity be-
tween two sites/blocks is that the former is far
more resilient to outliers than the latter, hence
more robust results can be effected (Hampel et al.,
1986). The second term on the RHS of Eq. (7)
corresponds to the smoothness of the velocity,
which is enforced only when the following three
conditions hold:
1. the current pixel/block does not belong to the
unpredictable field ðoði; jÞ ¼ 0Þ; and2. the neighbor is not an unpredictable site
ðoði0; j0Þ ¼ 0Þ; and finally3. the two adjacent sites belong to the same VO
ðbði;jÞ;ði0 ;j0Þ ¼ 0Þ.kd is a parameter which is used to adjust the
impact of smoothness: a larger magnitude in-
duces smoother velocity, and vice versa.
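As a concrete illustration, the single-site MAD potential of Eqs. (7) and (8) can be sketched as below. The block size, array layout, and function name are assumptions of this sketch, not details from the paper's implementation:

```python
import numpy as np

def mad_potential(target, ref, i, j, d, block=4):
    """Single-site clique potential U_d(i,j) of Eqs. (7)/(8): the
    mean absolute difference (MAD) between the block at (i, j) in
    the target frame and the block displaced by d in the reference
    frame.  MAD is used rather than mean squared error because it
    is more resilient to outliers."""
    i1, j1 = i + d[0], j + d[1]
    t = target[i:i + block, j:j + block].astype(float)
    r = ref[i1:i1 + block, j1:j1 + block].astype(float)
    return np.mean(np.abs(t - r))
```

A candidate displacement that aligns the two blocks drives this potential to zero, which is exactly the behavior the first term of Eq. (7) rewards.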
The MFT optimization process can be performed using Eqs. (7), (5) and (2), which is the actual MAP search employed in (Wei and Li, 1999; Zhang, 1992). However, the corresponding prior belief for velocities under this scheme is that they are uniform, which is not in concert with the slowness of human percepts. To enforce the principle of slow velocity as a reasonable prior belief in our Bayesian approach, higher priority should be given to slow velocities. Therefore the following mean field computation is employed, by plugging Eq. (2) into Eq. (5) with a weight-assigning scheme:
$$\langle \vec d(i,j) \rangle = \frac{1}{Z} \sum_{\vec d(i,j)} \vec d(i,j)\, \exp\big[-\beta\, w(i,j)\, U(\vec d(i,j))\big], \quad (9)$$

where w(i, j) is a function I^2 \to I^+ whose value is in inverse proportion to the eccentricity of |\vec d(i,j)| from 0. For instance, it may be defined as:

$$w(i,j) = \exp\big(-\big|\vec d(i,j) \cdot \vec c\,\big|\big), \quad (10)$$
3134 J. Wei, I. Gertner / Pattern Recognition Letters 24 (2003) 3125–3139
where the constant vector \vec c is taken to be (1, 1)^T, so that |\vec d(i,j) \cdot \vec c| is simply the sum of the absolute values of the components of \vec d(i,j). The partition function Z of Eq. (9) is accordingly formulated as:

$$Z = \sum_{\vec d(i,j)} \exp\big[-\beta\, w(i,j)\, U(\vec d(i,j))\big]. \quad (11)$$
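Mechanically, Eqs. (9)-(11) amount to a Boltzmann-weighted average over a finite set of candidate displacements, with each candidate's energy scaled by the weight of Eq. (10). The finite candidate set and per-candidate energy vector are assumptions of this sketch:

```python
import numpy as np

def mean_field_velocity(energies, displacements, beta=1.0, c=(1.0, 1.0)):
    """Mean-field estimate <d(i,j)> of Eqs. (9)-(11): a weighted
    average over candidate displacements, where each candidate's
    energy U(d_k) (energies[k]) is scaled by the weight w of
    Eq. (10) before the Gibbs exponentiation."""
    d = np.asarray(displacements, dtype=float)   # shape (K, 2)
    u = np.asarray(energies, dtype=float)        # shape (K,)
    w = np.exp(-np.abs(d) @ np.asarray(c))       # Eq. (10): sum of |components|
    logits = -beta * w * u
    logits -= logits.max()                       # numerical stability
    p = np.exp(logits)
    p /= p.sum()                                 # divide by partition function Z, Eq. (11)
    return p @ d                                 # expected displacement vector
```

With all energies equal, the estimate reduces to the plain average of the candidates; a candidate with markedly lower energy dominates the expectation.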
(2) Energy function for O:

The energy function for the unpredictable field O is given below:

$$U_{i,j}(o(i,j)) = o(i,j)\,\big[C_o - \lambda_p U_{\langle \vec d(i,j) \rangle}\big] + \lambda_q \sum_{(i',j') \in N(i,j)} (1 - b_{(i,j),(i',j')})\,\big|o(i,j) - \langle o(i',j') \rangle\big|. \quad (12)$$

The second term on the RHS of Eq. (12) is meant to handle the smoothness of the unpredictable field.
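For a boolean site variable such as o(i, j), plugging the two state energies into the Gibbs form of the mean-field update (Eq. (5), given earlier in the paper) reduces to a logistic function of the energy gap. This is a generic sketch of that reduction; the function name is an illustrative assumption:

```python
import math

def mean_field_binary(u0, u1, beta=1.0):
    """Mean-field expectation of a boolean site variable such as
    o(i, j): with two states, <o> = exp(-beta*u1) /
    (exp(-beta*u0) + exp(-beta*u1)), i.e. a logistic function of
    the energy gap u0 - u1 between the states o = 0 and o = 1."""
    return 1.0 / (1.0 + math.exp(-beta * (u0 - u1)))
```

Equal energies give <o> = 0.5, and a much lower energy for one state pushes the expectation toward that state, which is how the energies of Eq. (12) would drive the mean field of O.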
(3) Energy function for B:

$$U_{(i,j),(i',j')}(b_{(i,j),(i',j')}) = (1 - b_{(i,j),(i',j')})\,(1 - o(i,j))\,(1 - o(i',j'))\, g\big(\vec d(i,j), \vec d(i',j')\big) + \lambda_b \sum_c V_c, \quad (13)$$

where g(\cdot, \cdot) is the truncation function, \lambda_b is a control parameter regarding the smoothness of the MBF, and the V_c's are all possible clique potentials depicted in Fig. 4.
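The truncation function g is defined earlier in the paper; a common form, shown here purely as an illustrative assumption, clips the velocity difference between the two adjacent sites so that a single large mismatch cannot dominate the boundary energy:

```python
def truncated_difference(d1, d2, ceiling=4.0):
    """Hedged sketch of a truncation function g(., .) as used in
    Eq. (13): the L1 difference between two neighboring motion
    vectors, clipped at `ceiling`.  Both the exact form of g and
    the ceiling value are assumptions of this sketch."""
    diff = abs(d1[0] - d2[0]) + abs(d1[1] - d2[1])
    return min(diff, ceiling)
```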
3.2. The VO segmentation algorithm
Based on the MRF models formulated previously, we are ready to present the VO segmentation algorithm, which searches for the joint fixed point of D, O, and B along the same lines as the TPA.
1. Carry out the ‘‘double thresholding’’ step to partition sites in the target frame into S_predictable, S_uncertain, and S_unpredictable.

2. (a) Use Eqs. (7) and (9) to compute the mean field of D.
   (b) Use Eqs. (12) and (5) to compute the mean field of O.
   (c) Use Eqs. (13) and (5) to compute the mean field of B.

3. Compute the normalized difference e_k of the three mean fields for iteration k:

$$e_k = \Big[\big\|\langle \vec d^{(k)} \rangle - \langle \vec d^{(k-1)} \rangle\big\|^2 + \big\|O^{(k)} - O^{(k-1)}\big\|^2 + \big\|B^{(k)} - B^{(k-1)}\big\|^2\Big]^{1/2} \Big/ \|S\|. \quad (14)$$

   If e_k is smaller than a prescribed threshold \epsilon_1, or de_k = e_k - e_{k-1} is smaller than another given threshold \epsilon_2 (< \epsilon_1), exit; otherwise go to step 2(a).

4. Carry out the component labeling algorithm (Jain et al., 1995) to remove linked MBFs whose site count is less than 5.
In the preceding algorithm, a joint fixed point for the three involved MRFs is deemed to have been reached when e_k, or the change de_k of e_k between two contiguous iterations, is sufficiently small.
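Steps 3 and 4 of the algorithm can be sketched as follows: the convergence measure of Eq. (14) over the three stacked mean fields, and a 4-connected component labeling pass that drops boundary fragments smaller than five sites. The function names and the stack-based labeling are assumptions of this sketch, not the paper's implementation:

```python
import numpy as np

def normalized_change(prev, curr, n_sites):
    """Convergence measure e_k of Eq. (14): the L2 norm of the
    change in the stacked mean fields (<d>, O, B) between two
    iterations, normalized by the number of sites."""
    diff2 = sum(np.sum((c - p) ** 2) for p, c in zip(prev, curr))
    return np.sqrt(diff2) / n_sites

def remove_small_components(mask, min_size=5):
    """Step 4: label 4-connected components of the boolean MBF
    mask and keep only those with at least `min_size` sites,
    removing spurious boundary fragments."""
    mask = mask.astype(bool)
    out = np.zeros_like(mask)
    seen = np.zeros_like(mask)
    rows, cols = mask.shape
    for si in range(rows):
        for sj in range(cols):
            if mask[si, sj] and not seen[si, sj]:
                stack, comp = [(si, sj)], []
                seen[si, sj] = True
                while stack:                     # flood fill one component
                    i, j = stack.pop()
                    comp.append((i, j))
                    for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                        if 0 <= ni < rows and 0 <= nj < cols \
                                and mask[ni, nj] and not seen[ni, nj]:
                            seen[ni, nj] = True
                            stack.append((ni, nj))
                if len(comp) >= min_size:        # keep only large components
                    for i, j in comp:
                        out[i, j] = True
    return out
```

In practice the outer loop would alternate the three mean-field updates, evaluate `normalized_change` against the thresholds \epsilon_1 and \epsilon_2, and apply `remove_small_components` once at exit.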
4. Implementation details and experimental results
In this section, we report experimental results obtained with the algorithm developed in the preceding section. The control parameters appearing in the formulas of the segmentation algorithm are given in Table 1. These values are mostly obtained by trial and error; Monte Carlo studies are still needed to arrive at optimal parameters. Excellent discussions on this topic can be found in (Celeux et al., 2003).
All frames of our experimental videos are 256-level gray-scale images. Two sets of results will be presented: one on synthetic image sequences, where the reference and target frames are generated with known object shapes and velocities; the other on videos downloaded from the public domain. Our algorithm works equally well on pixels or macro-blocks; in this section all results are based on 4 × 4 macro-blocks.
4.1. Synthetic image sequences
In this section, experiments conducted on two synthetic image sequences are first reported: one contains a single moving object, while the other has two moving objects. We then present statistical studies on more synthetic videos.
Table 1
Values chosen for each parameter in our implementation

Parameter    Chosen value         Parameter    Chosen value   Parameter   Chosen value
\beta        1.0                  \lambda_q    6              \lambda_d   12.8
\lambda_b    4.5                  c_{p1}       40             c_{p2}      10
c_d          max{8e^{-i/8}, 4}    g            c_d/2          C_o         16
\epsilon_1   0.02                 \epsilon_2   0.01
Two 256 × 256 synthetic images are obtained in the following way: an image I is generated as the realization of an independent and identically distributed Gaussian random process with mean \mu = 128 and standard deviation \sigma = 10. Then a 64 × 64 square in the center of I is moved with motion vector (2, 3). The disoccluded area is filled with samples from the same Gaussian process that generated I. The resulting image is referred to as I'. Finally, white Gaussian noise with \sigma = 1 is added to both I and I'. They are viewed as the reference and target frame, respectively, in the object segmentation process. These two images and the resulting MBF are shown in Fig. 5.
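The generation procedure just described can be sketched directly; the function name, seed handling, and default arguments are assumptions of this sketch:

```python
import numpy as np

def make_synthetic_pair(size=256, square=64, motion=(2, 3),
                        mu=128.0, sigma=10.0, noise_sigma=1.0, seed=0):
    """Generate the synthetic reference/target pair of Section 4.1:
    an i.i.d. Gaussian image I (mean mu, std sigma), a central
    square moved by `motion`, the disoccluded area refilled from
    the same Gaussian process, and white Gaussian noise (std
    noise_sigma) added to both frames."""
    rng = np.random.default_rng(seed)
    ref = rng.normal(mu, sigma, (size, size))
    top = (size - square) // 2
    patch = ref[top:top + square, top:top + square].copy()
    tgt = ref.copy()
    # refill the vacated (disoccluded) area from the same process
    tgt[top:top + square, top:top + square] = rng.normal(mu, sigma, (square, square))
    # paste the square at its displaced position
    di, dj = motion
    tgt[top + di:top + di + square, top + dj:top + dj + square] = patch
    ref += rng.normal(0.0, noise_sigma, ref.shape)
    tgt += rng.normal(0.0, noise_sigma, tgt.shape)
    return ref, tgt
```

Because only the square and the disoccluded strip differ between the two frames, the correct MBF for this pair is the displaced square's outline.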
A second synthetic image sequence consists of two objects, generated using the same Gaussian process and patching scheme as the single-object case. Two 64 × 64 squares around the center move with velocities (0, 3) and (0, −3), respectively, and again white noise with \sigma = 1 is added to both images. The reference and target frames, together with the corresponding MBF image, are illustrated in Fig. 5. We arrived at these two estimation results within 15.67 and 15.71 s, respectively. Our program is written in C++ under RedHat Linux 7.3; the computer used is a Pentium III at 850 MHz with 130 MB of RAM, and we used the clock (real) time returned by the system call time for our timing results.

Fig. 5. Synthetic image sequences and their corresponding MBF images. Row 1: one moving object; Row 2: two adjacent moving objects. Column 1: reference frame; Column 2: target frame; Column 3: MBF image.
It can be observed from Fig. 5 that the MBFs estimated from these two synthetic sequences are of satisfactory quality.
In order to gain more insight into the performance of our algorithm, a statistical study is conducted on more synthetic images. We generate 25 synthetic sequences with two moving objects in the same manner as row 2 of Fig. 5, using 25 different standard deviations ranging from 5 to 18 with interval 0.5. In Fig. 6, the number of errors and the corresponding computation times are shown. It can be observed that when the objects have reasonable texture, e.g., \sigma > 6, the estimated boundaries are of high quality; namely, the accuracy ratio is well above 90% (1 − 10/136), since there are fewer than 10 errors. Indeed, when \sigma > 12 it is extremely unlikely in our experiments that there is more than one error. Moreover, the processing time is relatively stable at less than 20 s on our computer which, as aforementioned, has no deep computational power at all. However, when the texture is too weak, e.g., \sigma < 6, the aperture problem dominates the estimation and the performance drops significantly (cf. the sharp increase in errors when \sigma < 6). In cases where \sigma < 5 the estimates are too unstable, and it is thus statistically meaningless to include them in the diagram.
Fig. 6. Statistical results for 25 synthetic sequences with two moving objects. Horizontal axes for both diagrams indicate the std's used to generate the frames. Vertical axis for Column 1: number of MBF errors; Column 2: processing time measured in seconds.
4.2. Real-world videos
Extensive experiments have been conducted on public videos.² In Fig. 7, four sets of experimental videos and their MBF images are depicted. For sequence Salesman, the estimated MBFs essentially detect three regions of apparent motion: the face and the two arms. In sequence Table tennis, the ball, the player's two hands, and the plate are reflected well by the MBFs. The MBFs for sequence Claire largely reveal the talking face; the additional boundaries present in the middle of her face are due to the relatively larger motion of the region around her nose, and, due to the smoothness of the region between her mouth and her chin, her chin tip and the rest of her face are separated. As for the MBFs for Hall monitor, the upper boundary evidently corresponds to the upper part of the person's torso; the internal boundary is caused by the motion of his left arm, since disocclusion occurs between his left arm and his torso, whereas the lower half of his left leg, which has considerable motion, is covered by the other group of MBFs.
5. Conclusion
In this paper, in light of the slow and smooth prior beliefs of human motion percepts, we develop a new visual object segmentation algorithm using the MRF–MAP–MFT framework: the MRF is used to model the contextual constraints of video data, and the MAP configuration, sought using the MFT, is taken as the desirable configuration. Instead of the LF originally proposed in the discipline of image restoration to circumvent over-smoothing artifacts, we formulate a novel MRF, the MBF, to handle the appropriate smoothness and discontinuity of velocities among adjacent pixels/macro-blocks. The prior beliefs about a VO boundary, i.e., continuity and smoothness, can be readily embedded in the clique potentials upon which the MFT optimization is conducted. Experimental results on both synthetic and real-world scenes have demonstrated encouraging performance.

² Videos used in this article are downloaded from http://www.ipl.rpi.edu/resource/sequences/index.html, thanks to the Center for Image Processing Research of Rensselaer Polytechnic Institute.
The visual object boundary extracted by the proposed algorithm is exactly the contour of an alpha-plane as presented in the MPEG-4 proposals (Zhang et al., 1997). It thus achieves the valuable task of object retrieval. Automatic visual object retrieval from raw video data is of vital importance in state-of-the-art research on video processing and digital libraries, due to its ability to provide a semantically meaningful unit, the visual object, for efficient coding, indexing, and manipulation in a broad spectrum of applications such as video compression, content-based video indexing and recognition, virtual reality, and video synthesis. It is thus not a surprise that the concept of the visual object is of utmost importance in the ongoing MPEG standardization effort.
The proposed visual object segmentation algorithm is not without problems: it can only extract moving objects whose texture is distinguishable from the background. This phenomenon is demonstrated by Fig. 6, where our algorithm breaks down when the generating \sigma is less than 6. More refined methods are still needed to alleviate this aperture problem. Possible avenues are to embed into our energy formulation more a priori knowledge about the scene and visual objects, in addition to general knowledge such as the slow and smooth principles. For instance, a bias toward certain specific textures or spatial statistical structures can be effectively incorporated into our MRF model.

Fig. 7. Experimental results conducted on publicly available real-world videos. The first two columns: the two frames our algorithm is applied to; the third column: the estimated MBF. The dimensions and running time of our algorithm for each sequence are also given after the sequence names.
Acknowledgements
This work is sponsored by the Office of Naval Research under contract no. N000140210122.
References
Aloimonos, J., Bandopadhay, A., Weiss, I., 1988. Active vision.
Internat. J. Comput. Vision 1 (4), 333–356.
Ballard, D.H., 1991. Animate vision. Artificial Intell. 48, 57–86.
Barnard, S., 1993. Stereo matching. In: Chellappa, R., Jain,
A.K. (Eds.), Markov Random Fields. Academic Press,
pp. 245–271.
Besag, J., 1986. On the statistical analysis of dirty pictures (with
discussions). J. Roy. Statist. Soc. Ser. B 48, 259–302.
Borshukov, G.D., Bozdagi, G., Altunbasak, Y., Tekalp, A.M.,
1997. Motion segmentation by multi-stage affine classifica-
tion. IEEE Trans. Image Process. 6 (11), 1581–1594.
Celeux, G., Forbes, F., Peyrard, N., 2003. EM procedures using
mean field-like approximations for Markov model-based
image segmentation. Pattern Recognition 36 (1), 131–144.
Chandler, D., 1987. Introduction to Modern Statistical Me-
chanics. Oxford University Press.
Chang, S.-F. et al., 1997. VideoQ: An automated content-
based video search system using visual cues. In: Proc. ACM
Multimedia.
Chellappa, R., Jain, A.K., 1993. Markov Random Fields.
Academic Press.
Drew, M.S., Wei, J., Li, Z.N., 1998. On illumination invariance
in color object recognition. Pattern Recognition 31 (8),
1077–1087.
Flickner, M. et al., 1995. Query by image and video content:
The QBIC system. Proc. IEEE 83 (9), 23–31.
Geman, S., Geman, D., 1984. Stochastic relaxation, Gibbs
distributions, and the Bayesian restoration of images. IEEE
Trans. Pattern Anal. Machine Intell. 6 (6), 721–741.
Hager, G.D., Belhumeur, P.N., 1998. Efficient region tracking
with parametric models of geometry and illumination. IEEE
Trans. Pattern Anal. Machine Intell. 20 (10), 1025–1039.
Hammersley, J.M., Clifford, P., 1971. Markov fields on finite
graphs and lattices. Unpublished.
Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., Stahel, W.A.,
1986. Robust Statistics––The Approach Based on Influence
Functions. John Wiley and Sons.
Healey, G., Slater, D., 1994. Global color constancy: Recog-
nition of objects by use of illumination-invariant properties
of color distribution. J. Opt. Soc. Am. A 11 (11), 3003–3010.
Hildreth, E.C., 1983. The Measurement of Visual Motion. MIT
Press.
Irani, M., Hsu, S., Anandan, P., 1995. Video compression
using mosaic representations. Signal Process.: Image Com-
mun. 7 (4), Special issue on coding techniques for low bit-
rate video.
Jain, J., Jain, A., 1981. Displacement measurement and its
applications in interframe image coding. IEEE Trans.
Commun. 29 (12), 1799–1808.
Jain, R., Kasturi, R., Schunck, B.G., 1995. Machine Vision.
McGraw-Hill.
Koga, T. et al., 1993. Motion compensated interframe coding
for video conferencing. In: Proc. Nat. Telecommun. Conf.
pp. 5.3.1–5.3.5.
Konrad, J., Dubois, E., 1992. Bayesian estimation of motion
vector fields. IEEE Trans. Pattern Anal. Machine Intell. 14,
910–927.
Krishnamachari, S., Chellappa, R., 1997. Multiresolution
Gauss–Markov random field models for texture segmenta-
tion. IEEE Trans. Image Process. 6 (2), 251–267.
Li, S.Z., 1995. Markov Random Field Modeling in Computer
Vision. Springer-Verlag.
Marr, D., 1982. Vision. W.H. Freeman.
Memin, E., Perez, P., 1998. Optical flow estimation and object-
based segmentation with robust techniques. IEEE Trans.
Image Process. 7 (5), 703–719.
Pentland, A., Picard, R.W., Sclaroff, S., 1996. Photobook:
Content-based manipulation of image databases. Internat.
J. Comput. Vision 18 (3), 233–254.
Picard, R.W., Pentland, A.P. (Eds.), 1996. Special issue on
digital library. IEEE Trans. Pattern Anal. Machine Intell. 18 (8).
Rousseeuw, P.J., Leroy, A.M., 1987. Robust Regression and
Outlier Detection. Wiley.
Sanz, J.L.C. (Ed.), 1996. Image Technology. Springer-Verlag.
Srinivasan, R., Rao, K.R., 1985. Predictive coding based on
efficient motion estimation. IEEE Trans. Commun. 33, 888–
896.
Swain, M.J., Ballard, D.H., 1991. Color indexing. Internat. J.
Comput. Vision 7 (1), 11–32.
Ullman, S., 1979. The Interpretation of Visual Motion. The
MIT Press.
Wang, J.Y.A., Adelson, E.H., 1994. Representing moving
images with layers. IEEE Trans. Image Process. 3 (5),
625–638.
Wei, J., 2002. Color object indexing and recognition in
digital libraries. IEEE Trans. Image Process. 11 (8), 912–
922.
Wei, J., Li, Z.N., 1999. An efficient two-pass MAP–MRF
algorithm for motion estimation based on mean field theory.
IEEE Trans. Circ. Sys. Video Tech. 9 (6), 960–972.
Weiss, Y., Adelson, E.H., 1998. Slow and smooth: A Bayesian
theory for the combination of local motion signals in human
vision. Technical Report 1624, MIT AI Lab.
Woo, W., Ortega, A., 1996. Stereo image compression with
disparity compensation using the MRF model. In: Proc.
SPIE Visual Communication and Image Processing
(VCIP'96), vol. 2727, pp. 28–41.
Yarbus, A.L., 1967. Eye Movements and Vision. Plenum, New
York.
Zhang, J., 1992. Mean field theory in EM procedures
for MRF's. IEEE Trans. Signal Process. 40 (10), 2570–
2583.
Zhang, J., Hanauer, G.G., 1995. The application of mean field
theory to image motion estimation. IEEE Trans. Image
Process. 4 (1), 19–32.
Zhang, Y.-Q., Zafar, S., 1992. Motion-compensated wavelet
transform coding for color video compression. IEEE Trans.
Circ. Sys. Video Tech. 2 (3), 285–296.
Zhang, Y.-Q. et al. (Ed.), 1997. Special issues on MPEG-4.
IEEE Trans. Circ. Sys. Video Tech. 7(1).
Zoghlami, I., Faugeras, O., Deriche, R., 1997. Using geometric
corners to build a 2d mosaic from a set of images. In: Proc.
IEEE CVPR'97, pp. 420–425.