Optik 124 (2013) 6081–6088
Contents lists available at ScienceDirect
Optik
journal homepage: www.elsevier.de/ijleo

Moving object detection based on improved VIBE and graph cut optimization

Jianfang Dou*, Jianxun Li
Department of Automation, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
Article history: Received 30 November 2012; Accepted 28 April 2013
Keywords: Background model; Moving object detection; VIBE; Mean shift; Feature clustering

Abstract

Extracting foreground moving objects from video sequences is an important task and also a hot topic in computer vision and image processing. Segmentation results can be used in many object-based video applications such as object-based video coding, content-based video retrieval, intelligent video surveillance and video-based human-computer interaction. In this paper, we present a novel moving object detection method based on improved VIBE and graph cut from monocular video sequences. Firstly, we perform moving object detection for the current frame based on the improved VIBE method to extract the background and foreground information; then we obtain the clusters of foreground and background respectively using mean shift clustering on the background and foreground information; third, we initialize the S/T network with the corresponding image pixels as nodes (except the S/T nodes) and calculate the data and smoothness terms of the graph; finally, we use max flow/minimum cut to segment the S/T network and extract the moving objects. Experimental results on indoor and outdoor videos demonstrate the efficiency of our proposed method.
1. Introduction
Moving objects detection for a static camera has been extensively studied for many years [1,2]. It plays a very important role in many vision applications, whose purpose is to extract the interesting target area and locate the moving objects in image sequences. It is widely used in vision systems such as traffic control, video surveillance of unattended outdoor environments, video surveillance of objects, activity recognition, object tracking and behavior understanding. Accurate moving object detection is essential for the robustness of intelligent video surveillance systems.

* Corresponding author. E-mail address: specialdays [email protected] (J. Dou).
0030-4026/$ - see front matter © 2013 Elsevier GmbH. All rights reserved. http://dx.doi.org/10.1016/j.ijleo.2013.04.106

Background subtraction and temporal differencing are two popular approaches to segment moving objects in an image sequence under a stationary camera. Background subtraction detects moving objects by evaluating the difference of pixel features of the current scene image against a reference background image, as in the Gaussian mixture model (GMM) [3]. GMM is widely used because of its self-learning capacity and its robustness to variations in lighting. However, it still has some shortcomings. One problem is that it does not explicitly model the spatial dependencies of neighboring background pixels' colors. Therefore, false positive pixels will be produced in highly dynamic scenes where the dynamic texture does not repeat exactly. Temporal differencing methods such as W4 [4] use three parameters to model each pixel of the background: the maximum gray value, the minimum gray value, and the maximum difference between two successive frames over a period of time. Although such methods accommodate environmental changes well, they generally recover only partial edge shapes of moving objects. For non-stationary cameras, optical flow methods can be used, which assign to every pixel a 2-D velocity vector over a sequence of images. Moving objects are then detected based on the characteristics of the velocity vectors. Optical flow methods are computationally intensive and can also only detect partial edge shapes of moving objects. To improve the accuracy of the background model and to deal with highly dynamic scenes, spatial information can be exploited. Li [5] uses spatial information at the feature level, such as color and gradient. This improves the accuracy of the background model and is most suitable for stationary backgrounds. Olivier Barnich and Marc Van Droogenbroeck proposed the ViBe algorithm [6,7]. The method adopts neighboring pixels to create the background model and detects targets by comparing the background model with the current input pixel values; it also gives three steps to update the model to adapt to changes in the environment. It uses randomly selected old samples to approximate the color distribution of the background. As described by the authors, the advantage of the randomization is that it avoids replacing only the oldest samples. However, when the background has a similar color to the foreground
(i.e. the camouflage problem), it will result in missing foreground points and thus in the lack of an accurate contour for post motion analysis such as recognition. ViBe is also slow to eliminate ghost regions: when the foreground object passes through these areas, the detection accuracy drops; at the same time, the false detection rate of the algorithm goes up because of the ghosting region.
Recently, the graph cut algorithm has been used to detect video moving objects through the energy minimization technique under the framework of MAP-MRF (maximum a posteriori Markov random field). Most graph cut [8,9] algorithms focus on the iterative process and on prior information about the moving objects in order to improve the detection accuracy. Stereo-based segmentation [10] seems to achieve the most robust results by fusing color, contrast and stereo matching information, but it requires special stereo input. Moreover, graph cut requires labeling of the source and sink seeds by a human operator. Our improved VIBE method yields an initial foreground region and can thus provide the seeds of foreground and background automatically.
In this paper, in order to overcome the shortcomings of the original VIBE and provide accurate silhouettes with good spatial and temporal consistency, we present a novel moving object detection method based on improved VIBE and graph cut from monocular video sequences. Firstly, we perform moving object detection for the current frame based on the improved VIBE method to extract the background and foreground information; then we obtain the clusters of foreground and background respectively using mean shift clustering on the background and foreground information; third, we initialize the S/T network with the corresponding image pixels as nodes (except the S/T nodes) and calculate the data and smoothness terms of the graph; finally, we use max flow/minimum cut to segment the S/T network and extract the moving objects. Experimental results on indoor and outdoor videos demonstrate the efficiency of our proposed method.
The paper is organized as follows: Section 2 reviews the VIBE algorithm. Section 3 presents our proposed algorithm, including the improved VIBE, mean shift clustering and the construction of the network flow graph. Section 4 reports experiments and results. Finally, Section 5 concludes the paper.
2. Review of VIBE algorithm
Visual Background Extractor (ViBe) is a universal background subtraction method proposed by Barnich and Van Droogenbroeck. The algorithm adopts neighboring pixels to establish the background model and compares the background model with the current pixel value. The implementation of the algorithm is subdivided into three steps [7].

The first step is to initialize the background model of each pixel from a single frame image. Since there is no temporal information in a single frame, it is supposed that neighboring pixels share a similar temporal distribution; that is, the value of a pixel and the values of its spatial neighbors have a similar distribution. The size of the neighborhood is chosen to make sure that the background model includes a sufficient number of different samples, while keeping in mind that the correlation between pixel values at different locations decreases as the neighborhood scope increases. Let T = 0 denote the first frame, and let BG0(x,y), NG(x,y) and P0(x,y) be the pixel background model value, the spatial neighborhood and the pixel value, respectively; then:
BG0(x, y) = {P0^m(xn, yn) | (xn, yn) ∈ NG(x, y)}, m = 1, 2, ..., N   (1)

where (xn, yn) in NG(x, y) is selected at random with equal probability and N is the number of samples.
Secondly, there is a simple decision process to determine whether an input pixel belongs to the background or not, and then to
Fig. 1. Flowchart of the foreground segmentation algorithm GC_STVIBE.
update the background pixel model. Denote by P(x,y) the pixel value, in a given Euclidean color space, taken by the pixel located at (x,y) in the image; each background model is a collection of 20 background samples. At T = t, the background model of (x,y) is BGt−1(x,y) and the pixel value is Pt(x,y). The formula for determining foreground objects in the input frame is as follows:
Pt(x, y) = { foreground   if |BG^r_{t−1}(x, y) − Pt(x, y)| > R
           { background   if |BG^r_{t−1}(x, y) − Pt(x, y)| ≤ R      (2)
where the superscript r indexes a randomly chosen sample and R is a fixed threshold (a spherical radius in color space). When the number of samples whose distance to Pt(x,y) is within R is larger than or equal to a given cardinality #min (the minimal cardinality), we consider the pixel a background pixel; otherwise it is foreground. In the ViBe algorithm, #min is set to 2: it is sufficient to find at least two samples close enough to classify the pixel as background.
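The decision rule of Eq. (2) combined with the #min cardinality test can be sketched for a single grayscale pixel as follows (a minimal illustration, not the authors' code; `R` and `MIN_CARDINALITY` stand for the paper's R and #min, with illustrative values):

```python
R = 20               # matching radius in color space (R in the paper); value illustrative
MIN_CARDINALITY = 2  # minimal number of close samples (#min in the paper)

def classify_pixel(samples, pixel_value):
    """Eq. (2) + cardinality test: background iff at least #min samples
    of the pixel's background model lie within radius R of the value."""
    close = sum(1 for s in samples if abs(s - pixel_value) < R)
    return "background" if close >= MIN_CARDINALITY else "foreground"

model = [100, 102, 99, 150, 101] + [200] * 15   # 20 samples for one pixel
print(classify_pixel(model, 101))   # -> background (4 samples within R)
print(classify_pixel(model, 400))   # -> foreground (no sample within R)
```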
The existence of a moving object in the first frame introduces an artifact commonly called a ghost (a region of connected foreground points that does not correspond to any real object). Although the subsequent frames can update the background to remove the ghost, the process is slow, and while the ghost persists the detection accuracy is decreased. Moreover, the detection result of the ViBe algorithm is sensitive to illumination changes, dynamic scenes and cluttered backgrounds.
A good background subtraction method has to adapt to changes of the background caused by different lighting conditions, but also to those caused by the addition, removal or displacement of some of its parts, and to camera jitter.
The third step is to randomly update the background model with each new frame. Because of the strong statistical correlation between a pixel and its neighbors, when a pixel is detected as a background pixel, its neighboring pixels are highly likely to be background pixels as well. Consequently, background pixels can be used to update the background models of neighboring pixels. Through this random updating policy, foreground objects that halt suddenly or remain static for a long time are merged into the background model.
3. Our proposed algorithm
The block diagram of our algorithm is demonstrated by Fig. 1.
The steps of our proposed approach to foreground segmentation of moving objects can be briefly summarized as follows:
Fig. 2. Illustration of improved foreground detection.
1) Firstly, perform moving object detection for the current frame based on the improved VIBE method to extract the background IB(p,t) and foreground information IF(p,t);
2) For any pixel p in the current frame, if p is a foreground pixel, add it to PF, the set of foreground pixels; if p is a background pixel, add it to PB, the set of background pixels;
3) Obtain the clusters of foreground and background, {RF_i} (i = 1, 2, ..., m) and {RB_j} (j = 1, 2, ..., n), respectively, using mean shift clustering on the pixels of PF and PB;
4) Initialize the S/T network model with the corresponding image pixels as nodes (except the S/T nodes);
5) Calculate Tlink and Nlink to build the likelihood energy function according to Eqs. (16)-(19) and construct the graph cut model;
6) Use max flow/minimum cut to segment the S/T network and obtain the binary label of each node;
7) If a node is labeled S, its corresponding image pixel is foreground; otherwise it is background. This yields the current foreground object mask.
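Viewed end to end, the seven steps amount to the following per-frame loop (pseudocode; the function names are placeholders for the components detailed in Sections 3.1-3.5, not actual routines from the paper):

```
for each frame I_t:
    IF, IB         <- improved_vibe_detect(I_t, bg_model)    # step 1, Section 3.1
    PF, PB         <- split_pixels(IF, IB)                   # step 2
    {RF_i}, {RB_j} <- mean_shift(PF), mean_shift(PB)         # step 3, Section 3.2
    G              <- init_st_network(pixels(I_t))           # step 4
    add_tlinks(G, {RF_i}, {RB_j}); add_nlinks(G)             # step 5, Eqs. (16)-(19)
    labels         <- max_flow_min_cut(G)                    # step 6
    mask           <- {p : labels(p) = S}                    # step 7
```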
We will give a detailed description of our proposed algorithm GC_STVIBE in the following sections.

3.1. Improved VIBE
3.1.1. Foreground detection
In the second step, classifying the current pixel as foreground or background, the original VIBE works at the pixel level. In our experiments we found that if the background has colors similar to the foreground, the foreground cannot be detected correctly; that is to say, the boundary of the detection is inaccurate, which is bad for post-processing such as object tracking and recognition. To overcome this limitation, we detect the foreground at the region level. Fig. 2 illustrates the improved foreground detection.

Denote by C0 the current frame pixel at (x,y) and by R0 the 3 × 3 neighborhood of C0; Ci (i = 1, 2, ..., N) is the pixel at the corresponding position in the background
Fig. 3. Illustration of improved background model updating.
model, and Ri (i = 1, 2, ..., N) is the 3 × 3 neighborhood of Ci. N is the number of samples.
SSD_i0(Ri, R0) = Σ_{n=−1}^{1} Σ_{m=−1}^{1} ||R0(x + n, y + m) − Ri(x + n, y + m)||²   (3)
where i = 1, 2, ..., N, and cnt counts the number of samples that satisfy SSD_i0(Ri, R0) < thr.

We redefine the formula of foreground object segmentation as below:
cnt = { cnt + 1   if SSD_i0(Ri, R0) < thr
      { cnt       if SSD_i0(Ri, R0) ≥ thr      (4)

Pt(x, y) = { foreground   if cnt < 2
           { background   if cnt ≥ 2           (5)
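Eqs. (3)-(5) translate directly into NumPy (a sketch assuming grayscale values and pre-extracted 3 × 3 patches; the threshold value is illustrative):

```python
import numpy as np

def region_level_label(patch0, model_patches, thr):
    """Classify the center pixel of patch0 via Eqs. (3)-(5).

    patch0        : 3x3 neighborhood R0 of the current pixel
    model_patches : N 3x3 neighborhoods Ri from the background model
    """
    cnt = 0
    for patch_i in model_patches:
        ssd = float(np.sum((patch0 - patch_i) ** 2))   # Eq. (3)
        if ssd < thr:                                  # Eq. (4)
            cnt += 1
    return "background" if cnt >= 2 else "foreground"  # Eq. (5)

r0 = np.full((3, 3), 100.0)
model = [np.full((3, 3), 101.0), np.full((3, 3), 99.0), np.full((3, 3), 200.0)]
print(region_level_label(r0, model, thr=100.0))   # -> background (cnt = 2)
```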
3.1.2. Background model update
In the third step of the original VIBE algorithm, the updating of the background model is also done at the pixel level instead of the region level. This does not exploit neighborhood statistics efficiently and may cause discontinuities in the background pixels, thus increasing the amount of missed foreground. In our improved method, we update the background model at the region level, over the 3 × 3 neighborhood of each pixel. This is shown in Fig. 3.
Let fg(x,y) denote the detected foreground mask at the current pixel. We update the corresponding background model pixel only if the condition below is met:
sum_P = Σ_{n=−1}^{1} Σ_{m=−1}^{1} fg(x + n, y + m)   (6)

Update(Model_p(x, y)) = { 1   if sum_P = 0
                        { 0   else               (7)
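The update condition of Eqs. (6)-(7) states that the model at (x, y) may be updated only when the whole 3 × 3 neighborhood of the foreground mask is empty. A NumPy sketch (assuming a binary 0/1 mask, with zero padding at the border):

```python
import numpy as np

def update_mask(fg):
    """Eqs. (6)-(7): 1 where the 3x3 neighborhood of fg(x, y) contains
    no foreground pixel (sum_P = 0), 0 elsewhere; borders zero-padded."""
    h, w = fg.shape
    padded = np.pad(fg, 1, mode="constant")
    sum_p = np.zeros_like(fg)
    for dy in (-1, 0, 1):                 # Eq. (6): 3x3 box sum
        for dx in (-1, 0, 1):
            sum_p = sum_p + padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
    return (sum_p == 0).astype(int)       # Eq. (7)

fg = np.zeros((5, 5), dtype=int)
fg[2, 2] = 1                              # a single foreground pixel
mask = update_mask(fg)
print(mask[2, 2], mask[1, 1], mask[0, 0]) # -> 0 0 1
```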
3.2. Mean shift clustering
3.2.1. Mean shift for mixed feature spaces
An appealing technique to extract the clusters is the mean shift (MS) algorithm, which does not require fixing the (maximum) number of clusters. On the other hand, the kernel bandwidth and shape for each dimension have to be chosen or estimated. Mean shift is an iterative gradient ascent method used to locate the density modes of a cloud of points, i.e. the local maxima of its density [13]. In [13], two applications of the feature-space analysis technique based on the MS procedure are discussed: discontinuity preserving filtering and the segmentation of gray-level or color images. Here the theory is briefly recalled [11-13]. Given the set of points {X^(i)}_{i=1,2,...,M} in the d-dimensional space R^d, the non-parametric density estimate at each point X is given by:
f_{H,k}(X) = (1 / (M (2π)^{d/2} |H|^{1/2})) Σ_{i=1}^{M} k(||H^{−1/2}(X − X^(i))||²)   (8)

where k(x) is a kernel profile satisfying ∫_0^∞ c_{k,d} k(||x||²) dx = 1 (with normalization constant c_{k,d} > 0); k is a monotonically decreasing function and H is the bandwidth matrix. Introducing the notation

g(x) = −k′(x)   (9)
Fig. 4. An example of a graph G for a 1D image. Nodes p, q, r, and o correspond to the pixels in the image. After computing a minimum cut C, the nodes are partitioned into supporting pixels p, q (source) and unsupported pixels r, o (sink). The weights of the links are listed in the table on the right.
leads to the density gradient:

∇f_{H,k}(X) = H^{−1} f_{H,g}(X) m_{H,g}(X)   (10)

where m_{H,g} is the "mean shift" vector,

m_{H,g}(X) = [Σ_{i=1}^{M} X^(i) g(||H^{−1/2}(X − X^(i))||²)] / [Σ_{i=1}^{M} g(||H^{−1/2}(X − X^(i))||²)] − X   (11)

Using exactly this displacement vector at each step guarantees convergence to the local maximum of the density. With a d-variate Gaussian kernel, Eq. (11) becomes

m_{H,g}(X) = [Σ_{i=1}^{M} X^(i) exp(−(1/2) D²(X, X^(i); H))] / [Σ_{i=1}^{M} exp(−(1/2) D²(X, X^(i); H))] − X   (12)

where

D²(X, X^(i); H) ≡ (X − X^(i))^T H^{−1} (X − X^(i))   (13)
is the Mahalanobis distance from X to X^(i).

Assume now that the d-dimensional space can be decomposed as the Cartesian product of S (2 in our case) independent spaces associated to different types of information (e.g. position, color), also called feature spaces or domains, with dimensions d_s, s = 1, 2, ..., S (where Σ_{s=1}^{S} d_s = d). Because the different types of information are independent, the bandwidth matrix becomes H = diag[H_1, H_2, ..., H_S] and thus the mean shift vector can be rewritten as

m_{H,g}(X) = [Σ_{i=1}^{M} X^(i) Π_{s=1}^{S} exp(−(1/2) D²(X_s, X_s^(i); H_s))] / [Σ_{i=1}^{M} Π_{s=1}^{S} exp(−(1/2) D²(X_s, X_s^(i); H_s))] − X   (14)

where X^(i)T = (X_1^(i)T, X_2^(i)T, ..., X_S^(i)T) and X^T = (X_1^T, X_2^T, ..., X_S^T).
The mean shift filtering is obtained by successive computations of Eq. (12) or Eq. (14) and translation of the kernel according to the mean shift vector. This procedure converges to the local mode of the density [13].

For color image segmentation, the feature vector adopted in mean shift clustering is the concatenation of the two spatial coordinates and the three color values in a given color space. The mean shift procedure is applied in the joint spatial-color domain. The distance between feature vectors is measured by the Euclidean distance. The CIE L*u*v* color space is usually preferred to the RGB color space since the Euclidean distance defined in L*u*v* space better approximates the human perception of color distance. The segmentation is actually a merging process performed on the regions produced by the MS filtering.
The mean shift algorithm is described as follows:

1) Choose the radius h of the search window.
2) Choose the initial location of the window.
3) Compute the mean shift vector and translate the search window by that amount.
4) Repeat till convergence.
3.2.2. Bandwidth selection
The partition of the feature space is obtained by grouping together all the data points whose associated mean shift procedures converged to the same mode. The quality of the results highly depends on the choice of the bandwidth matrix H. In [12], Comaniciu proposes to find the best bandwidths within a range of B predefined matrices {H^(b), b = 1, 2, ..., B}. Mean shift partitioning is first run at each scale (for b varying from 1 to B). For each data point X^(i), the sequence of clusters to which the point is associated is analyzed, and the scale for which the cluster is most stable is selected, along with the associated bandwidth, for that point. The algorithm can therefore be decomposed into two steps. The first, called bandwidth evaluation at the partition level, consists of finding a parametric representation of each cluster in order to perform the comparisons. The second, called evaluation at the data level, is the analysis of the cluster sequences at each data point.
3.3. Unsupervised learning based on mean shift clustering
Extract the background and foreground information at the current time T using the improved VIBE method; the foreground information is labeled white while the background is labeled black. We use mean shift to cluster the background and foreground information of the current frame and obtain the foreground clusters {RF_i} (i = 1, 2, ..., m) and background clusters {RB_j} (j = 1, 2, ..., n), where m is the number of foreground clusters and n is the number of
Fig. 5. Curtain (Frame 1847) comparative foreground segmentation methods. (a) Input image; (b) ground-truth; (c) ACMLI; (d) GMM; (e) VIBE; (f) GC_STVIBE.
Fig. 6. Curtain (Frame 2266) comparative foreground segmentation methods. (a) Input image; (b) ground-truth; (c) ACMLI; (d) GMM; (e) VIBE; (f) GC_STVIBE.
background clusters. F denotes foreground and B denotes background. The feature vector used in mean shift clustering is [x, y, R, G, B].
3.4. Graph cut
In this paper, we use terminology and notations similar to [9]. For example, Fig. 4 shows a typical weighted graph with four nodes. This graph G = ⟨V, E⟩ is defined by a set of nodes V (image pixels) and a set of undirected edges E which connect these nodes, as shown in Fig. 4. In this graph, there are two distinct terminal nodes s and t, called the source and the sink, respectively. The edges connected to the source or sink are called t-links, such as (s,p) and (p,t). The edges connecting two neighboring pixel nodes are called n-links; they are bidirectional, such as (p,q) and (q,p), and the weights of an n-link in the two directions need not be equal. A cut C is a subset of edges which separates the nodes into two parts, one belonging to the source and the other to the sink. The cost of a cut is the sum of the weights of its edges, and the minimum cut of the graph generates an optimal segmentation of the image. In our application, the source and sink seeds are given to the graph cut. An s/t cut partitions the nodes into two disjoint subsets S and T such that the source s is in S and the sink t is in T; the minimum cut problem is to find the cut with minimum cost among all cuts. One of the fundamental results in combinatorial optimization is that the minimum s/t cut can be solved by finding the maximum flow from the source s to the sink t. Generally speaking, the maximum flow is the maximum amount of water that can be sent from the source s to the sink t, with each graph edge as a pipe and its weight as the capacity of
Fig. 7. Curtain (Frame 2801) comparative foreground segmentation methods. (a)Input image; (b) ground-truth; (c) ACMLI; (d) GMM; (e) VIBE; (f) GC STVIBE.
the pipe. The max-flow/min-cut theorem [14] shows that the maximum flow is equal to the minimum cut. Most algorithms find the minimum cut by push-relabel or by augmenting paths. Augmenting-path methods push flow along unsaturated paths from the source to the sink until the maximum flow in graph G is reached. Push-relabel methods use quite a different approach: they do not maintain a valid flow during the operation; instead, active nodes with positive excess push flow, and the algorithm uses a labeling of the nodes giving a lower-bound estimate of the distance to the sink along unsaturated paths. We use the first method in our approach.
Given a labeling f, each pixel in the image is assigned one label. The graph cut algorithm seeks an energy minimum; the problem is naturally formulated in terms of energy minimization. In this framework, we seek the labeling f that minimizes the energy:

E = E_data(f) + E_smooth(f) = Σ_{p∈P} D(f_p, p) + Σ_{(p,q)∈N} V(p, q) · T(f_p ≠ f_q)   (15)

where E_data is a data error term, E_smooth is a piecewise smoothness term, P is the set of pixels in the image, D(f_p, p) is the data penalty when pixel p is assigned label f_p, V(p, q) is a smoothness penalty between two neighboring pixels p and q, N is a four-neighbor system, f_p is the label of pixel p, and T(·) is 1 if its argument is true and 0 otherwise. In this bi-partitioning problem, the label f_p is either 0 or 1. If f_p = 1, the pixel p supports the seed region; otherwise it does not.
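The max-flow/min-cut machinery invoked here can be illustrated on a tiny s/t network with a plain breadth-first augmenting-path search (an Edmonds-Karp style sketch for illustration, not the optimized solver of [9]):

```python
from collections import deque

def max_flow(capacity, s, t):
    """Edmonds-Karp: repeatedly find a shortest augmenting path by BFS
    and push the bottleneck flow along it; returns the max flow value."""
    res = {u: dict(nbrs) for u, nbrs in capacity.items()}   # residual graph
    for u in capacity:                                      # add reverse arcs
        for v in capacity[u]:
            res.setdefault(v, {}).setdefault(u, 0)
    flow = 0
    while True:
        parent, q = {s: None}, deque([s])
        while q and t not in parent:                        # BFS for an s-t path
            u = q.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow                                     # no augmenting path left
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        aug = min(res[u][v] for u, v in path)               # bottleneck capacity
        for u, v in path:
            res[u][v] -= aug
            res[v][u] += aug
        flow += aug

# t-links s->p, s->q and p->t, q->t, plus one n-link p->q.
cap = {"s": {"p": 3, "q": 1}, "p": {"t": 1, "q": 1}, "q": {"t": 2}, "t": {}}
print(max_flow(cap, "s", "t"))   # -> 3 (equals the minimum cut)
```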
Fig. 8. Curtain (Frame 2893) comparative foreground segmentation methods. (a) Input image; (b) ground-truth; (c) ACMLI; (d) GMM; (e) VIBE; (f) GC_STVIBE.
3.5. Constructing the network flow graph

3.5.1. Initialize the S/T network model
In the S/T network model, the S node denotes the object and the T node denotes the background. These two nodes do not exist physically and do not correspond to any pixel in the image. Every other node corresponds one-to-one to a pixel in the image.

3.5.2. Calculate the weight of Tlink
Firstly, for each node p (except the S/T nodes), calculate the minimum distances to the background and foreground clusters according to Eqs. (16) and (17).
Fig. 9. Number of true positives for four example frames of the Curtain video.
Fig. 10. Water Surface (Frame 499) comparative foreground segmentation methods.(a) Input image; (b) ground-truth; (c) ACMLI; (d) GMM; (e) VIBE; (f) GC STVIBE.
d_p^F = min_i ||C(p) − RF_i||   (16)

d_p^B = min_j ||C(p) − RB_j||   (17)

where d_p^F is the minimum distance between node p and the foreground clusters, and d_p^B is the minimum distance between node p and the background clusters. Calculate the s-tlink and t-tlink of each node according to Eq. (18):

V_p(s-tlink) = d_p^B / (d_p^B + d_p^F) · λ,   V_p(t-tlink) = 0,   ∀p ∈ F
V_p(t-tlink) = d_p^F / (d_p^B + d_p^F) · λ,   V_p(s-tlink) = 0,   ∀p ∈ B   (18)

where V_p(s-tlink) is the weight of the arc between node p and node S, V_p(t-tlink) is the weight of the arc between node p and node T, and λ is the weight coefficient.
3.5.3. Calculate the weight of Nlink
Nlink is the arc between nodes p and q that have a neighbor relationship. For neighboring nodes p and q, Nlink is calculated according to Eq. (19):

V_{p,q}(nlink) = 1 / (1 + exp(||C(p) − C(q)||²))   (19)

where V_{p,q}(nlink) is the weight between neighboring nodes p and q.
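Eqs. (16)-(19) map directly to code (a sketch; `lam` stands for the weight coefficient of Eq. (18), `fg_clusters`/`bg_clusters` for the mean shift cluster centers {RF_i}/{RB_j}, and `c_p` for the feature vector C(p)):

```python
import numpy as np

def tlink_weights(c_p, fg_clusters, bg_clusters, in_foreground, lam=1.0):
    """Eqs. (16)-(18): t-link weights of node p from its minimum
    distances to the foreground/background cluster centers."""
    d_f = min(np.linalg.norm(c_p - c) for c in fg_clusters)   # Eq. (16)
    d_b = min(np.linalg.norm(c_p - c) for c in bg_clusters)   # Eq. (17)
    if in_foreground:                                         # Eq. (18), p in F
        return d_b / (d_b + d_f) * lam, 0.0                   # (s-tlink, t-tlink)
    return 0.0, d_f / (d_b + d_f) * lam                       # Eq. (18), p in B

def nlink_weight(c_p, c_q):
    """Eq. (19): smoothness weight between neighboring nodes p and q."""
    return 1.0 / (1.0 + np.exp(np.linalg.norm(c_p - c_q) ** 2))

fg = [np.array([200.0, 50.0, 50.0])]     # one red-ish foreground cluster
bg = [np.array([30.0, 30.0, 30.0])]      # one dark background cluster
p = np.array([190.0, 55.0, 55.0])        # a red-ish pixel
s_w, t_w = tlink_weights(p, fg, bg, in_foreground=True)
print(s_w > t_w)                          # -> True: p binds strongly to the source
```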
Table 1. TP of Curtain.
Fig. 11. Water Surface (Frame 559) comparative foreground segmentation methods. (a) Input image; (b) ground-truth; (c) ACMLI; (d) GMM; (e) VIBE; (f) GC_STVIBE.
After building the graph cut model, we obtain the result of foreground segmentation through max flow/minimum cut on the S/T network.
4. Experiment and results
In this section, we experimentally test our method on indoor and outdoor scenes containing moving objects, the situations considered in this paper. The main goal is to show that our method is effective and its results are relatively accurate.
The indoor and outdoor video frames are obtained from the web [15]. We compare our proposed GC_STVIBE method with three existing algorithms on indoor and outdoor sequences. The visual examples and quantitative evaluations of the experiments are described in the following subsections.
4.1. Quantitative evaluations

Quantitative evaluation of the proposed method and comparison with three existing methods were also performed in this study. The results were evaluated quantitatively by comparison with the ground truths in terms of:
1) The number of false positives (FP): the number of background points that are mis-detected as foreground.
2) The number of true positives (TP): the number of foreground points that are detected as foreground points.
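Given binary detection and ground-truth masks, the two counts are one line each (a sketch assuming 0/1 NumPy arrays of equal size):

```python
import numpy as np

def tp_fp(detected, ground_truth):
    """TP: foreground detected as foreground; FP: background detected as
    foreground. Both masks are binary arrays with 1 = foreground."""
    tp = int(np.sum((detected == 1) & (ground_truth == 1)))
    fp = int(np.sum((detected == 1) & (ground_truth == 0)))
    return tp, fp

gt  = np.array([[1, 1, 0], [0, 0, 0]])
det = np.array([[1, 0, 1], [0, 0, 0]])
print(tp_fp(det, gt))   # -> (1, 1): one true positive, one false positive
```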
Fig. 12. Water Surface (Frame 577) comparative foreground segmentation methods.(a) Input image; (b) ground-truth; (c) ACMLI; (d) GMM; (e) VIBE; (f) GC STVIBE.
4.2. Experiment results
The Curtain video shows an office environment, an indoor view. An office environment is usually composed of stationary background objects. The difficulties for foreground detection in these scenes can be caused by shadows, changes of illumination conditions, and camouflaged foreground objects (i.e. the color of the foreground object is similar to that of the covered background). In some cases, the background may contain dynamic objects, such as waving curtains, running fans, and flickering screens. Water Surface is an outdoor view. Changes in the background of this type of scene are often caused by the motion of the water or by changes in the weather.
The segmentation results for four example frames of the Curtain video and the Water Surface video are shown in Figs. 5-8 and Figs. 10-13. In each figure, the displayed images are: (a) a frame from the video, (b) the ground truth, (c) the result of ACMLI, (d) the result of GMM, (e) the result of VIBE, and (f) the result of the proposed method. Figs. 9 and 14 illustrate the TP counts for the four example frames of the Curtain and Water Surface videos.
Frame    ACMLI    GMM    VIBE    GC_STVIBE
1847     518      718    569     883
2266     154      613    386     774
2801     323      818    517     851
2893     98       779    475     899
Fig. 13. Water Surface (Frame 619) comparative foreground segmentation methods.(a) Input image; (b) ground-truth; (c) ACMLI; (d) GMM; (e) VIBE; (f) GC STVIBE.
Table 2. TP of Water Surface.

Frame    ACMLI    GMM    VIBE    GC_STVIBE
499      330      430    405     696
559      294      580    373     656
577      215      621    447     728
619      186      588    435     653
Fig. 14. Number of true positives for four example frames of the Water Surface video.
The quantitative evaluation results over the examples displayedin Figs. 5–8 and Figs. 10–13 are shown in Tables 1 and 2 respectively.The quantitative evaluation agrees with the conclusions from thevisual observation of the experimental results (Fig. 14).
5. Conclusion
In this paper, we present a novel moving object detection method based on improved VIBE and graph cut from monocular video sequences. Firstly, we perform moving object detection for the current frame based on the improved VIBE method to extract the background and foreground information; then we obtain the clusters of foreground and background respectively using mean shift clustering on the background and foreground information; third, we initialize the S/T network with the corresponding image pixels as nodes (except the S/T nodes) and calculate the data and smoothness terms of the graph; finally, we use max flow/minimum cut to segment the S/T network and extract the moving objects. Experimental results show that the quantitative detection indicators of the proposed algorithm perform better in both indoor and outdoor conditions.
Acknowledgments
This work was jointly supported by the National Natural Science Foundation (61175008); 973 Projects (2009CB824900, 2010CB734103); and the Space Foundation of Supporting-Technology no. 2011-HT-SHJD002.
References
[1] T. Bouwmans, Recent advanced statistical background modeling for foreground detection: a systematic survey, RPCS 4 (3) (2011) 147-176.
[2] T. Bouwmans, F. El Baf, B. Vachon, Statistical background modeling for foreground detection: a survey, in: Handbook of Pattern Recognition and Computer Vision (vol. 4), World Scientific Publishing, Singapore, 2010, pp. 181-199 (Ch. 3).
[3] C. Stauffer, E. Grimson, Learning patterns of activity using real-time tracking, IEEE Trans. Pattern Anal. Mach. Intell. 22 (8) (2000) 747-757.
[4] I. Haritaoglu, D. Harwood, L. Davis, W4: real-time surveillance of people and their activities, IEEE Trans. Pattern Anal. Mach. Intell. 22 (8) (2000) 809-830.
[5] L. Li, W. Huang, I. Gu, Q. Tian, Foreground object detection from videos containing complex background, in: ACM International Conference on Multimedia, ACM, Berkeley, USA, November 2003, pp. 2-10.
[6] O. Barnich, M. Van Droogenbroeck, ViBe: a powerful random technique to estimate the background in video sequences, in: International Conference on Acoustics, Speech, and Signal Processing, April 2009, pp. 945-948.
[7] O. Barnich, M. Van Droogenbroeck, ViBe: a universal background subtraction algorithm for video sequences, IEEE Trans. Image Process. 20 (6) (2011) 1709-1724.
[8] C. Rother, V. Kolmogorov, A. Blake, GrabCut: interactive foreground extraction using iterated graph cuts, ACM Trans. Graph. 23 (3) (2004) 309-314.
[9] Y. Boykov, M.P. Jolly, Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images, in: Proc. IEEE Int. Conf. on Computer Vision, Vancouver, Canada, 2001, pp. 105-112.
[10] V. Kolmogorov, A. Criminisi, A. Blake, G. Cross, C. Rother, Probabilistic fusion of stereo with color and contrast for bi-layer segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 28 (9) (2006) 1480-1492.
[11] Y. Cheng, Mean shift, mode seeking, and clustering, IEEE Trans. Pattern Anal. Mach. Intell. 17 (8) (1995) 790-799.
[12] D. Comaniciu, An algorithm for data-driven bandwidth selection, IEEE Trans. Pattern Anal. Mach. Intell. 25 (2) (2003) 281-288.
[13] D. Comaniciu, P. Meer, Mean shift: a robust approach toward feature space analysis, IEEE Trans. Pattern Anal. Mach. Intell. 24 (5) (2002) 603-619.
[14] L. Ford, D. Fulkerson, Flows in Networks, Princeton University Press, Princeton, NJ, USA, 1962.
[15] http://perception.i2r.a-star.edu.sg