[IEEE 2011 IEEE International Conference on Multimedia and Expo (ICME) - Barcelona, Spain...
NEW FRAME RATE UP-CONVERSION BASED ON
FOREGROUND/BACKGROUND SEGMENTATION
Mehmet Mutlu Cekic and Ulug Bayazit
Istanbul Technical University
[email protected], [email protected]
ABSTRACT
Frame rate up-conversion (FRUC) increases the quality of a
video by increasing its temporal frequency. Motion
compensated and non-motion compensated frame rate up
conversion techniques make up the two main classes of
techniques used in this area. Halo artifacts and jaggy edges
cause the quality of video to be reduced both subjectively
and objectively in these techniques. In this paper, we
introduce a new method of motion compensated FRUC that
uses foreground background segmentation to address these
problems. In typical motion compensated FRUC, motion
vectors of existent frames are scaled to define the motion
vectors of interpolated frames. In the proposed method,
those pixels of an interpolated frame with no
unidirectionally assigned motion vectors are assigned
motion vectors by interpolation based on known motion
vectors of their neighbors and their foreground or
background identity. The proposed method significantly
improves the video quality and yields better PSNR results
than the methods of [1] and [2].
Index Terms: frame rate up-conversion, motion vectors,
segmentation, background subtraction
1. INTRODUCTION
Advances in display technologies of televisions or
monitors take place almost every day. Recently, the
popularity of high definition (HD) TVs has made the
100Hz-120Hz video display technology a desirable feature.
Due to the limited bandwidth of the transmission network,
video is generally captured and encoded at a low frame rate
before being broadcast, but needs to be converted to a higher
frame rate video at the receiver with the purpose of
displaying fast motion seamlessly. For instance, a video
decoded at 50 frames per second (fps) can be converted to
100 fps or 200 fps video before being displayed. This
technique is called frame rate up conversion (FRUC).
There are two types of FRUC methods used in video
processing. These are non-motion compensated and motion
compensated video frame rate up conversion (MC-FRUC).
Non-motion compensated FRUC does not exploit the motion
of the objects. It is implemented as either frame repetition or
linear interpolation of the frames [3]. Because adequate
computational resources were lacking, this technique was
popular in the past. On the downside, it causes artifacts in
the video such as motion blur or motion judder.
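The two non-motion compensated options can be illustrated with a minimal sketch. The function names and the list-of-rows frame layout are assumptions made for illustration only; they are not part of the paper or of [3].

```python
def repeat_frame(prev_frame, next_frame):
    """Frame repetition: the inserted frame is a copy of the previous frame."""
    return [row[:] for row in prev_frame]


def average_frames(prev_frame, next_frame):
    """Linear interpolation at the temporal midpoint: per-pixel mean."""
    return [[(a + b) / 2 for a, b in zip(row0, row1)]
            for row0, row1 in zip(prev_frame, next_frame)]


# A bright pixel moves one position to the right between the two frames.
f0 = [[0, 0, 100, 0]]
f1 = [[0, 0, 0, 100]]
print(repeat_frame(f0, f1))    # [[0, 0, 100, 0]]
print(average_frames(f0, f1))  # [[0.0, 0.0, 50.0, 50.0]]
```

In this toy case repetition leaves the object at its old position (judder), while averaging spreads it over both positions at half intensity (blur), which is the kind of artifact that motion compensation aims to avoid.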
In the last ten years, motion compensated video frame
rate up conversion methods with higher complexity could be
supported due to the developments in the microchip
technology for receiver end processors. When compared to
the non-motion compensated methods, these methods
efficiently reduce the occurrence of the aforementioned
artifacts resulting in a higher interpolated frame quality.
Motion estimation (ME) and motion compensation (MC)
make up the two main stages of motion compensated frame
rate up conversion.
In the ME stage, the motion vectors (MV) are determined
by using typically two successive, reconstructed frames that
are available following video decoding. Generally, the block
matching (BM) method is used for motion estimation where
the minimized error measure between the target and
displaced blocks in the reconstructed frames is typically the
sum of absolute differences (SAD) measure.
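As a rough illustration of SAD-based block matching, the sketch below performs an exhaustive search over a small window. Exhaustive search is used here only for simplicity and is not the search strategy of the paper (which uses a hexagonal search, [17]); the function names, the block size `n` and the search `radius` are illustrative assumptions.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the block of `ref` displaced by (dx, dy)."""
    return sum(abs(cur[by + j][bx + i] - ref[by + dy + j][bx + dx + i])
               for j in range(n) for i in range(n))


def full_search(cur, ref, bx, by, n, radius):
    """Return (best SAD, dx, dy) over all displacements within `radius`
    that keep the displaced block inside the reference frame."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            inside = (0 <= bx + dx and bx + dx + n <= w and
                      0 <= by + dy and by + dy + n <= h)
            if not inside:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


# A 2 x 2 bright patch sits at (1, 1) in the reference frame and at (3, 2)
# in the current frame.
ref = [[0] * 6 for _ in range(6)]
cur = [[0] * 6 for _ in range(6)]
for j in range(2):
    for i in range(2):
        ref[1 + j][1 + i] = 9
        cur[2 + j][3 + i] = 9

print(full_search(cur, ref, bx=3, by=2, n=2, radius=2))  # (0, -2, -1)
```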
Block matching yields acceptable video coding
performance. However, it does not yield vectors
representing true motion which are very important for
constructing high quality interpolated frames. 3-D Recursive
Search Block Matching introduced by De Haan [4]
addresses and solves this deficiency of block matching.
Some other techniques like ME for overlapped motion
vectors [5], motion vector classification [1], correlation of
motion vectors [6] and adaptive true motion estimation [7]
are also used to overcome ME problems in obtaining true
motion vectors. Techniques which improve ME performance
over boundaries of regions or blocks with different motion
such as Variable Size Block Matching [8] and Adaptive
Overlapped Block Matching Compensation [9] have been
reported to also improve the quality of the interpolated
frames.
In the motion compensated interpolation stage, a new
frame between two successive frames is constructed using
the estimated MVs. While this method is being applied, some
problems can be encountered, such as halo artifacts or jaggy
edges at occlusion areas. Handling occlusion artifacts is not
a trivial problem and many solutions have been proposed to
decrease these artifacts. The most popular of these is based
on using an occlusion map as suggested by [10] and [11].
Another approach is constructing color segmentation maps
obtained by a Markovian approach [12]. Ince [13] proposed
a new method called geometry based estimation which
overcomes occlusion artifacts by using geometric
mismatches.
Identifying moving objects is a very popular problem in
computer vision. The most popular solution is background
subtraction. Typically, stationary background pixels are
determined and then subtracted from the whole frame in this
technique. Thereby, pixels of moving foreground objects are
determined. One such method used for background
subtraction is due to Zivkovic [14] where a Gaussian
mixture model (GMM) is employed for estimating the
probability density function. GMM modelling helps to cope
with pixel values of foreground objects having complex
distributions [14]. Another approach is based on color
segmentation or background modeling. This method
proposes using color information obtained from background
subtraction and shadow detection to improve object
segmentation and background update [15].
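The principle of background subtraction can be sketched with a much simpler model than the GMM of [14]: a running-average background image and a fixed threshold. This substitution is deliberate and only for illustration; the update `rate`, the `threshold` and the function names are assumptions and do not come from [14] or [15].

```python
def update_background(background, frame, rate=0.05):
    """Blend the new frame into the background model (running average)."""
    return [[(1 - rate) * b + rate * p for b, p in zip(brow, frow)]
            for brow, frow in zip(background, frame)]


def foreground_mask(background, frame, threshold=25):
    """Mark a pixel as foreground when it differs strongly from the model."""
    return [[abs(p - b) > threshold for b, p in zip(brow, frow)]
            for brow, frow in zip(background, frame)]


background = [[10, 10, 10]]
frame = [[12, 200, 9]]
print(foreground_mask(background, frame))  # [[False, True, False]]
background = update_background(background, frame)
```

A per-pixel mixture model replaces the single running average with several weighted Gaussians, which is what allows it to cope with more complex pixel value distributions.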
In this paper, a method is proposed for enhancing the
quality of the interpolated frame by using foreground and
background identities of pixels determined by the
background subtraction method. With this technique,
motion vectors can be determined for those pixels of the
interpolated frame that have no motion vectors mapped to
them from the existent frames. The next section describes the
proposed method, and Section 3 presents experimental results
that compare it against the methods of [1], [2] and [16] and
demonstrate the gain obtained with segmentation. The final
section presents concluding remarks.
2. THE PROPOSED METHOD
Let subscript 0 denote the previous frame and subscript 1
denote the current frame of the two successive frames of a
video sequence. The interpolated frame midway (in time)
between these two frames is denoted by subscript ½. The
notation $mv_{i,j}(x,y)$ denotes a motion vector at coordinates
$(x,y)$ in frame $i$ that points to frame $j$. In the current
implementation, the initial motion field between frames 1
and 0 is obtained by using the hexagonal block matching
algorithm [17] at full pixel accuracy.
During the first stage of the proposed method, the
foreground region is determined by the background
subtraction method. Later, when the interpolated frame is
constructed, this information, that indicates the positions of
the moving object pixels, is utilized.
The second stage is the scaling of the motion vectors
from $mv_{1,0}(x,y)$ to $mv_{1,1/2}(x,y)$ for each pixel. Since the
motion vector $mv_{1,1/2}(x,y)$ is obtained by scaling, it does
not have full pixel accuracy. Let $\tilde{x}$ and $\tilde{y}$ denote
the coordinates of the half pixel accurate point in frame ½,
which can be found via the following equations:
$$\tilde{x} = x + mv_{1,1/2,x}(x,y)$$
$$\tilde{y} = y + mv_{1,1/2,y}(x,y)$$
The pixel positions in frame ½ in a half pixel
neighborhood around $(\tilde{x},\tilde{y})$ are then determined. The
coordinates $(x',y')$ of such a pixel satisfy
$$(|x' - \tilde{x}| \le 0.5) \;\text{AND}\; (|y' - \tilde{y}| \le 0.5)$$
The motion vector(s) at coordinates $(x',y')$ in frame ½ can now
be expressed by using the motion vectors from frame 1 to
frame ½:
$$mv_{1/2,1}(x',y') = -mv_{1,1/2}(x,y)$$
$$mv_{1/2,0}(x',y') = mv_{1,1/2}(x,y)$$
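The scaling and projection step can be sketched as follows, assuming that the scaling halves the frame 1 to frame 0 vector (frame ½ lies midway in time) and that the half pixel neighborhood is the set of integer positions within 0.5 of the projected point in each coordinate. The function name and the tuple layout are illustrative assumptions.

```python
import math


def project_to_half_frame(x, y, mv10):
    """Project pixel (x, y) of frame 1 into frame 1/2.

    mv10 is the full-pixel vector (dx, dy) from frame 1 to frame 0.
    Returns the integer target pixels (x', y') in frame 1/2 together with
    the vectors from frame 1/2 to frame 1 and from frame 1/2 to frame 0.
    """
    mvx, mvy = mv10[0] / 2, mv10[1] / 2        # scaled vector, frame 1 -> 1/2
    xt, yt = x + mvx, y + mvy                  # half pixel accurate point
    xs = range(math.ceil(xt - 0.5), math.floor(xt + 0.5) + 1)
    ys = range(math.ceil(yt - 0.5), math.floor(yt + 0.5) + 1)
    targets = [(xp, yp) for yp in ys for xp in xs]
    mv_half_to_1 = (-mvx, -mvy)
    mv_half_to_0 = (mvx, mvy)
    return targets, mv_half_to_1, mv_half_to_0


print(project_to_half_frame(4, 7, (3, -2)))
# ([(5, 6), (6, 6)], (-1.5, 1.0), (1.5, -1.0))
```

An odd vector component lands exactly between two pixels, so the same vector is assigned to both of them; this is one reason why several vectors can end up at the same pixel of frame ½.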
Before frame ½ can be reconstructed, one important step is
needed for the interpolation to succeed: the
background/foreground identities of the pixels in frame ½ need
to be determined. Since $mv_{1/2,1}(x',y') = -mv_{1,1/2}(x,y)$,
if the pixel at coordinates $(x,y)$ belongs to the foreground, then
the pixel at coordinates $(x',y')$ belongs to the foreground, too.
Otherwise, the pixel at coordinates $(x',y')$ belongs to the
background.
As a consequence of the many-to-one motion vector
mapping from frame 1 to frame ½, multiple motion vectors
may be assigned to a pixel of frame ½ (termed overlapped
pixels), no motion vector may be assigned to a pixel of
frame ½ (termed gap pixels) or a single motion vector may
be assigned to a pixel of frame ½. Each of the three types is
handled differently in the interpolation stage.
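The three pixel types can be told apart by counting how many vectors were mapped to each position of frame ½. The sketch below assumes the mapping result is available as a list of `((x', y'), mv)` pairs; this data layout and the function name are illustrative assumptions.

```python
def classify_pixels(width, height, assignments):
    """Label every pixel of frame 1/2 as 'gap', 'single' or 'overlapped'.

    assignments is a list of ((x', y'), mv) pairs produced by projecting
    the motion vectors of frame 1 into frame 1/2.
    """
    counts = {}
    for position, _mv in assignments:
        counts[position] = counts.get(position, 0) + 1

    labels = {}
    for yp in range(height):
        for xp in range(width):
            n = counts.get((xp, yp), 0)
            if n == 0:
                labels[(xp, yp)] = "gap"
            elif n == 1:
                labels[(xp, yp)] = "single"
            else:
                labels[(xp, yp)] = "overlapped"
    return labels


assignments = [((0, 0), (1.0, 0.0)), ((0, 0), (2.0, 0.0)), ((2, 0), (0.0, 0.0))]
print(classify_pixels(3, 1, assignments))
# {(0, 0): 'overlapped', (1, 0): 'gap', (2, 0): 'single'}
```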
For the overlapped pixel type, more than one motion
vector from frame 1 to frame ½ maps to the same location
and the most reliable one needs to be singled out. For a
given $(x',y')$, the motion vector $mv_{1,1/2}(x_i,y_i)$ with the
minimum value of $(x'-\tilde{x}_i)^2 + (y'-\tilde{y}_i)^2$ is used. If there is
more than one such motion vector, the one for which
$mv_{1,0}(x_i,y_i)$ has the least block matching error (SAD)
between frames 1 and 0 is used.
$$i^* = \arg\min_i \, SAD(mv_{1,0}(x_i,y_i))$$
$$mv_{1/2,1}(x',y') = -mv_{1,0}(x_{i^*},y_{i^*})/2$$
Then, $mv_{1/2,1}(x',y')$ is used to transfer the pixel value from
frame 1 to frame ½ by bilinear interpolation of pixel values.
Since half pixel accurate motion vectors are used, bilinear
interpolation in frame 1 is performed with the following
equations.
$$x'' = x' + mv_{1/2,1,x}(x',y') \quad (1)$$
$$y'' = y' + mv_{1/2,1,y}(x',y') \quad (2)$$
$$x_1 = \lfloor x'' \rfloor, \quad x_2 = \lceil x'' \rceil \quad (3)$$
$$y_1 = \lfloor y'' \rfloor, \quad y_2 = \lceil y'' \rceil \quad (4)$$
$$f_{1/2}(x',y') = \frac{1}{4} \sum_{j=1}^{2} \sum_{k=1}^{2} f_1(x_j,y_k) \quad (5)$$
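Equations (1)-(5) can be sketched as below, under the assumption that $x_1, x_2$ (and $y_1, y_2$) in (3) and (4) are the floor and ceiling of the displaced coordinate, so that a whole-pixel coordinate contributes the same sample twice. Frames are indexed as `frame[y][x]`; boundary handling is omitted, and the function name is illustrative.

```python
import math


def fetch_from_frame1(frame1, xp, yp, mv_half_to_1):
    """Pixel value for (x', y') of frame 1/2 taken from frame 1.

    mv_half_to_1 is the half pixel accurate vector from frame 1/2 to frame 1.
    The four samples around the displaced position are averaged.
    """
    xpp = xp + mv_half_to_1[0]                 # eq. (1)
    ypp = yp + mv_half_to_1[1]                 # eq. (2)
    xs = (math.floor(xpp), math.ceil(xpp))     # eq. (3), assumed floor/ceil
    ys = (math.floor(ypp), math.ceil(ypp))     # eq. (4), assumed floor/ceil
    return sum(frame1[yk][xj] for xj in xs for yk in ys) / 4   # eq. (5)


frame1 = [[10, 20, 30],
          [40, 50, 60],
          [70, 80, 90]]
# Displaced position (1.5, 1.0): mean of the samples 50 and 60.
print(fetch_from_frame1(frame1, 1, 1, (0.5, 0.0)))  # 55.0
```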
A gap pixel does not have any motion vector assigned to
it. A motion vector can be determined by linear interpolation
of the assigned motion vectors of nearby pixels. Two types
of regions for a gap pixel at coordinates $(x',y')$ can be
identified:
1. The gap pixel at coordinates $(x',y')$ is surrounded on
all sides by foreground (or on all sides by background)
pixels; that is, going up, down, left or right in frame ½,
all four nearest pixels with known motion vectors belong to
the foreground (background) region, as depicted in Figure 1(a).
In this case, we determine $mv_{1/2,0}(x',y')$ by bilinear
interpolation using the motion vectors of the nearest
foreground (background) pixels having the same vertical or
horizontal coordinate as
$$mv_{1/2,0}(x',y') = \frac{1}{2}\left[ \frac{\delta}{\alpha+\delta}\, mv_{1/2,0}(x'-\alpha,y') + \frac{\sigma}{\beta+\sigma}\, mv_{1/2,0}(x',y'-\beta) + \frac{\alpha}{\alpha+\delta}\, mv_{1/2,0}(x'+\delta,y') + \frac{\beta}{\beta+\sigma}\, mv_{1/2,0}(x',y'+\sigma) \right]$$
where $\alpha$, $\beta$, $\delta$, and $\sigma$ are the distances to the nearest
neighboring pixels that are not gap pixels. Then, we use
$mv_{1/2,0}(x',y')$ to unidirectionally transfer the bilinearly
interpolated pixel value from frame 0 to frame ½ similar to
(1)-(5).
2. The gap pixel at coordinates $(x',y')$ is surrounded on
some sides by background and on other sides by foreground;
that is, going up, down, left or right in frame ½, some of
the four nearest pixels belong to the background and others
belong to the foreground.
If the three nearest non-gap pixels have the same
foreground/background identity as in Figure 1(b), we
determine $mv_{1/2,0}(x',y')$ by linear interpolation of their
motion vectors. On the other hand, if two nearest non-gap
pixels have foreground identity and the other two have
background identity (which typically occurs when opposite
neighboring pixels have opposite identity as in Figure 1(c)),
a tiebreaking rule is applied as follows:
Fig. 1. a) For a gap pixel for which the identities of the nearest
four non-gap pixels on four sides are the same, the motion vector is
determined by the average of the bilinear interpolations of the
motion vectors of these pixels with the same horizontal and same
vertical coordinate. b) For a gap pixel for which the two nearest
non-gap pixels on two sides are foreground (light shade) and the
other two are background (dark shade), the average of the distances
between the motion vectors in frame 0 pointed to from the nearest
non-gap foreground pixels and the motion vector in frame 0
pointed to from the gap pixel by the linear interpolation of the
foreground motion vectors in frame ½ is computed and compared
against a similarly computed average distance for the background
pixels. In this case, the gap pixel motion vector is linearly
interpolated from the motion vectors of the two foreground pixels
since the average distance is smaller for the foreground (since
$v_{1,B}$ disagrees with $v_{2,B} = 0$ and $v_{3,B} = 0$, shown as dots).
The motion vector of the pixel in frame 0, referenced by the
linear interpolation of the two foreground motion vectors
from the gap pixel position in frame ½, is computed first.
Here the two nearest foreground pixels lie at $(x',y'-\alpha)$ and
$(x'+\delta,y')$:
$$mv^{F}_{1/2,0}(x',y') = \frac{\delta}{\alpha+\delta}\, mv_{1/2,0}(x',y'-\alpha) + \frac{\alpha}{\alpha+\delta}\, mv_{1/2,0}(x'+\delta,y')$$
$$v_{1,F} = mv_{0,-1}\big(x' + mv^{F}_{1/2,0,x}(x',y'),\; y' + mv^{F}_{1/2,0,y}(x',y')\big)$$
The average of the distances between this vector and each of
the two motion vectors of pixels in frame 0 referenced by the
motion vectors of the two foreground pixels in frame ½ is
determined next:
$$v_{2,F} = mv_{0,-1}\big(x' + mv_{1/2,0,x}(x',y'-\alpha),\; y'-\alpha + mv_{1/2,0,y}(x',y'-\alpha)\big)$$
$$v_{3,F} = mv_{0,-1}\big(x'+\delta + mv_{1/2,0,x}(x'+\delta,y'),\; y' + mv_{1/2,0,y}(x'+\delta,y')\big)$$
$$d_F = \big( \|v_{1,F} - v_{2,F}\| + \|v_{1,F} - v_{3,F}\| \big) / 2$$
A similar average distance computation is then performed
for the background pixels to yield:
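$$mv^{B}_{1/2,0}(x',y') = \frac{\beta}{\beta+\sigma}\, mv_{1/2,0}(x',y'+\sigma) + \frac{\sigma}{\beta+\sigma}\, mv_{1/2,0}(x'-\beta,y')$$
$$v_{1,B} = mv_{0,-1}\big(x' + mv^{B}_{1/2,0,x}(x',y'),\; y' + mv^{B}_{1/2,0,y}(x',y')\big)$$
$$v_{2,B} = mv_{0,-1}\big(x' + mv_{1/2,0,x}(x',y'+\sigma),\; y'+\sigma + mv_{1/2,0,y}(x',y'+\sigma)\big)$$
$$v_{3,B} = mv_{0,-1}\big(x'-\beta + mv_{1/2,0,x}(x'-\beta,y'),\; y' + mv_{1/2,0,y}(x'-\beta,y')\big)$$
$$d_B = \big( \|v_{1,B} - v_{2,B}\| + \|v_{1,B} - v_{3,B}\| \big) / 2$$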
If the average distance for the foreground (background)
pixels exceeds the average distance for the background
(foreground) pixels, the background (foreground) pixels’
motion vectors are used to interpolate the motion vector of
the gap pixel.
$$j^* = \arg\min_{j \in \{B,F\}} d_j$$
$$mv_{1/2,0}(x',y') = mv^{j^*}_{1/2,0}(x',y')$$
The above decision rule relies on the assumption of
consistency of motion vectors of spatially adjacent pixels of
the same identity in frames 0 and ½.
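The tiebreaking rule can be sketched as follows, assuming the three frame 0 vectors of each identity (the one referenced by the interpolated vector and the two referenced by the neighbors' vectors) have already been looked up. The Euclidean norm and the choice of foreground on an exact tie are assumptions of this sketch, as are the function names.

```python
import math


def mean_distance(v1, v2, v3):
    """Average distance between v1 and each of v2 and v3 (Euclidean norm)."""
    return (math.dist(v1, v2) + math.dist(v1, v3)) / 2


def pick_identity(v_foreground, v_background):
    """Choose the identity whose frame 0 vectors agree best.

    Each argument is a triple (v1, v2, v3) of 2-D vectors for one identity.
    Returns ('F' or 'B', d_F, d_B); an exact tie goes to 'F' (assumption).
    """
    d_f = mean_distance(*v_foreground)
    d_b = mean_distance(*v_background)
    return ("F" if d_f <= d_b else "B"), d_f, d_b


# Foreground vectors are nearly consistent; background neighbors are static
# while the interpolated background vector points at a moving pixel.
v_f = ((2, 0), (2, 0), (2, 1))
v_b = ((3, 0), (0, 0), (0, 0))
print(pick_identity(v_f, v_b))  # ('F', 0.5, 3.0)
```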
Once the motion vector $mv_{1/2,0}(x',y')$ is determined, it
is used to transfer the bilinearly interpolated pixel value from
frame 0 to frame ½ similar to (1)-(5).
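For gap pixels of the first type (all four nearest non-gap neighbors share one identity), the interpolation of the motion vector can be sketched as below. The sketch assumes that $\alpha$ and $\delta$ are the distances to the left and right neighbors and $\beta$ and $\sigma$ the distances to the upper and lower neighbors, with each neighbor weighted by the distance of the opposite one; the function and parameter names are illustrative.

```python
def interpolate_gap_mv(mv_left, mv_right, mv_up, mv_down,
                       alpha, delta, beta, sigma):
    """Average of a horizontal and a vertical linear interpolation.

    alpha/delta: distances to the left/right non-gap neighbors,
    beta/sigma: distances to the upper/lower non-gap neighbors.
    """
    def blend(mv_a, dist_a, mv_b, dist_b):
        # The closer neighbor receives the larger weight.
        w_a = dist_b / (dist_a + dist_b)
        w_b = dist_a / (dist_a + dist_b)
        return tuple(w_a * a + w_b * b for a, b in zip(mv_a, mv_b))

    horizontal = blend(mv_left, alpha, mv_right, delta)
    vertical = blend(mv_up, beta, mv_down, sigma)
    return tuple((h + v) / 2 for h, v in zip(horizontal, vertical))


print(interpolate_gap_mv(mv_left=(4, 0), mv_right=(0, 0),
                         mv_up=(2, 2), mv_down=(2, 2),
                         alpha=1, delta=3, beta=1, sigma=1))  # (2.5, 1.0)
```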
For a pixel in frame ½ at coordinates $(x',y')$ with one
motion vector assigned to it, both $mv_{1/2,1}(x',y')$ and
$mv_{1/2,0}(x',y')$ are used to transfer the average of the bilinearly
interpolated values from frames 1 and 0. Hence for such a
pixel the motion compensation is bidirectional.
The method described above has been integrated with the
foreground/background segmentation algorithm of [14] that
is appropriate for videos for which camera view is
stationary.
3. EXPERIMENTAL RESULTS
In order to show the effectiveness of the proposed method,
performance comparisons with [1] and [2] on seven
uncompressed video sequences were conducted. The sequences
used in the simulations were Akiyo, Foreman, Carphone,
Garden, Mobile, News, and Stefan with respective
resolutions of 352 x 288, 176 x 144, 176 x 144, 352 x 240,
352 x 288, 352 x 288, and 352 x 240. The first 50 interpolated
frames of Carphone, Foreman, Garden, News, and Stefan
and 150 interpolated frames of Akiyo were used for the
performance comparisons with the reported results of [1], [2]
and [16]. For Mobile, the first 50 interpolated frames were used
for the comparison with [1] and 150 frames were used
for the comparison with [2].
Alongside the standard PSNR quality measure, the
structural similarity (SSIM) index has also been used to assess
performance. SSIM compares local patterns of pixel
intensities that have been normalized for luminance and
contrast and is defined in [18] as
$$SSIM(I_1,I_2) = \frac{(2\mu_1\mu_2 + C_1)(2\sigma_{1,2} + C_2)}{(\mu_1^2 + \mu_2^2 + C_1)(\sigma_1^2 + \sigma_2^2 + C_2)}$$
where $\mu_i$ and $\sigma_i^2$ are the mean and variance of luma values
in an 8x8 window of the i'th image and $\sigma_{i,j}$ is the covariance
between luma values in corresponding 8x8 windows of the i'th
and j'th images. $C_1$ and $C_2$ are constants based on the
dynamic range of the luma values that stabilize the division
when the denominator is weak.
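The SSIM expression for a single pair of windows can be sketched as follows. The constants $C_1 = (0.01 \cdot 255)^2$ and $C_2 = (0.03 \cdot 255)^2$ are the values commonly used for 8-bit data and are an assumption here, since the text does not state the values used in the experiments; windows are passed as flat lists of luma values.

```python
def ssim_window(a, b, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """SSIM of two equally sized windows given as flat lists of luma values.

    c1 and c2 default to commonly used 8-bit constants (assumption).
    """
    n = len(a)
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    var_a = sum((p - mu_a) * (p - mu_a) for p in a) / n
    var_b = sum((q - mu_b) * (q - mu_b) for q in b) / n
    cov = sum((p - mu_a) * (q - mu_b) for p, q in zip(a, b)) / n
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a * mu_a + mu_b * mu_b + c1) * (var_a + var_b + c2)
    return numerator / denominator


window = [10, 20, 30, 40]
print(ssim_window(window, window))        # identical windows give 1.0
print(ssim_window(window, window[::-1]))  # reversed structure scores lower
```

A frame-level score is typically obtained by averaging this value over all windows of the frame.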
Tables 1 and 2 present the comparison of PSNR and
SSIM results obtained with the methods in [1] and [2],
respectively. When compared to [2], the proposed method
gives about 7.3 dB higher peak signal to noise ratio (PSNR)
for Akiyo (Table 2). Moreover, gains of about 1.8 dB for
Foreman, 0.4 dB for Carphone, 3.9 dB for Mobile and 3.4 dB
for Garden are observed in Table 1 when the proposed method
is compared to [1].
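For reference, the PSNR of an interpolated frame against the original frame can be computed as in the sketch below. A peak value of 255 (8-bit luma) is assumed, and the function name is illustrative.

```python
import math


def psnr(reference, test, peak=255):
    """PSNR in dB between two equally sized frames (lists of rows)."""
    count = len(reference) * len(reference[0])
    mse = sum((a - b) ** 2
              for ref_row, test_row in zip(reference, test)
              for a, b in zip(ref_row, test_row)) / count
    if mse == 0:
        return float("inf")  # identical frames
    return 10 * math.log10(peak ** 2 / mse)


reference = [[100, 100], [100, 100]]
interpolated = [[100, 110], [100, 90]]
print(round(psnr(reference, interpolated), 2))  # mean squared error is 50
```

Because the scale is logarithmic, halving the mean squared error raises the PSNR by about 3 dB.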
Table 3 shows a comparison of the proposed method
with the correlation based motion vector processing method
proposed in [16]. While the performance of [16] is better
than the proposed method for the Football video, the
performances of [16] and the proposed method are
comparable for the other two videos tested.
Table 4 shows the performance gain with the
foreground/background segmentation algorithm in the
proposed method. In the no segmentation case used as a
reference, motion vectors of gap pixels are reconstructed by
bilinear interpolation of the motion vectors of the nearest
four non-gap pixels. It is seen that PSNR results for four of
the seven sequences are significantly better when
segmentation is integrated.
Figure 2 presents a frame of Akiyo interpolated with the
proposed method and the corresponding frame interpolated
with [2] for comparison. The artifacts observed at the lips
with [2] do not appear with the proposed method. In Figure
3, even though the contour details are sharper with fewer
ringing artifacts for the method of [2], the interpolation
with the proposed method does not suffer from the sharp,
unexpected edge discontinuities and blocking artifacts
seen with [2].
4. CONCLUSION
In this paper, we introduced a foreground/background
segmentation based frame rate up-conversion method. For
MC-FRUC, the difficulty of identifying covered-uncovered
regions introduces artifacts into the interpolated frame. In
this work, the foreground/background segmentation data was
integrated with the MC-FRUC algorithm as a solution to this
problem. Experimental results on several test video
sequences have shown that this new method has better
performance in terms of PSNR and SSIM than that of [1]
and [2], and comparable performance to that of [16] on
interpolated frames. The proposed method is free of
blocking artifacts and thus provides a high subjective visual
quality.
Owing largely to the integrated segmentation algorithm,
the proposed method is currently designed to work only on
video with a stationary background. Future research is aimed
at extending the proposed method by integrating it with
more advanced segmentation algorithms that can handle
panning or more complex motions of the background as well
as overlapping motions of multiple foreground objects.
Table 1. Performance comparisons with the method of [1].
Video        Method in [1]         Proposed method
sequence     PSNR (dB)   SSIM      PSNR (dB)   SSIM
Carphone     33.6726     0.9542    34.066      0.970
Foreman      33.2036     0.9543    35.016      0.970
Mobile       23.7419     0.9106    27.670      0.942
Garden       24.6681     0.8938    28.117      0.945
Table 2. Performance comparisons with the method of [2].
Video        Method in [2]    Proposed method
sequence     PSNR (dB)        PSNR (dB)   SSIM
Akiyo        39.458           46.752      0.996
News         32.386           37.065      0.981
Mobile       20.757           28.252      0.951
Stefan       22.347           26.813      0.916
Table 3. Performance comparisons with the method of [16].
Video        Method in [16]        Proposed method
sequence     PSNR (dB)   SSIM      PSNR (dB)   SSIM
Football     25.110      0.780     22.237      0.670
Foreman      31.700      0.960     32.343      0.927
Stefan       26.490      0.900     26.568      0.905
Table 4. Performance gain with the use of segmentation.
Video        No segmentation       Segmentation
sequence     PSNR (dB)   SSIM      PSNR (dB)   SSIM
Akiyo        46.743      0.996     46.752      0.996
News         37.080      0.981     37.065      0.976
Foreman      30.832      0.917     32.343      0.927
Carphone     33.092      0.963     34.066      0.967
Mobile       28.077      0.946     28.252      0.951
Garden       27.430      0.939     28.117      0.945
Stefan       26.284      0.913     26.813      0.916
Fig. 2. Akiyo sequence subjective comparison of a) Original frame
40, b) Method in [2] and c) Proposed Method
Fig. 3. Mobile sequence subjective comparison of a) Original
frame 40, b) Method in [2] and c) Proposed Method
5. REFERENCES
[1] X. Gao, Y. Yang, B. Xiao, “Adaptive frame rate up-
conversion based on motion classification,” Elsevier,
International Journal of Signal Processing, vol. 88, no.
12, pp. 2979-2988, 2008.
[2] S.J Kung, D.G Yoo, S.K. Lee, Y.H Kim, “Design and
Implementation of Median Filter based Adaptive
Motion Vector Smoothing for Motion Compensated
Frame Rate Up-Conversion,” IEEE 13th International
Symposium Consumer Electronics ISCE’09, pp.745-
748, Kyoto, Japan, May 2009.
[3] A. N. Netravali and J. D. Robbins, “Motion-adaptive
interpolation of television frames,” Proc. Picture
Coding Symposium, June 1981.
[4] G. De Haan, Paul W.A.C. Biezen, H. Huijgen, O. A.
Ojo, “True Motion Estimation With 3-D Recursive
Search Block Matching,” IEEE Trans. Circuit Sys.
Video Tech., pp. 368-388, Oct. 1993.
[5] J. K. Su, R. M. Mersereau, “Motion estimation methods
for overlapped block motion compensation,” IEEE
Trans. Image Proc., vol. 9, no.9, pp. 1509-1521, Sept.
2000.
[6] A. M. Huang and T.Q. Nguyen, “Correlation-based
motion vector processing for motion compensated
frame interpolation,” IEEE Trans. Image Proc., vol. 17
no.5, pp. 694–708, May 2008.
[7] M. Cetin and I. Hamzaoglu, “An adaptive true motion
estimation algorithm for frame rate conversion of high
definition video,” International Conference on Pattern
Recognition, pp.4109-4112, Istanbul, Turkey, August
2010.
[8] M. H. Chan, Y. B. Yu, and A. G. Constantinides,
“Variable size block matching motion compensation
with application to video coding,” Proc. Inst. Elect.
Eng. , vol. 137, no. 4, pp. 205–212, Aug. 1990.
[9] B.-D. Choi, J.-W. Han, C.-S. Kim, S-J. Ko, “Motion-
Compensated Frame Interpolation Using Bilateral
Motion Estimation and Adaptive Overlapped Block
Motion Compensation,” IEEE Trans. Circuits and
Systems for Video Tech., vol. 17, no. 4, pp. 407-416,
April 2007
[10] Wei Hong, “Low-Complexity Occlusion Handling for
Motion-Compensated Frame Rate Up-Conversion,” Int.
Conf. Consumer Electronics 2009, ICCE’09, pp. 1-2,
Las Vegas, NV, January 2009.
[11] B. Cizmeci and H.F. Ates. “Occlusion Aware Motion
Compensation for Video Frame Rate Up-Conversion,”
Proc. IASTED International Conf. on Signal and Image
Processing (SIP), Maui, Hawaii, August 2010.
[12] P.M Jodoin, C. Rosenberger, M. Mignotte, “Detecting
Half-Occlusion with a Fast Region Based Fusion
Procedure,” British Machine Vision Conference, pp.
417-426, Edinburgh, UK, 2006.
[13] S. Ince and J.Konrad, “Geometry-Based Estimation of
Occlusions From Video Frame Pairs,” IEEE Trans.
Acoustics, Speech and Signal Processing, vol 2, pp.
ii/933-ii/936, March 2005.
[14] Zoran Zivkovic, “Improved adaptive Gaussian mixture
model for background subtraction,” Proceedings of
17th Intl. Conf. Pattern Recognition, ICPR 2004, vol. 2,
pp. 28-31, Cambridge, UK, August 2004.
[15] R. Cucchiara, M. Piccardi and A.Prati, “Detecting
Moving Objects, Ghosts, and Shadows in Video
Streams,” IEEE Pattern Analysis and Machine
Intelligence, vol. 25, no. 10, pp. 1337-1342, Sept. 2003.
[16] A.M. Huang, T. Nguyen, “Correlation-Based Motion
Vector Processing with Adaptive Interpolation Scheme
for Motion-Compensated Frame Interpolation,” IEEE
Transactions on Image Processing, vol. 18, no. 4, pp.
740-752, April 2009.
[17] A. Hamosfakidis and Y. Paker, “A Novel Hexagonal
Search Algorithm for Fast Block Matching Motion
Estimation,” EURASIP Journal on Applied Signal
Processing, vol. 2002, no. 6, pp. 595-600, June 2002.
[18] H. R. Sheikh, E.P Simoncelli, Z. Wang, A.C Bovik,
“Image quality assessment: from error visibility to
structural similarity,” IEEE Trans. Image Proc., vol. 13,
no. 4, pp. 600–612, April 2004.
[19] Y.M Chen, Ivan.V. Bajic, C.Qian, “Frame Rate Up-
Conversion of Compressed Video Using Region
Segmentation and Depth Ordering,” Proc. IEEE
PacRim'09, pp. 431-436, Victoria, BC, August 2009.