Shot Boundary Detection Using Macroblock Prediction Type Information


S. De Bruyne1, K. De Wolf1, W. De Neve1, P. Verhoeve2, and R. Van de Walle3

1 Ghent University - IBBT, ELIS, Multimedia Lab, Gaston Crommenlaan 8 bus 201, B-9050 Ledeberg-Ghent, Belgium

2 Televic, Leo Bekaertlaan 1, B-8870 Izegem, Belgium

3 Ghent University - IBBT - IMEC, ELIS, Multimedia Lab, Gaston Crommenlaan 8 bus 201, B-9050 Ledeberg-Ghent, Belgium

Abstract The increasing availability and use of digital video has led to a high demand for efficient video analysis techniques. The starting point in video browsing and retrieval systems is the low-level analysis of video content, especially the segmentation of video content into shots. In this paper, we propose a method for automatic video indexing based on the macroblock prediction type information of MPEG-4 Visual compressed video bitstreams with varying GOP structures. This method exploits the decisions made by the encoder in the motion estimation phase, resulting in specific characteristics of the macroblock prediction type information when shot boundaries occur. By working on compressed domain information, full frame decoding of MPEG-4 Visual bitstreams is avoided. Hence, fast segmentation can be achieved compared to metrics in the uncompressed domain.

    1 Introduction

Recent advances in multimedia compression technology, combined with the significant increase in computer performance, as well as the growth of the Internet, have resulted in the widespread use and availability of digital video. As a consequence, many terabytes of video data are stored in large video databases. They are often insufficiently cataloged and are only accessible by sequential scanning of the sequences. This has resulted in a growing demand for new technologies and tools for the efficient indexing, browsing, and retrieval of digital video data.

Shot boundary detection has been generally accepted as the necessary prerequisite step to achieve automatic video content analysis. A shot is defined as a sequence of frames continuously captured from the same camera. Depending on whether the transition between consecutive shots is abrupt or not, boundaries are classified as cuts or gradual transitions, respectively [1].

Most methods for shot boundary detection have been developed in the uncompressed domain [1]. In this domain, many features, such as color and edges, can be exploited, resulting in a high prediction accuracy. On the other hand, in the compressed domain, full decompression can be avoided by using compressed domain features only [2,3], which makes real-time processing possible. Since most video data are stored in compressed formats for efficiency of storage and transmission, we focus on methods in the compressed domain. Up to now, most algorithms in this domain have focused on MPEG-2 coded video. To go one step further, this paper takes a closer look at an algorithm for bitstreams compliant with the Advanced Simple Profile of MPEG-4 Visual.

In this paper, we propose a method that is based on the encoder's search for the best prediction. This process selects for each macroblock (MB) the prediction type that results in the most efficient encoding. This leads to specific patterns in the MB type information when successive frames have dissimilar contents, such as consecutive frames belonging to different shots.

The outline of this paper is as follows. In Section 2, background information on MPEG-4 Visual compressed video sequences is provided. The actual algorithm for shot boundary detection is elaborated in Section 3. Section 4 discusses some performance results obtained by our method, while Section 5 concludes this paper.

    2 High-level Overview of MPEG-4 Visual

Compressed video sequences conforming to the MPEG-4 Visual specification [4] consist of three kinds of frames. I-VOPs (Video Object Planes) are coded without any reference to other frames, P-VOPs are coded using motion-compensated prediction from a previous frame, and B-VOPs use bidirectional motion-compensated prediction, meaning that a previous as well as a future frame


[Figure 1 shows three frame sequences with the cut position marked in each: (a) R_{i-1} B_i b_{i+1} R_{i+2}, cut before B_i; (b) R_{i-3} B_{i-2} b_{i-1} R_i, cut before R_i; (c) R_{i-2} B_{i-1} b_i R_{i+1}, cut before b_i.]

Fig. 1 Possible positions of a cut in a frame triplet

can be used. These I- and P-VOPs are denoted as reference frames, as they can be used for the prediction of P- and B-VOPs. All these frames can be embedded in Groups Of Pictures (GOPs) corresponding to the structure described by the regular expression IB*(PB*)*.
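The GOP grammar given by this regular expression can be checked mechanically. A minimal sketch in Python (the helper name and frame-type strings are our own, not part of the paper):

```python
import re

# The GOP structure from the text: one I-VOP, optional B-VOPs,
# then any number of (P-VOP followed by optional B-VOPs) groups.
GOP_PATTERN = re.compile(r"IB*(?:PB*)*")

def is_valid_gop(frame_types: str) -> bool:
    """Return True if a string of frame types matches IB*(PB*)*."""
    return GOP_PATTERN.fullmatch(frame_types) is not None

# The classic GOP structure used in Section 3.1:
print(is_valid_gop("IBBPBBPBB"))  # True
print(is_valid_gop("BBIPP"))      # False: a GOP cannot start with B-VOPs
```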

A single frame is further divided into MBs, which contain information about the type of the temporal prediction and the corresponding motion vector(s) (MVs) used for motion compensation. The possible prediction types are Intra coded (without any prediction from other frames) and Inter coded (using motion-compensated prediction from one or more previously encoded frames). The latter is subdivided into Forward, Backward, and Bidirectional referenced prediction. Depending on the frame type to which the MBs belong, one can choose from one or more prediction types. In particular, I-VOPs contain only intra coded MBs, P-VOPs consist of intra coded and forward referenced MBs, and B-VOPs contain forward, backward, and bidirectional referenced MBs. Our method, based on an algorithm for MPEG-2 video sequences [3], makes use of this prediction information to locate possible shot boundaries.

    3 Shot Boundary Detection

    3.1 Proposed Method

Within a video sequence, a continuous strong inter-frame correlation is present as long as no significant changes occur. Due to the high similarity within a shot, most of the MBs belonging to B-VOPs are coded using bidirectional prediction, because a B-VOP bears a strong resemblance to the past as well as to the future reference frame. These MBs can be encoded with a specific bidirectional mode, i.e., the direct mode. This mode utilizes the MVs of the corresponding reference frames as a prediction of its own MV(s) and provides higher compression when the motion in successive frames is similar. On the other hand, when a B-VOP has a high resemblance to only one of its reference frames, it mainly refers to this frame by forward or backward prediction and uses hardly any of the information in the other reference frame. It is clear that the proportions of the different kinds of prediction types among the MBs can be used to define a metric for locating possible shot boundaries. One could expect that shot boundaries always occur at I-VOPs, but this notion alone is not enough to detect shot boundaries. This is due to the fact that I-VOPs are often used as random access points in a video and therefore occur more often than shot boundaries.

Fig. 2 Example of a metric containing two shot boundaries

First, we consider a specific (most commonly used) GOP structure to explain the underlying idea. This GOP structure [IBBPBBPBB] can be split into groups of three frames having the form of a triplet IBB or PBB. In what follows, the reference frames I and P will be denoted as R_i, the front bidirectional frame as B_i, and the rear bidirectional frame as b_i. According to this convention, the video sequence can be analyzed as a group of triplets of the form R_1 B_2 b_3, R_4 B_5 b_6, R_7 B_8 b_9.

Fig. 1 visualizes the three possible locations of a shot boundary (i.e., a cut) in a frame triplet. In the first case (a), one assumes that the front bidirectional frame B_i is the first frame with a different content. Since both bidirectional frames B_i and b_{i+1} have hardly any resemblance to the previous reference frame R_{i-1} and a close resemblance to the following reference frame R_{i+2}, most of their MBs will be backward referenced. If the content change occurs at the rear reference frame R_i (b), this new information cannot be used by the bidirectional frames B_{i-2} and b_{i-1}. Therefore, most of these MBs will use forward prediction. Finally, if the content change occurs at b_i (c), B_{i-1} will be strongly predicted forward from the first reference frame R_{i-2}, while b_i will be mainly predicted backward from the rear reference frame R_{i+1}. In case the content remains similar, none of the patterns above takes place, so that the major part of the MBs in the B-VOPs will be bidirectionally predicted.

Based on these assumptions, a metric can be defined for the visual frame difference by analyzing the percentage of MBs in a frame that are forward and/or backward referenced. Let fwd_T(i) be the number of forward referenced MBs and bwd_T(i) the number of backward referenced MBs of a given frame with index i and frame type T. The frame difference metric δ(i) can be defined as follows [3]:

    δ(i) = bwd_B(i) + bwd_b(i+1),     if i is a B-frame (a)
    δ(i) = fwd_B(i-2) + fwd_b(i-1),   if i is an R-frame (b)
    δ(i) = fwd_B(i-1) + bwd_b(i),     if i is a b-frame (c)

Peaks in δ(i) represent strong and abrupt changes in the video content. In Fig. 2, an example of the metric is shown containing two peaks, at frames 34 and 59, which correspond to two shot boundaries in the video sequence. To achieve automatic shot boundary detection, the results are compared with a predefined constant or an adaptive threshold based on the mean and the variation of δ(i) for surrounding frames.
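The triplet metric above can be sketched in a few lines of Python. The per-frame dictionaries with 'forward'/'backward' MB counts are our own illustrative representation; a real implementation would read these counts from the parsed bitstream:

```python
# Sketch of the triplet metric delta(i) for the [IBBPBBPBB] GOP structure.
# Frame types: 'R' = reference frame (I- or P-VOP), 'B' = front
# bidirectional frame, 'b' = rear bidirectional frame of a triplet.

def delta(frames, i):
    """Frame difference metric delta(i) following cases (a), (b), (c)."""
    fwd = lambda j: frames[j]["forward"]   # forward referenced MBs of frame j
    bwd = lambda j: frames[j]["backward"]  # backward referenced MBs of frame j
    t = frames[i]["type"]
    if t == "B":   # (a) cut at the front bidirectional frame
        return bwd(i) + bwd(i + 1)
    if t == "R":   # (b) cut at the rear reference frame
        return fwd(i - 2) + fwd(i - 1)
    if t == "b":   # (c) cut at the rear bidirectional frame
        return fwd(i - 1) + bwd(i)
    raise ValueError(f"unknown frame type: {t}")
```

The values of delta over the sequence are then compared against the constant or adaptive threshold described above.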


    3.2 Generalized Approach

Most video sequences available at the moment do not satisfy the above-mentioned structure. For example, when there is a lot of motion in consecutive frames, it is hard to find a proper prediction from the reference frames. Intra coding only exists for reference frames and, as a consequence, all MBs in a B-VOP need to be coded using prediction. Therefore, the encoder often prefers to use more reference frames instead of bidirectional frames. On the other hand, when there is hardly any difference between successive frames, the encoder can prefer to encode more than two frames as B-VOPs in order to increase the compression rate. Moreover, not all sequences have the same resolution, and as a consequence different thresholds are needed. To overcome these problems, the method needs to be extended to cope with all sorts of encoded video sequences, under the restriction that the encoded sequences contain B-VOPs.

When taking a closer look at the algorithm above, two cases can be distinguished, namely whether the frame is a reference frame or a bidirectional frame. In the case of a reference frame, δ(i) is obtained by counting the number of forward predicted MBs of the bidirectional frames lying between the current reference frame and the preceding reference frame. However, when there are no B-VOPs present between these two reference frames, this approach does not work. Therefore, the algorithm needs to be adjusted so that the value of δ(i) corresponds to the number of intra coded MBs in the current frame. In the case of bidirectional frames, the value of δ(i) is obtained by taking the sum of all forward referenced MBs of the preceding bidirectional frames and the backward referenced MBs of the current and following bidirectional frames between the previous and next reference frames. Furthermore, the obtained results need to be scaled by dividing δ(i) by the number of bidirectional frames and the number of MBs in a frame.

Let fwd(i), bwd(i), and intra(i) be the number of forward referenced, backward referenced, and intra coded MBs, respectively, of a given frame with index i. Further assume that #mb is the total number of MBs in a frame and n the number of bidirectional frames between the previous reference frame R_f with frame index f and the current or following reference frame R_r with index r. The extended difference metric δ(i) can be defined as:

    δ(i) = (1 / (n · #mb)) · Σ_{j=f+1}^{f+n} fwd(j),    if i is an R-frame and i-1 a B-frame

    δ(i) = (1 / #mb) · intra(i),                        if i and i-1 are R-frames

    δ(i) = (1 / (n · #mb)) · ( Σ_{j=f+1}^{i-1} fwd(j) + Σ_{j=i}^{f+n} bwd(j) ),    if i is a B-frame

Due to the applied division, δ(i) is contained in the interval [0, 1] and consequently represents the probability of a shot boundary at position i. As a result, a constant threshold can be chosen for the automatic shot boundary detection of various kinds of video sequences.
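The three cases of the extended metric can be written as the following sketch in Python. The frame records with 'fwd', 'bwd' and 'intra' counts are our own representation, and the sketch assumes i > 0:

```python
def extended_delta(frames, i, num_mb):
    """Extended difference metric, normalised to [0, 1].

    frames[j] holds the frame type ('R' for I/P-VOPs, 'B' for any
    bidirectional frame) and the counts of forward referenced ('fwd'),
    backward referenced ('bwd') and intra coded ('intra') MBs.
    num_mb is the total number of MBs per frame (#mb in the text).
    """
    is_ref = lambda j: frames[j]["type"] == "R"
    if is_ref(i):
        if is_ref(i - 1):  # no B-VOPs between two reference frames
            return frames[i]["intra"] / num_mb
        f = max(j for j in range(i) if is_ref(j))  # previous reference frame
        n = i - f - 1                              # B-VOPs between f and i
        return sum(frames[j]["fwd"] for j in range(f + 1, i)) / (n * num_mb)
    # i is a bidirectional frame: previous (f) and following (r) reference frames
    f = max(j for j in range(i) if is_ref(j))
    r = min(j for j in range(i + 1, len(frames)) if is_ref(j))
    n = r - f - 1
    fwd_sum = sum(frames[j]["fwd"] for j in range(f + 1, i))  # preceding B-VOPs
    bwd_sum = sum(frames[j]["bwd"] for j in range(i, r))      # current and following B-VOPs
    return (fwd_sum + bwd_sum) / (n * num_mb)
```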

    For gradual changes, a similar approach is adopted.

    4 Experimental Results

In order to evaluate the proposed method, an MPEG-4 Visual compliant decoder (XviD) was adapted to support shot boundary detection. First, the performance of the algorithm was evaluated on several kinds of video. Afterwards, the influence of the bitrate and of the motion estimation parameters on the encoder side was examined, and the results were compared to algorithms in the uncompressed domain.

In the test phase, four sequences with a frame size around 352x192 were selected. The first one is a part of the movie Drive, containing 64 shot boundaries, where shots with lots of object and camera motion alternate with dialogs. Shrek 2, Return of the Jedi, and Troy are all movie trailers brimming with all kinds of shot changes, consisting of 93, 51, and 36 shot boundaries, respectively. Especially Jedi and Troy are a real challenge, since they are full of motion, gradual changes, special effects, variations in light intensity, et cetera.

    4.1 Performance

To evaluate the performance of the algorithm, a comparison based on the number of missed detections (MDs) and false alarms (FAs) is made:

    Recall = Detects / (Detects + MDs)

    Precision = Detects / (Detects + FAs)
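These two measures are straightforward to compute; in the small sketch below, the example figures are taken from the Drive row of Table 1:

```python
def recall(detects: int, missed: int) -> float:
    """Recall = Detects / (Detects + MDs)."""
    return detects / (detects + missed)

def precision(detects: int, false_alarms: int) -> float:
    """Precision = Detects / (Detects + FAs)."""
    return detects / (detects + false_alarms)

# Drive sequence from Table 1: 64 detects, 0 missed detections, 2 false alarms
print(f"recall    = {recall(64, 0):.0%}")     # 100%
print(f"precision = {precision(64, 2):.0%}")  # 97%
```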

In Table 1, the performance of the proposed algorithm is presented for the above-mentioned video sequences coded at a bitrate around 680 kBit/s. For these

Table 1 Performance of the algorithm

              detects   MD   FA   recall   precision
    Drive        64      0    2    100%      97%
    Shrek 2      88      5   10     95%      90%
    Jedi         50      1    9     98%      85%
    Troy         29      7    1     81%      97%

test results, the major part of the missed detections is caused by gradual changes, since these changes are spread over an unknown number of frames and, in some cases, there is hardly any difference between two consecutive frames. This is a problem which most shot boundary detection algorithms have to cope with. The false alarms have various causes. Sudden changes in light intensity, such as lightning, camera flashes, and explosions, often lead to false alarms. This is due to the fact that the current image cannot be predicted from previous reference frames since the luminance differs greatly. Uniform black shots also cause problems, since the encoder prefers forward prediction in case of black frames. It should be possible to solve this problem by examining the DC coefficients.

    When a shot contains lots of movement, originatingfrom objects or the camera, false alarms will often occur.


Due to this motion, successive frames will have less similarity, and it will be more difficult for the encoder to find a good prediction. This leads to many intra coded macroblocks, and therefore the structure of the macroblock type information in successive frames bears resemblance to gradual changes. When taking a closer look at the test results, it is also striking that nearly all cuts were detected. This implies that the performance for simple video sequences, like news programs and drama soaps, can be expected to be very high.

4.2 Influences of the Bitrate

The influence of the bitrate chosen at the encoder side on the performance of the algorithm is shown in Table 2. These results show that the recall and the precision for different bitrates are alike. Nevertheless, one can see that the number of false alarms slightly increases and the number of missed detections decreases when the bitrate rises. This is due to the fact that the encoder prefers more reference frames and intra coded MBs in shots with a lot of motion when the bitrate is higher. The results for other sequences are similar.

    4.3 Influences of the Parameters for Motion Estimation

The influence of the motion estimation parameters at the encoder side is shown in Table 3. These parameters determine the complexity of the search window, the sub-pixel motion search precision, et cetera. Based on the parameter values, several tests were carried out for low, medium, and high complexity motion estimation. From these results, one can conclude that the complexity of the encoder's search for the best prediction has only a minor impact on the performance.

    4.4 Comparison with Methods in the UncompressedDomain

In the past, methods based on features in the uncompressed domain were investigated in our research group, in particular global color histograms and edge detection algorithms using Sobel filtering techniques. The comparison for two test sequences is given in Table 4. Our algorithm outperforms edge detection, but the algorithm based on histograms has even better results. It is obvious that algorithms in the uncompressed domain achieve a higher performance, since they have more features at their disposal. The great advantage of the compressed domain, on the other hand, is the fast segmentation; in particular, this method is faster than real-time.

When comparing the frame rates, we notice that our algorithm is a factor of 5.66 faster than the color histogram approach, and even faster relative to edge detection. On a regular desktop, our algorithm achieves a frame rate of 320 frames per second.

    Table 2 Influence of the bitrate on Jedi and Troy

           bitrate (kBit/s)   detects   MD   FA   recall   precision
    Jedi        290              49      2    8    96%       84%
    Jedi        680              50      1    9    98%       85%
    Jedi        980              50      1   11    98%       82%
    Troy        290              29      7    0    81%      100%
    Troy        680              29      7    1    81%       97%
    Troy        980              30      6    1    83%       97%

    Table 3 Influence of motion estimation on Jedi and Troy

           complexity   detects   MD   FA   recall   precision
    Jedi     low           50      1    9    98%       85%
    Jedi     medium        50      1    9    98%       85%
    Jedi     high          50      1    9    98%       85%
    Troy     low           28      6    1    82%       97%
    Troy     medium        28      7    1    80%       97%
    Troy     high          27      7    2    79%       93%

    Table 4 Comparison with the uncompressed domain

                          Shrek 2               Troy
                      recall   precision   recall   precision
    histograms          90%      100%        88%      100%
    edge detection      78%       94%        70%       92%
    macroblocks         95%       90%        79%       93%

    5 Conclusion

In this paper, we discussed an algorithm for automatic shot boundary detection based on macroblock prediction type information, a feature that is available in the compressed domain. Formulas for a generic GOP structure are presented, which are very useful in practice but to which other papers pay little attention. The measurements illustrate that this algorithm performs well, keeping in mind that this approach is quite consistent for different encoder parameter settings and far more rapid than methods in the uncompressed domain.

Acknowledgement: The research described in this paper was funded by Ghent University, the Interdisciplinary Institute for Broadband Technology (IBBT), the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT), the Fund for Scientific Research-Flanders (FWO-Flanders), the Belgian Federal Science Policy Office (BFSPO), and the European Union.

    References

1. U. Gargi, R. Kasturi, and S. Strayer, "Performance Characterization of Video-Shot-Change Detection Methods," IEEE Transactions on Circuits and Systems for Video Technology, vol. 10, pp. 1-13, 2000.

2. S. Pei and Y. Chou, "Efficient MPEG Compressed Video Analysis Using Macroblock Type Information," IEEE Transactions on Multimedia, vol. 1, pp. 321-333, 1999.

3. J. Calic and E. Izquierdo, "Towards Real-Time Shot Detection in the MPEG Compressed Domain," Proceedings of the Workshop on Image Analysis for Multimedia Interactive Services, 2001.

4. ISO/IEC JTC 1, "Coding of audio-visual objects - Part 2: Visual," ISO/IEC 14496-2 (MPEG-4 Visual version 1), April 1999.