
A Complete Framework for Temporal Video Segmentation

Ruxandra Tapu
Institut Télécom / Télécom SudParis, ARTEMIS Department, UMR CNRS 8145 MAP5, Evry, France
[email protected]

Titus Zaharia
Institut Télécom / Télécom SudParis, ARTEMIS Department, UMR CNRS 8145 MAP5, Evry, France
[email protected]

Abstract—In this paper we propose a complete high-level algorithm for segmenting video flows into scenes. In the first stage, shot boundaries are detected using an enhanced graph partition method combined with non-linear scale-space filtering, at a reduced computational cost. In the second phase, we develop static summaries for each detected shot based on a leap-extraction technique that selects a variable number of keyframes depending on the visual content variation. Finally, we propose an iterative, temporally constrained shot clustering technique that detects video scenes with average precision and recall rates of 85% and 84%, respectively.

Keywords - scene detection; static summary; shot merging; temporal constraints.

I. INTRODUCTION

Recent trends in telecommunications, together with the expansion of image and video processing and acquisition devices, have led to a rapid growth of the amount of data stored and transmitted over the Internet. Existing commercial and industrial video search engines are currently based solely on textual annotations, which consist of attaching keywords to each individual item in the database. These approaches quickly show their limitations when confronted with the immense number of databases available on the Internet, due to the intrinsic, polysemantic content of multimedia documents and to various linguistic difficulties.

When considering video indexing and retrieval applications, we first have to address the issue of structuring the huge and rich amount of heterogeneous information related to video content. Automatic video annotation can be carried out at different levels, starting from frame analysis, where visual features are extracted from each image in order to characterize its content, up to the higher semantic levels where concepts are recognized. In this context, the video segmentation process involves the following phases: shot boundary detection, automatic image sequence abstraction, and shot clustering based on similarity constraints in order to construct semantically pertinent scenes.

The rest of this paper is organized as follows. In Section II we briefly introduce the proposed shot detection algorithm. Section III describes the static summary development procedure based on representative keyframes. A novel scene extraction algorithm based on hierarchical clustering and temporal constraints is introduced in Section IV. Section V presents and analyzes the experimental results obtained. Finally, Section VI concludes the paper and opens some perspectives for future work.

II. SHOT BOUNDARY DETECTION SYSTEM

Shot boundary detection represents a key and mandatory stage that needs to be performed prior to any effective description/classification of video documents. Numerous efficient shot boundary detection algorithms have been proposed and validated during the last decade, as witnessed by the rich scientific literature dedicated to the subject.

Within this framework, the challenge is to elaborate robust shot detection methods, which can achieve high precision and recall rates whatever the movie quality and genre, the creation date and the techniques involved in the production process, while minimizing the amount of human intervention.

We start our analysis from our previous work presented in [1], which introduces an enhanced shot boundary detection algorithm based on the graph partition model combined with non-linear scale-space filtering. The scale-space analysis permits the evaluation of a relative measure of the change ratio, instead of the absolute one given by the graph partition algorithm. Defined in a relative way, the results are less sensitive to the type of video sequence being analyzed.

We next focused on reducing the global computational complexity by implementing a two-pass approach, also introduced in [1]. In a first step, the algorithm detects time intervals which can be reliably considered as belonging to the same shot. Abrupt transitions considered as certain are also detected at this stage. In a second step, the graph partition analysis is performed only on the remaining uncertain time intervals.
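As an illustration of this two-pass idea only (the actual dissimilarity measure and its scale-space filtering are those of [1] and are not reproduced here), the Python sketch below labels intervals of a frame-to-frame dissimilarity signal as reliably intra-shot, certain cuts, or uncertain; the thresholds T_low and T_high and the window size N are assumed, illustrative parameters. Only the uncertain intervals would then be passed to the full graph-partition analysis.

```python
# Illustrative sketch (not the method of [1]): first pass of a two-pass scan
# that skips intervals which clearly lie inside one shot, records isolated
# abrupt transitions as certain cuts, and keeps the rest for the second,
# more expensive graph-partition pass. Thresholds and N are assumptions.
import numpy as np

def two_pass_intervals(dissim, N=10, T_low=0.15, T_high=0.6):
    """dissim[i] = dissimilarity between frames i and i+1, scaled to [0, 1]."""
    dissim = np.asarray(dissim, dtype=float)
    certain_cuts, uncertain = [], []
    i = 0
    while i < len(dissim):
        window = dissim[i:i + N]
        if window.max() < T_low:
            i += N                                           # reliably one shot: skip
        elif window.max() > T_high and (window > T_low).sum() == 1:
            certain_cuts.append(i + int(window.argmax()))    # isolated abrupt cut
            i += N
        else:
            uncertain.append((i, min(i + N, len(dissim))))   # needs full analysis
            i += N
    return certain_cuts, uncertain
```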

III. KEYFRAME EXTRACTION

A. Related work

One of the first attempts to automate the extraction process was to choose as keyframe the first, the middle or the last picture appearing after each detected shot boundary [2], or a random image within a shot.


However, while being sufficient for stationary shots, one frame does not provide an acceptable representation of the visual content in dynamic sequences. Therefore, it is necessary to implement more complex methods, able to adapt the number of keyframes to the visual content variation of the corresponding shot [3] while removing all redundant information.

The challenge in automatic keyframe extraction is given by the necessity of adapting to the underlying semantic content, maintaining, as much as possible, the original message while removing all redundant information. In [4], the extraction process relies on the variation of color and motion features within the considered shot. When selecting the first frame encountered after a shot boundary as a keyframe, a possible disadvantage is the probability that this frame belongs to a transitional effect, which greatly reduces its representative power.

Different approaches use clustering techniques to find optimal keyframes [5]. Even so, all clustering techniques have weak points related to the threshold parameters which control the cluster density, and to the computational cost.

A mosaic-based approach can generate a panoramic image of all the informational content existing in a video stream [6]. However, creating mosaics is possible only for videos with specific object or camera motion, such as pan, zoom or tilt. In the case of movies with complex camera effects, such as a succession of background/foreground changes, mosaic-based representations return less satisfactory results due to physical location inconsistency. Furthermore, mosaics can blur certain foreground objects and thus the resulting image cannot be exploited for arbitrary-shape object recognition tasks.

B. Proposed approach

In this paper, we develop an automatic static storyboard system that extracts a variable number of keyframes from each detected shot, depending on the visual content variation. The first keyframe is, by definition, selected N frames (i.e., the window size used for shot boundary detection) after a detected transition, in order to ensure that the selected image does not belong to a gradual effect. Next, we introduce a leap-extraction method that considers for analysis only the images located at integer multiples of the window size, and not the entire video flow as in [7] and [8]. The current frame is compared with the set of keyframes already extracted for the shot. If the visual dissimilarity (chi-square distance of HSV color histograms) between the analyzed frames is significant (above a pre-established threshold), the current image is added to the keyframe set.
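A minimal sketch of this leap-extraction step is given below, assuming OpenCV is available; the histogram bin counts and the dissimilarity threshold T_keyframe are illustrative values, not the ones used in the paper.

```python
# Sketch of leap extraction: only frames at integer multiples of the window
# size N are examined; a frame becomes a keyframe if its HSV histogram is far
# (chi-square distance above T_keyframe) from every keyframe already kept.
import cv2

def hsv_hist(frame, bins=(8, 8, 8)):
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, bins, [0, 180, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def leap_keyframes(frames, shot_start, shot_end, N=10, T_keyframe=0.35):
    # First keyframe: N frames after the transition (or the first frame of short shots).
    keyframes = [shot_start + N] if shot_start + N < shot_end else [shot_start]
    hists = [hsv_hist(frames[keyframes[0]])]
    for idx in range(shot_start + 2 * N, shot_end, N):   # leap: every N-th frame only
        h = hsv_hist(frames[idx])
        dists = [cv2.compareHist(h, kh, cv2.HISTCMP_CHISQR) for kh in hists]
        if min(dists) > T_keyframe:                       # visually new content
            keyframes.append(idx)
            hists.append(h)
    return keyframes
```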

The resulting storyboard may still contain useless blank or rainbow frames. In order to eliminate them, we introduce an additional post-processing step that removes such images from the selected set of keyframes, ensuring that the static storyboard captures all the informational content of the original movie while discarding irrelevant images. To check whether a selected keyframe is useless, we compute its interest points using the SIFT descriptor. A rainbow or blank frame is detected and removed if the resulting number of keypoints is zero.
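The post-processing filter can be sketched as follows, assuming an OpenCV build that exposes SIFT (cv2.SIFT_create); a keyframe yielding no keypoints is treated as blank or rainbow and discarded.

```python
# Sketch of the storyboard clean-up: keyframes with zero SIFT keypoints are
# considered blank or "rainbow" frames and removed from the summary.
import cv2

def remove_useless_keyframes(frames, keyframe_indices):
    sift = cv2.SIFT_create()
    kept = []
    for idx in keyframe_indices:
        gray = cv2.cvtColor(frames[idx], cv2.COLOR_BGR2GRAY)
        keypoints = sift.detect(gray, None)
        if len(keypoints) > 0:          # zero keypoints => blank or rainbow frame
            kept.append(idx)
    return kept
```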

IV. SCENE SEGMENTATION

A. Related work

In recent years, various methods to partition video into scenes have been introduced. In [8], the authors transform the detection problem into a graph partition task where each shot is represented as a node connected with the others through edges based on visual similarity. In [9], the detection is based on the concept of logical story units and an inter-shot dissimilarity measure. A logical story unit gathers shots with a similarity value above a pre-established threshold. Different approaches [10], [11] propose to use audio features and low-level descriptors in order to detect scenes separately based on both characteristics. In this case, two types of boundaries are detected and used to make a final decision. A scene detection technique based on audio features is also presented in [12]. The principle here consists of merging shots based on distinctive factors such as dialogs and audio segments. In [13], color and motion information are integrated in order to make a decision.

In [14], the authors attempt to identify and understand the semantic content of a video by dividing scenes into three classes: serial, parallel with interacting events, and parallel with simultaneous events. In [15], three different types of movie scenes are introduced: a serial event scene, a parallel event scene and a traveling scene. A mosaic approach introduced in [6] uses information specific to camera settings or physical location to determine boundaries. More recent techniques [16], [17] apply concepts such as temporal constraints and visual similarity.

B. Proposed approach

The proposed method exploits the extracted keyframes in order to form scenes. A scene is defined as a collection of shots that present the same theme and share a similar coherence in action, space and time [18]. The concept of temporally constrained hierarchical clustering is introduced here.

Temporally constrained hierarchical clustering consists of iteratively merging shots that fall into a temporal analysis window and satisfy certain clustering criteria. The width of the temporal analysis window, denoted by dist, is set proportional to the average number of frames per shot:

$$dist = \alpha \cdot \frac{\text{Total number of frames}}{\text{Total number of shots}}\,, \qquad (1)$$

with α a user-defined parameter. We consider further that a scene is completely described by its constituent shots:

$$S_l = \left\{\, s_{l,p} \;:\; s_{l,p} = \{ f_{l,p,i} \}_{i=1}^{n_{l,p}},\;\; p = 1, \ldots, N_l \,\right\}, \qquad (2)$$



where S_l denotes the l-th video scene, N_l the number of shots included in scene S_l, s_{l,p} the p-th shot in scene S_l, and f_{l,p,i} the i-th keyframe of shot s_{l,p}.

Our scene change detection algorithm, based on shot clustering, can be summarized in the following steps:

Step 1: Initialization – The first shot of a film is automatically assigned to the first scene S_1. The scene counter l is set to 1.

Step 2: Shot to scene comparison – Consider as current shot s_cur the first shot which is not yet assigned to any of the already detected scenes. Detect the sub-set Ω of scenes anterior to s_cur and located at a temporal distance smaller than the parameter dist. Compute the visual similarity between the current shot and each scene S_k in the sub-set Ω, as described by the following equation:

$$\forall S_k \in \Omega, \quad ShotSceneSim(S_k, s_{cur}) = \frac{n_{matched}}{N_k \cdot n_{k,p} \cdot n_{cur}}\,, \qquad (3)$$

where n_cur is the number of keyframes of the considered shot and n_matched represents the number of matched keyframes of the scene S_k. A keyframe from scene S_k is considered to be matched with a keyframe from shot s_cur if the visual similarity measure between the two keyframes is superior to a threshold T_group. Let us note that a keyframe from the scene S_k can be matched with multiple frames from the current shot.

Finally, the current shot s_cur is identified to be similar to the scene S_k if:

$$ShotSceneSim(S_k, s_{cur}) \geq 0.5\,. \qquad (4)$$

In this case, the current shot s_cur will be clustered in the scene S_k. At the same time, all the shots between the current shot and the scene S_k will also be assigned to scene S_k and marked as neutralized. Let us note that the scenes to which such neutralized shots initially belonged disappear (in the sense that they are merged into the scene S_k). The list of detected scenes is then updated.

The neutralization process makes it possible to identify the most representative shots for a current scene (Fig. 1), which are the remaining non-neutralized shots. In this way, the influence of outlier shots, which might correspond to some punctual digressions from the main action in the considered scene, is minimized.

Figure 1. Neutralizing shots (marked with red) based on visual similarity.

Step 3: Shot by shot comparison – If the current shot (s_cur) is highly similar (i.e., with a similarity at least two times bigger than the grouping threshold T_group) to a shot of any scene in the sub-set Ω determined at Step 2, then s_cur is merged into the corresponding scene, together with all the intermediate shots. If s_cur is found highly similar to multiple other shots, then the scene which is the farthest away from the considered shot is retained.

Figure 2. Shots characterization based on visual similarity (neutralized vs. non-neutralized shots).

Both the current shot and all its highly similar matches are unmarked and, for the following clustering process, will contribute as normal, non-neutralized shots (Fig. 2). This step ensures that shots highly similar to other shots in the previous scene are grouped into this scene, and aims at reducing the number of false alarms.

Step 4: Creation of a new scene – If the current shot s_cur does not satisfy any of the similarity criteria in Steps 2 and 3, a new scene, including s_cur, is created.

Step 5: Refinement – At the end, scenes including only one shot are attached to the adjacent scene with the maximum similarity value. In the case of the first scene, it is grouped with the following one by default.

As similarity measure in (3), we selected, for each extracted keyframe, the total number of common interest points determined based on the SIFT descriptor [19]. The keypoints were put in correspondence using the KD-tree matching technique.

The grouping threshold T_group is adaptively established, depending on the visual content variation of the input video stream, as the average number of common interest points between the current keyframe and all anterior keyframes located at a temporal distance smaller than dist.
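To make the matching step concrete, the sketch below (Python with OpenCV, assuming SIFT support) shows how the number of common interest points between two keyframes could be obtained with KD-tree (FLANN) matching, and how n_matched of Eq. (3) could then be counted; the ratio test and all parameter values are illustrative assumptions rather than the authors' implementation.

```python
# Illustrative sketch (not the authors' code): SIFT + KD-tree (FLANN) matching
# used to count common interest points between keyframes, and to derive
# n_matched of Eq. (3). Ratio test and parameter values are assumptions.
import cv2

sift = cv2.SIFT_create()
flann = cv2.FlannBasedMatcher({"algorithm": 1, "trees": 5}, {"checks": 50})  # KD-tree index

def common_points(kf_a, kf_b, ratio=0.7):
    """Number of common SIFT interest points between two keyframe images."""
    _, des_a = sift.detectAndCompute(kf_a, None)
    _, des_b = sift.detectAndCompute(kf_b, None)
    if des_a is None or des_b is None:
        return 0
    matches = flann.knnMatch(des_a, des_b, k=2)
    return sum(1 for m in matches if len(m) == 2 and m[0].distance < ratio * m[1].distance)

def n_matched(scene_keyframes, shot_keyframes, T_group):
    """Scene keyframes having at least one sufficiently similar keyframe in the shot."""
    return sum(
        1 for kf_scene in scene_keyframes
        if any(common_points(kf_scene, kf_shot) > T_group for kf_shot in shot_keyframes)
    )
```

On top of such a similarity, the clustering itself (Steps 1 to 4) can be summarized by the following simplified skeleton; the shot record layout, the similarity callables and the handling of neutralized shots are assumptions, and Step 5 as well as the un-marking of highly similar shots are omitted for brevity.

```python
# Simplified skeleton of the temporally constrained shot clustering (Steps 1-4).
# `shots` is a list of dicts with "start"/"end" frame indices (assumed layout);
# `scene_sim(scene, cur)` stands for Eq. (3)-(4) and `shot_sim(s, cur)` for the
# shot-to-shot similarity of Step 3. Step 5 (refinement) is not shown.

def cluster_shots(shots, total_frames, scene_sim, shot_sim, alpha=7, T_group=10.0):
    dist = alpha * total_frames / len(shots)        # Eq. (1): temporal window width
    scenes = [[0]]                                  # Step 1: first shot -> scene S1
    neutralized = set()
    for cur in range(1, len(shots)):
        # Step 2: scenes anterior to the current shot, inside the temporal window
        candidates = [k for k, sc in enumerate(scenes)
                      if shots[cur]["start"] - shots[sc[-1]]["end"] < dist]
        merged = False
        for k in reversed(candidates):              # most recent candidate scene first
            if scene_sim(scenes[k], cur) >= 0.5:    # Eq. (4)
                between = list(range(scenes[k][-1] + 1, cur))
                neutralized.update(between)         # intermediate shots are neutralized
                scenes[k].extend(between + [cur])
                scenes = scenes[:k + 1]             # absorbed scenes disappear
                merged = True
                break
        if merged:
            continue
        # Step 3: shot-by-shot comparison against non-neutralized shots in the window
        strong = [k for k in candidates for s in scenes[k]
                  if s not in neutralized and shot_sim(s, cur) > 2 * T_group]
        if strong:
            k = min(strong)                         # keep the farthest matching scene
            scenes[k].extend(range(scenes[k][-1] + 1, cur + 1))
            scenes = scenes[:k + 1]
        else:
            scenes.append([cur])                    # Step 4: a new scene is created
    return scenes
```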



V. EXPERIMENTS AND RESULTS

The validation of our method is carried out on a corpus of 6 sitcoms and 6 Hollywood movies. As evaluation metrics, we have considered the traditional recall (R) and precision (P) measures, defined as follows:

$$R = \frac{D}{D + MD}\,, \qquad P = \frac{D}{D + FA}\,. \qquad (5)$$

where D is the number of correctly detected boundaries, MD the number of missed detections, and FA the number of false alarms. Ideally, both recall and precision should be equal to 100%, which corresponds to the case where all existing boundaries are correctly detected, without any false alarm.
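Equation (5), together with the F1 score reported in the tables below, translates directly into code; the totals of Table I are used here only as a numerical check.

```python
# Direct transcription of Eq. (5) plus the F1 score used in Tables I and II.
def precision_recall_f1(D, MD, FA):
    recall = D / (D + MD)
    precision = D / (D + FA)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Totals of Table I: D=130, MD=23, FA=22 -> P~0.855, R~0.850, F1~0.852.
print(precision_recall_f1(130, 23, 22))
```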

For the sitcoms, the scene boundary ground truth has been established by human observers (Table I), while for the Hollywood videos we use for comparison the DVD chapters identified by the movie producers. The Hollywood videos allow us to compare our method against the state-of-the-art algorithms [8], [16], [17], which yield recall and precision rates of 55% and 73%, respectively. For this corpus, our precision and recall rates are 63% and 93%, respectively, which demonstrates the effectiveness of our approach, most notably in terms of recall (Table II).

TABLE I. SCENE DETECTION BASED ON SIFT FEATURES

Video name           | Ground truth | D   | FA | MD | P (%) | R (%) | F1 (%)
Seinfeld             | 24           | 19  | 1  | 5  | 95.00 | 79.16 | 86.36
Two and a half men   | 21           | 18  | 0  | 3  | 100   | 81.81 | 90.00
Prison Break         | 39           | 31  | 3  | 8  | 91.17 | 79.48 | 84.93
Ally McBeal          | 32           | 28  | 11 | 4  | 71.79 | 87.50 | 78.87
Sex and the City     | 20           | 17  | 0  | 3  | 100   | 85.00 | 91.89
Friends              | 17           | 17  | 7  | 0  | 70.83 | 100   | 82.92
TOTAL                | 153          | 130 | 22 | 23 | 85.52 | 84.96 | 85.23

TABLE II. DVD CHAPTERS DETECTION BASED ON SIFT FEATURES

Video name           | DVD chapters | D   | FA  | MD | P (%) | R (%) | F1 (%)
5th Element          | 37           | 36  | 39  | 1  | 48.00 | 97.29 | 64.28
Ace Ventura          | 31           | 28  | 12  | 3  | 70.00 | 90.32 | 78.87
Lethal Weapon 4      | 46           | 44  | 41  | 2  | 52.87 | 95.65 | 68.09
Terminator 2         | 58           | 51  | 7   | 7  | 87.93 | 87.93 | 87.93
The Mask             | 34           | 32  | 12  | 2  | 71.11 | 94.11 | 81.01
Home Alone 2         | 29           | 28  | 22  | 1  | 56.00 | 96.55 | 70.88
TOTAL                | 235          | 219 | 133 | 16 | 62.42 | 93.19 | 74.76

D – Detected; FA – False Alarms; MD – Missed Detections; P – Precision; R – Recall; F1 – F1 norm.

In this case, the parameter α has been set to a value of 7.

We next investigated the impact of different temporal constraint lengths (Eq. 1) on the proposed scene detection method. In Fig. 3 we present the precision, recall and F1 scores obtained for various values of the α parameter. As can be observed, values between 5 and 10 yield similar results in terms of overall efficiency.

Figure 3. Performance evaluation for different values of the α parameter in DVD chapter detection.

When considering the problem of video semantic segmentation, it is highly useful to develop appropriate user interfaces that help evaluate the detection performance. The proposed video structuring platform, named VICTORIA (VIdeo Characterization Of Retrieval and Indexing Applications), is illustrated in Fig. 4.

Figure 4. The structuring process on the VICTORIA platform.

Figure 5. Video segmentation with the proposed system for an XML data file loaded on the platform.

The user can select an arbitrary input video (Fig. 4) and launch the segmentation/structuring process.




The outcome of the analysis is an XML file which specifies the shot and scene boundaries, as well as the temporal positions of the obtained keyframes. The adopted syntax is compliant with the MPEG-7 segment-based representation [20]. Finally, the system allows the user to hierarchically browse the movie at each structural level (Fig. 5).
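As a rough illustration of such an output (a much-simplified, non-validated stand-in for the actual MPEG-7 description produced by VICTORIA, with partly hypothetical element names), a Python sketch could serialize scene boundaries and keyframe positions as follows.

```python
# Much-simplified illustration of writing the segmentation result as an
# MPEG-7-style XML file. Element names loosely follow the VideoSegment /
# TemporalDecomposition pattern; the exact VICTORIA schema is assumed.
import xml.etree.ElementTree as ET

def write_structure(scenes, keyframes, path="structure.xml"):
    """scenes: list of (start_frame, end_frame) per scene; keyframes: frame indices."""
    root = ET.Element("Mpeg7")
    video = ET.SubElement(root, "Video")
    decomposition = ET.SubElement(video, "TemporalDecomposition")
    for i, (start, end) in enumerate(scenes):
        seg = ET.SubElement(decomposition, "VideoSegment", id=f"scene_{i}")
        ET.SubElement(seg, "MediaTimePoint").text = str(start)
        ET.SubElement(seg, "MediaDuration").text = str(end - start)
    kf_node = ET.SubElement(video, "Keyframes")       # hypothetical element
    for idx in keyframes:
        ET.SubElement(kf_node, "Keyframe", frame=str(idx))
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)
```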

VI. CONCLUSIONS AND PERSPECTIVES

In this paper we have presented a novel scene segmentation scheme which includes shot boundary detection, static summary development and scene identification methods.

Firstly, shot boundaries are detected using our previous technique based on the graph partition method and non-linear scale-space filtering. Secondly, keyframes for each shot are selected adaptively using a fast leap-extraction method.

Concerning the grouping of shots into scenes, by exploiting the observation that shots belonging to the same scene have similar visual features, we have adopted a grouping method based on temporal constraints that uses adaptive thresholds derived from the input video dynamics. We introduce a novel measure to determine the similarity between a scene and a shot, as well as a new concept of neutralized shots, depending on their significance to the grouping process. The technique compares favorably with other state-of-the-art methods and captures global similarities rather than local ones. The experimental evaluation validates our approach when using, as similarity measure, the total number of interest points extracted with the SIFT descriptor. The F1 measure, when applying our method to sitcoms, is around 85%, while for DVD chapter detection it is around 75%.

As future work, we consider developing a complete system to organize huge amounts of image sequences without user intervention, in order to facilitate browsing through the data based on semantic labeling and event detection.

REFERENCES

[1] R. Tapu, T. Zaharia, F. Preteux, "A scale-space filtering-based shot detection algorithm", IEEE 26th Convention of Electrical and Electronics Engineers in Israel (IEEEI 2010), pp. 919-923, 2010.

[2] Hanjalic, A., Zhang, H. J.: An integrated scheme for automated video abstraction based on unsupervised cluster-validity analysis, IEEE Transactions on Circuits and Systems for Video Technology, vol. 9, no. 8, 1999.

[3] Fu, X., Zeng, J.: An Improved Histogram Based Image Sequence Retrieval Method, Proceedings of the International Symposium on Intelligent Information Systems and Applications , 015-018, 2009.

[4] Zhang, H., Wu, J., Zhong, D., Smoliar, S. W.: An integrated system for content-based video retrieval and browsing, Pattern Recognition., vol. 30, no. 4, 643–658, 1999.

[5] Girgensohn, A., Boreczky, J.: Time-Constrained Keyframe Selection Technique, in IEEE Multimedia Systems, IEEE Computer Society, Vol. 1, pp.756- 761, 1999.

[6] Aner, A., Kender, J. R.: Video summaries through mosaic-based shot and scene clustering, in Proc. European Conf. Computer Vision, 388–402, 2002.

[7] G. Ciocca, R. Schettini, “Dynamic key-frame extraction for video summarization,” in Internet Imaging VI, vol. 5670 of Proceedings of SPIE, pp. 137–142, California, USA, January 2005.

[8] Z. Rasheed, Y. Sheikh, and M. Shah, “On the use of computable features for film classification,” IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 1, pp. 52–64, Jan. 2005.

[9] Hanjalic A, Lagendijk RL, Biemond J, Automated high-level movie segmentation for advanced video-retrieval systems, IEEE Circuits Syst Video Technol 9, 580–588, 1999.

[10] Ariki Y., Kumano M., Tsukada K.: Highlight scene extraction in real time from baseball live video. Proceeding on ACM International Workshop on Multimedia Information Retrieval, 209–214, 2003.

[11] J. Huang, Z. Liu, and Y. Wang, “Integration of audio and visual information for content-based video segmentation,” in Proc. IEEE Int. Conf. Image Process., Chicago, IL, pp. 526–529, 1998.

[12] R. Lienhart, S. Pfeiffer, and W. Effelsberg, “Scene determination based on video and audio features,” Multimedia Tools and Applications, vol. 15, no. 1, pp. 59–81, 2001.

[13] Ngo C.W., Zhang H.J.: Motion-based video representation for scene change detection. Int J Comput Vis 50(2), (2002) 127–142.

[14] Zhao YJ, Wang T et al., “Scene segmentation and categorization using NCuts”, IEEE proceeding on Computer vision and pattern recognition, 343–348, 2007.

[15] Tavanapong W, Zhou J, “Shot clustering techniques for story browsing”, IEEE Trans Multimed 6 (4):517–526, 2004.

[16] V. Chasanis, A. Kalogeratos, A. Likas, "Movie Segmentation into Scenes and Chapters Using Locally Weighted Bag of Visual Words", Proceeding of the ACM International Conference on Image and Video Retrieval (CIVR 2009), Santorini, July 2009.

[17] S. Zhu, Y. Liu, “Video scene segmentation and semantic representation using a novel scheme,” Multimedia Tools and Applications, vol. 42, no. 2, pp. 183-205, 2009.

[18] B. Truong, S. Venkatesh, “Video abstraction: A systematic review and classification”, ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP), v.3 n.1, p.3-es, February 2007.

[19] Lowe, D.: Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision, 1-28, 2004.

[20] ISO/IEC 15938-5, Information technology - Multimedia Content Description Interface - Part 5: Multimedia Description Schemes, 2003.
