
Understanding and Predicting Interestingness of Videos
Yu-Gang Jiang, Yanran Wang, Rui Feng, Xiangyang Xue, Yingbin Zheng, Hanfang Yang
Fudan University, Shanghai, China
AAAI 2013, Bellevue, USA, July 2013






Good afternoon. My name is Yanran Wang, from Fudan University, Shanghai, China. The work I will present is "Understanding and Predicting Interestingness of Videos."

Large amount of videos on the Internet: consumer videos, advertisements

Some videos are interesting, while many are not

Motivation

More interesting vs. less interesting: two advertisements of digital products

1. As we all know, the number of videos on the Internet is growing explosively. There are different types of videos, such as consumer videos, advertisements, and short movies. Among these videos, some are very interesting, while many are not.
2. Let me show you two examples. They are both ads for digital products.
3. Apparently, most of you may find the left one, with the cute little girl, more attractive and interesting, while the right one is a little boring.
4. So we came up with this idea: if a computational method could identify more interesting videos automatically, it would be very useful in many applications.

Applications
Web Video Search
Recommendation System
...

Examples: Web video search, video recommendation systems. If a website can offer users interesting and attractive videos, it can improve user satisfaction. Also, predicting the interestingness of ads may help enterprises make better ads and promote their products more effectively.

In Web video search, among equally relevant videos, ranking interesting videos higher may improve user satisfaction.

Predicting Aesthetics and Interestingness of Images: Datta et al., ECCV 2006; Dhar et al., CVPR 2011; N. Murray et al., CVPR 2012

We are the first to explore the interestingness of videos

Related Work

More interesting vs. less interesting

There is some prior work on predicting the interestingness of images. However, no studies on videos have been done yet. We conduct a pilot study on this topic and obtain findings different from those of image interestingness prediction.

Flickr — source: Flickr.com; consumer videos; 1,200 videos (20 hrs in total)

YouTube — source: YouTube.com; advertisement videos; 420 videos (4.2 hrs in total)

Two New Datasets
Because there is no public dataset for this study, we construct new video datasets in which each video has an interestingness label. Here is a brief look: we build two datasets, from Flickr and YouTube respectively. The Flickr dataset consists of 1,200 videos in 15 categories (like basketball, bird). The YouTube dataset consists of 420 videos in 14 categories, that is, 30 videos per category.

Collected by 15 interestingness-enabled queries
Top 10% of 400 as interesting videos; bottom 10% as uninteresting
80 videos per category/query

Flickr Dataset
Here is how we build the Flickr dataset. Take the basketball category for example: we enter the query "basketball" on the Flickr website and sort the videos by interestingness (a function provided by Flickr). Then we select the top 10% of 400 videos as interesting samples, and the bottom 10% as uninteresting. In this way, we get 80 videos to form the basketball category. The remaining categories are built in the same manner.
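The top/bottom selection described above can be sketched as follows. This is a minimal illustration only: the `label_category` function and the placeholder video ids are hypothetical, standing in for a list already ranked by Flickr's interestingness function.

```python
# Hypothetical sketch of the Flickr labeling scheme: given 400 videos for a
# query, already sorted by Flickr's interestingness ranking (best first),
# keep the top 10% as interesting and the bottom 10% as uninteresting,
# yielding 80 labeled videos per category.
def label_category(video_ids):
    """video_ids: ids sorted by Flickr interestingness, best first."""
    n = len(video_ids)                 # 400 in the setup described above
    k = n // 10                        # top/bottom 10% -> 40 each
    interesting = video_ids[:k]
    uninteresting = video_ids[-k:]
    return interesting, uninteresting

ids = [f"vid_{i:03d}" for i in range(400)]   # placeholder ids
pos, neg = label_category(ids)
print(len(pos), len(neg))  # 40 40 -> 80 labeled videos per category
```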


Collected by 15 ad queries on YouTube
10 human assessors (5 females, 5 males)
Compare video pairs

Annotation Interface

YouTube Dataset
General observation: videos with humorous stories, attractive background music, or better professional editing tend to be more interesting.
1. As for YouTube, we first download 420 ad videos covering a large number of products and services.
2. After collecting the YouTube data, we invited ten human assessors to complete the annotation. They were shown a pair of videos and asked to tell which one is more interesting. Here is the annotation interface. Note that the 10 assessors consist of 5 females and 5 males, to ensure the reliability of our annotations.
3. Through the building of these two datasets, we find that videos with funny stories, attractive music, and professional editing tend to be interesting!
Aim: compare two videos and tell which is more interesting.
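One simple way to turn the ten assessors' pairwise judgments into a single label is majority voting. This is a sketch under the assumption that each assessor casts a binary vote per pair; the talk does not specify the exact aggregation rule, and the tie-breaking here is arbitrary.

```python
# Sketch: aggregate 10 assessors' votes on a video pair by majority.
# votes[i] is True if assessor i judged video A more interesting than B.
def more_interesting(votes):
    assert len(votes) == 10        # 5 female + 5 male assessors per pair
    a_wins = sum(votes)
    # 5-5 ties fall through to "B" here, an arbitrary choice for the sketch.
    return "A" if a_wins > len(votes) - a_wins else "B"

print(more_interesting([True] * 7 + [False] * 3))  # A
```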

Visual features / Audio features / High-level attribute features → Multi-modal fusion → Ranking SVM → Results

Our Computational Framework
1. After the construction of our datasets, we are able to develop and evaluate computational models.
2. Since it is difficult to quantify the absolute degree of interestingness of a video, our model compares two videos and identifies which one is more interesting.
3. To this end, we first select a large number of features from different aspects, such as visual, audio, and semantic features, to describe the content of the videos.
4. Second, since many features are complementary, we fuse these multiple features and evaluate their performance.
5. Finally, we adopt Ranking SVM to train the prediction models. The model outputs the ranking order of an input video pair according to interestingness.
6. Let's come to the features first.

Features
Visual features: Color Histogram, SIFT, HOG, SSIM, GIST
Audio features: MFCC, Spectrogram SIFT, Audio-Six
High-level attribute features: Classemes and ObjectBank (e.g., flower, tree, cat, face), Style (e.g., Rule of Thirds, Vanishing Point, Soft Focus, Motion Blur, Shallow DOF)
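The pairwise setup can be approximated with a standard linear SVM trained on feature differences, a common reformulation of Ranking SVM for the linear case. This sketch uses scikit-learn and random stand-in features, not the actual descriptors or the kernelized model from the talk.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Stand-in features: 100 "videos" x 64 dims, with a hidden interestingness
# score defined by a random linear model (toy data only).
X = rng.normal(size=(100, 64))
w_true = rng.normal(size=64)
score = X @ w_true

# Pairwise transformation: for a pair (i, j), the training sample is
# x_i - x_j, and the label says whether video i is the more interesting one.
pairs = [(i, i + 1) for i in range(0, 100, 2)]
Xp = np.array([X[i] - X[j] for i, j in pairs])
yp = np.array([1 if score[i] > score[j] else -1 for i, j in pairs])

clf = LinearSVC().fit(Xp, yp)
print(clf.score(Xp, yp))  # fraction of training pairs ranked correctly
```

At test time, comparing two new videos amounts to checking the sign of the classifier's decision value on their feature difference.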

1. To best describe the videos, we study a large number of features. In the table, the first row lists the 5 visual features, the second the 3 audio features, and the last row the 3 high-level attribute features.
2. As time is limited, I will only introduce the 3 attribute features.
3. Style describes 14 photographic rules (like the Rule of Thirds and Vanishing Point). Take this photo for example: according to the Rule of Thirds, if the important object is placed along these lines or at their intersections, the photo tends to be more vivid and attractive. The Style feature has been demonstrated to be useful in predicting the interestingness of images.
4. Classemes & ObjectBank: these two predict the existence of semantic concepts and objects (flowers, trees, cats, ...).

Prediction
Ranking SVM trained on our dataset
Chi-square kernel for histogram-like features; RBF kernel for the others
2/3 for training and 1/3 for testing
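The two kernels, plus an equal-weight kernel-level average, might look like the sketch below. The exponential chi-square form and the `gamma` values are assumptions for illustration; the exact kernel definitions used in the work may differ.

```python
import numpy as np

def chi2_kernel(X, Y, gamma=1.0):
    """Exponential chi-square kernel: exp(-gamma * sum (x-y)^2 / (x+y)),
    a common choice for non-negative histogram features."""
    d = X[:, None, :] - Y[None, :, :]
    s = X[:, None, :] + Y[None, :, :] + 1e-10   # avoid division by zero
    return np.exp(-gamma * (d ** 2 / s).sum(axis=-1))

def rbf_kernel(X, Y, gamma=0.5):
    """Standard RBF kernel for the non-histogram features."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def fuse(kernels):
    """Kernel-level fusion with equal weights: average the kernel matrices."""
    return sum(kernels) / len(kernels)

hists = np.random.rand(4, 8)     # histogram-like feature (e.g., color histogram)
other = np.random.rand(4, 8)     # non-histogram feature (e.g., Style)
K = fuse([chi2_kernel(hists, hists), rbf_kernel(other, other)])
print(K.shape)  # (4, 4)
```

The fused matrix `K` would then be passed to a kernelized Ranking SVM as a precomputed kernel.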

Evaluation
Prediction accuracy: the percentage of correctly ranked test video pairs
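The pairwise accuracy metric is just the fraction of test pairs whose predicted order matches the ground-truth order. A sketch with made-up scores:

```python
# Pairwise accuracy: percentage of test video pairs ranked in the correct order.
def pair_accuracy(pred_scores, true_order_pairs):
    """true_order_pairs: (i, j) tuples meaning video i is more interesting than j."""
    correct = sum(pred_scores[i] > pred_scores[j] for i, j in true_order_pairs)
    return 100.0 * correct / len(true_order_pairs)

scores = {0: 0.9, 1: 0.2, 2: 0.5, 3: 0.7}       # hypothetical model outputs
pairs = [(0, 1), (2, 1), (3, 2), (1, 3)]        # last pair is mis-ranked
print(pair_accuracy(scores, pairs))  # 75.0
```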

Prediction & Evaluation
1. Next, I will mention some details about our models.
2. The model is trained on our dataset; we use 2/3 of the videos for training and 1/3 for testing.
3. When fusing features, we adopt kernel-level fusion with equal weights.
4. The evaluation metric is accuracy: the percentage of correctly ranked test video pairs.

Visual Feature Results [chart: prediction accuracies (%) on Flickr and YouTube for the 5 visual features]
Overall, the visual features achieve very impressive performance. Among the 5 features, SIFT and HOG perform best, while Color Histogram is the worst (68.1% on Flickr and 58.0% on YouTube). This may indicate that although color is very important in many human perception tasks, it is probably not an important clue in evaluating interestingness.

Audio Feature Results [chart: prediction accuracies (%) on Flickr and YouTube for the 3 audio features]
Then we experiment with the three audio-based features. All of the audio features are discriminative for this task and are also very complementary: the combination of all 3 performs best, with about 76% on Flickr. Thus we conclude that the audio channel conveys very useful information for human perception of interestingness.

Attribute Feature Results [chart: prediction accuracies (%) on Flickr and YouTube for the attribute features]
The last part is the attribute features. They do not work as well as we expected. We find that Classemes and ObjectBank are significantly better than the Style attributes. This is a very interesting observation, because in the prediction of image interestingness, Style was demonstrated to be effective; the video domain differs here from image interestingness prediction.

Visual+Audio+Attribute Results [chart: prediction accuracies (%) on Flickr and YouTube for Audio, Visual, Visual+Audio, and Visual+Audio+Attribute, with gains of 2.6% and 5.4% marked]

Conducted a pilot study on video interestingness

Built two datasets to support this study
Publicly available at: www.yugangjiang.info/research/interestingness

Evaluated a large number of features
Visual + audio features are very effective
A few features useful in image interestingness do not work in the video domain (e.g., Style attributes)

Summary
To summarize: the visual, audio, and attribute features are all useful in predicting interestingness. We think our main contributions are:
1. First, we conduct a pilot study on video interestingness.
2. Second, we construct new datasets. These are valuable resources and may help stimulate future research on this topic.
3. Finally, by evaluating a large number of features, we obtain very interesting observations: visual and audio features are very effective, while some features that are useful for image interestingness, like Style, are not effective in the video domain.

Visual, audio, and attribute features are all effective. Audio+visual improves accuracy; the features are complementary.

Thank you! Datasets are available at: www.yugangjiang.info/research/interestingness

Well, that's the brief introduction of our work.
