ARF @ MediaEval 2012: Multimodal Video Classification (presentation transcript)
~ Multimodal Video Classification ~

Bogdan IONESCU*(1,3), Ionuț MIRONICĂ(1), Klaus SEYERLEHNER(2), Peter KNEES(2), Jan SCHLÜTER(4), Markus SCHEDL(2), Horia CUCU(1), Andi BUZO(1), Patrick LAMBERT(3)

University POLITEHNICA of Bucharest
Austrian Research Institute for Artificial Intelligence

ARF (Austria-Romania-France) team

*this work was partially supported under European Structural Funds EXCEL POSDRU/89/1.5/S/62557.
Presentation outline

MediaEval - Pisa, Italy, 4-5 October 2012
• The approach
• Video content description
• Experimental results
• Conclusions and future work
The approach

> challenge: find a way to assign (genre) tags to unknown videos;
> approach: machine learning paradigm.

[diagram: a classifier is trained on a tagged video database (labeled data) and then used to label unknown videos (unlabeled data) with genre tags such as web, food, autos]
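The train/label paradigm above can be sketched in a few lines. This is a minimal illustration with hypothetical 2-D descriptors and a nearest-centroid classifier standing in for the actual classifiers and descriptors used in the paper:

```python
import numpy as np

# toy labeled data: 2-D descriptors for three genres (hypothetical values)
genres = ["web", "food", "autos"]
centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
rng = np.random.default_rng(0)
X_train = np.vstack([c + 0.3 * rng.normal(size=(20, 2)) for c in centers])
y_train = np.repeat(genres, 20)

# "train": one centroid per genre (nearest-centroid is a stand-in
# for the broad range of classifiers the paper actually evaluates)
centroids = {g: X_train[y_train == g].mean(axis=0) for g in genres}

def label(video_feature):
    """Assign a genre tag to an unknown video by nearest centroid."""
    return min(centroids, key=lambda g: np.linalg.norm(video_feature - centroids[g]))

tag = label(np.array([4.8, 5.2]))   # unknown video near the "food" cluster
```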
The approach: classification

> the entire process relies on the concept of "similarity" computed between content annotations (numeric features);
> this year's focus is on:
• objective 1: go (truly) multimodal - visual, audio, text;
• objective 2: test a broad range of classifiers and descriptor combinations.
Video content description - audio

[Klaus Seyerlehner et al., MIREX'11, USA]

block-level audio features (computed on overlapping blocks of frames, e.g. 50% overlap, and summarized per block, e.g. by average, median, variance; they also capture local temporal information):
• Spectral Pattern, ~ soundtrack's timbre;
• delta Spectral Pattern, ~ strength of onsets;
• variance delta Spectral Pattern, ~ variation of the onset strength;
• Logarithmic Fluctuation Pattern, ~ rhythmic aspects;
• Spectral Contrast Pattern, ~ "toneness";
• Correlation Pattern, ~ loudness changes;
• Local Single Gaussian model, ~ timbre;
• George Tzanetakis model, ~ timbre.
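The block-level idea (overlapping blocks of frames, each summarized by simple statistics so that local temporal structure survives) can be sketched as follows; block length, hop size, and the per-block statistics are illustrative choices, not the paper's exact parameters:

```python
import numpy as np

def block_summaries(feature_seq, block_len=8, hop=4):
    """Split a per-frame feature sequence into 50%-overlapping blocks
    (hop = block_len / 2) and summarize each block by mean, median,
    and variance, keeping local temporal information."""
    blocks = [feature_seq[i:i + block_len]
              for i in range(0, len(feature_seq) - block_len + 1, hop)]
    return np.array([[b.mean(), np.median(b), b.var()] for b in blocks])

frames = np.arange(32, dtype=float)   # toy per-frame feature values
summ = block_summaries(frames)        # one (mean, median, variance) row per block
```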
Video content description - audio

[B. Mathieu et al., Yaafe toolbox, ISMIR'10, Netherlands]

standard audio features (audio frame-based):
• Linear Predictive Coefficients,
• Line Spectral Pairs,
• Mel-Frequency Cepstral Coefficients,
• Zero-Crossing Rate,
• spectral centroid, flux, rolloff, and kurtosis,
+ variance of each feature over a certain window.

[diagram: the global feature is the mean & variance of each per-frame feature f1, f2, ..., fn over time]
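The frame-based scheme can be sketched with the simplest of the listed features, the Zero-Crossing Rate: compute it per frame, then pool the per-frame values into a global (mean, variance) descriptor. Frame length and the test signal are illustrative; the actual system uses the Yaafe toolbox:

```python
import numpy as np

def zero_crossing_rate(frame):
    # fraction of sign changes between consecutive samples
    return np.mean(np.signbit(frame[:-1]) != np.signbit(frame[1:]))

def global_feature(signal, frame_len=256):
    """Per-frame ZCR pooled into a global (mean, variance) descriptor."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    zcr = np.array([zero_crossing_rate(f) for f in frames])
    return np.array([zcr.mean(), zcr.var()])

t = np.linspace(0, 1, 4096, endpoint=False)
feat = global_feature(np.sin(2 * np.pi * 440 * t))   # 440 Hz test tone
```

For a steady tone the per-frame ZCR is nearly constant, so the pooled variance is close to zero while the mean reflects the pitch.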
Video content description - visual

[OpenCV toolbox, http://opencv.willowgarage.com]

MPEG-7 & color/texture descriptors (visual frame-based):
• Local Binary Pattern,
• Autocorrelogram,
• Color Coherence Vector,
• Color Layout Pattern,
• Edge Histogram,
• Scalable Color Descriptor,
• Classic color histogram,
• Color moments.

[diagram: the global feature aggregates the per-frame descriptors f1, f2, ..., fn over time via mean, dispersion, skewness, kurtosis, median, and root mean square]
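The temporal pooling named above (mean, dispersion, skewness, kurtosis, median, root mean square over the per-frame descriptors) can be sketched directly with numpy; the moment formulas are the standard ones, and the input shape (frames x descriptor dimensions) is an assumption:

```python
import numpy as np

def pool_frames(frame_features):
    """Aggregate per-frame descriptors (frames x dims) into one global
    feature: mean, dispersion, skewness, kurtosis, median, RMS per dim."""
    x = np.asarray(frame_features, dtype=float)
    mu = x.mean(axis=0)
    sd = x.std(axis=0)                       # dispersion
    z = (x - mu) / np.where(sd > 0, sd, 1.0)
    skew = (z ** 3).mean(axis=0)             # third standardized moment
    kurt = (z ** 4).mean(axis=0) - 3.0       # excess kurtosis
    med = np.median(x, axis=0)
    rms = np.sqrt((x ** 2).mean(axis=0))
    return np.concatenate([mu, sd, skew, kurt, med, rms])

# 100 frames of a hypothetical 4-D descriptor -> one 24-D global feature
g = pool_frames(np.random.default_rng(1).normal(size=(100, 4)))
```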
Video content description - visual

[OpenCV toolbox, http://opencv.willowgarage.com]

feature descriptors (visual frame-based):
• Histogram of Oriented Gradients (HoG) ~ counts occurrences of gradient orientations in localized portions of an image (20° per bin),
• Harris corner detector ~ feature points,
• Speeded Up Robust Features (SURF).

[image: feature points (e.g. Harris); source http://www.ifp.illinois.edu/~yuhuang]
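The HoG idea with 20° bins can be sketched as a single-cell orientation histogram (real HoG descriptors, e.g. OpenCV's, also use cells, block normalization, and interpolation, which this sketch omits):

```python
import numpy as np

def hog_histogram(image, bin_width_deg=20):
    """Magnitude-weighted histogram of gradient orientations over the
    whole image, one bin per 20 degrees (a single-cell HoG sketch)."""
    gy, gx = np.gradient(image.astype(float))   # row, column gradients
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 180  # unsigned orientation
    n_bins = 180 // bin_width_deg               # 9 bins at 20 deg each
    hist = np.zeros(n_bins)
    for b in range(n_bins):
        sel = (ang >= b * bin_width_deg) & (ang < (b + 1) * bin_width_deg)
        hist[b] = mag[sel].sum()
    return hist / (hist.sum() + 1e-12)          # normalize to sum to 1

img = np.tile(np.arange(16.0), (16, 1))  # horizontal ramp: all gradients at 0 deg
h = hog_histogram(img)
```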
Video content description - text

TF-IDF descriptors (Term Frequency - Inverse Document Frequency)

> text sources: ASR transcripts and metadata;
1. remove XML markup,
2. remove terms below the 5% percentile of the frequency distribution,
3. select the term corpus: retain for each genre class the m terms (e.g. m = 150 for ASR and m = 20 for metadata) with the highest χ2 values that occur more frequently than in the complement classes,
4. represent each document by its TF-IDF values.
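Step 4 can be sketched as plain TF-IDF over a tiny corpus (the percentile filtering of step 2 and the χ2 term selection of step 3 are omitted here; the documents are hypothetical):

```python
import numpy as np

def tf_idf(docs):
    """TF-IDF matrix (documents x vocabulary) for whitespace-tokenized docs."""
    vocab = sorted({w for d in docs for w in d.split()})
    idx = {w: i for i, w in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for r, d in enumerate(docs):
        for w in d.split():
            tf[r, idx[w]] += 1
        tf[r] /= len(d.split())                 # term frequency
    df = (tf > 0).sum(axis=0)                   # document frequency
    idf = np.log(len(docs) / df)                # inverse document frequency
    return tf * idf, vocab

docs = ["car engine road", "recipe food kitchen food", "car race"]
X, vocab = tf_idf(docs)
```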
Experimental results: devset (5,127 seq.)

[Weka toolbox, http://www.cs.waikato.ac.nz/ml/weka/]

> classifiers from Weka (Bayes, lazy, functional, trees, etc.),
> cross-validation (train 50% - test 50%), avg. F-score over all genres:
- visual descriptors achieve around 30%±10%,
- best combination: LBP + CCV + histogram (F-score = 41.2%),
- using more visual descriptors is not more accurate than using a few.
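The evaluation measure, average F-score over all genres, can be computed as follows (the per-class F-score is the usual harmonic mean of precision and recall; the toy labels are illustrative):

```python
import numpy as np

def avg_fscore(y_true, y_pred, classes):
    """Average F-score over all genres."""
    scores = []
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p = tp / (tp + fp) if tp + fp else 0.0   # precision
        r = tp / (tp + fn) if tp + fn else 0.0   # recall
        scores.append(2 * p * r / (p + r) if p + r else 0.0)
    return float(np.mean(scores))

y_true = np.array(["web", "web", "food", "food"])
y_pred = np.array(["web", "food", "food", "food"])
score = avg_fscore(y_true, y_pred, ["web", "food"])
```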
Experimental results: devset (5,127 seq.)

> cross-validation (train 50% - test 50%), avg. F-score over all genres:
- the proposed block-based audio features outperform the standard ones (by ~10%),
- audio is still better than visual (improvement of ~6%).
Experimental results: devset (5,127 seq.)

> cross-validation (train 50% - test 50%), avg. F-score over all genres:
- ASR from LIMSI is more representative than LIUM (~3%),
- best performance: ASR LIMSI + metadata (F-score = 68%).
Experimental results: devset (5,127 seq.)

> cross-validation (train 50% - test 50%), avg. F-score over all genres:
- audio-visual is close to text (ASR) for the automatic descriptors,
- increasing the number of modalities increases the performance.
Experimental results: official runs (9,550 seq.)

> train on devset, test on testset (linear SVM):
- Run 1: LBP + CCV + histogram + audio block-based,
- Run 2: TF-IDF on ASR LIMSI,
- Run 3: audio block-based + LBP + CCV + histogram + TF-IDF on ASR LIMSI,
- Run 4: audio block-based,
- Run 5: TF-IDF on metadata + ASR LIMSI.

[chart residue: for comparison, MediaEval 2011 results - MAP 10.3%, and MAP 12% with metadata]
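A run like Run 3 combines modalities by putting audio, visual, and text descriptors side by side and training one linear model. The sketch below uses hypothetical descriptors, simple feature concatenation, and a least-squares linear classifier as a stand-in for the linear SVM actually used:

```python
import numpy as np

rng = np.random.default_rng(2)
# hypothetical per-modality descriptors for the same 40 videos
audio  = rng.normal(size=(40, 6))
visual = rng.normal(size=(40, 8))
text   = rng.normal(size=(40, 5))
y = np.repeat([0, 1], 20)               # two toy genre labels
audio[y == 1] += 2.0                    # make the classes separable

X = np.hstack([audio, visual, text])    # Run 3-style concatenation
Xb = np.hstack([X, np.ones((40, 1))])   # add a bias column
# least-squares linear classifier on +/-1 targets (stand-in for linear SVM)
w = np.linalg.lstsq(Xb, 2.0 * y - 1.0, rcond=None)[0]
pred = (Xb @ w > 0).astype(int)
train_acc = (pred == y).mean()
```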
Experimental results: official runs (9,550 seq.)

> genre MAP for Run 5 (TF-IDF on ASR + metadata) and Run 1 (visual + audio),

[chart residue: best-scoring genres include religion 71%, gaming 71%, autos 52%, environment 50%]
Conclusions and future work

> classification adapts to the corpus - changing the corpus will change the performance;
> audio-visual descriptors are inherently limited;
> how far can we go with ad-hoc classification without human intervention?
> future work:
- pursue tests on the entire data set;
- more elaborate late fusion?
- perhaps more elaborate Bag-of-Visual-Words.

Acknowledgement: we would like to thank Prof. Fausto Giunchiglia and Prof. Nicu Sebe from the University of Trento for their support.
Thank you! Any questions?